public inbox for pbs-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Christian Ebner <c.ebner@proxmox.com>, pbs-devel@lists.proxmox.com
Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
Date: Fri, 05 Jun 2026 15:39:58 +0200	[thread overview]
Message-ID: <178066679899.993050.14588796821627975859@yuna.proxmox.com> (raw)
In-Reply-To: <20260604140919.97686-1-c.ebner@proxmox.com>

Quoting Christian Ebner (2026-06-04 16:09:19)
> Per-chunk file locks are located on a tmpfs but never cleaned up to
> avoid TOCTOU race conditions. Therefore lock files can accumulate
> over time, the memory required to store the inodes finally lead to
> OOM conditions if the system is not rebooted for a long time or a
> high number of different chunks is written to the s3 backed
> datastore.
> 
> To fix this, use the double stating strategy already implemented by
> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
> lock file is cleaned up before unlocking. Since after file removal
> the lock can be acquired by a different thread/process, the file lock
> must also be dropped immediately without performing any other
> critical operation. To assure this, a ChunkLockGuard is implemented
> which removes the file and drops the file descriptor by implementing
> the Drop trait.
> 
> After each flock() call, which is performed as part of
> proxmox_sys::fs::open_file_locked(), stating the file and comparing
> locked files inode from the open file handle to the one currently
> present on the filesystem is performed. By this possible races are
> detected, resulting in missing or newly create lock files. In that
> case locking must be retried.
> 
> Note that this locking mechanism is not fair, the first caller might
> end up being the last to actually acquire the lock. This is however
> not problematic for the intended use case of per-chunk file locking,
> with limited lock contention.
> 
> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> Sending this as RFC in case there are ideas for a different solution
> to the problem at hand which might be preferable.

to recap - when we originally introduced this, we checked for memory usage by
querying tmpfs file system usage. the 1kb per file overhead in slab was missed.

one solution to this would be to move the locks back to a regular filesystem,
and either

- clean them up when removing chunks (with corresponding retry logic
  similar to this patch, but less frequent?)

- clean them up by dropping the lock dir on reboot (e.g., via
  systemd-tmpfiles.d), which would be the equivalent of the current tmpfs
  approach but trading disk usage vs. memory usage, and probably slightly slower
  lock allocation

for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs slab
usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half the slab
usage), though in both cases we of course would need a hierarchy of prefix dirs
again to keep access times okayish? for 23M chunk locks we'd still end up with
(estimated) 5G of usage on ext4..

another approach would be to switch to a different kind of locking mechanism
entirely (though with the cross-process and multi-threaded requirements we
have, this might not be easy either ;)) or to reduce the lock granularity.

given that a mistake in the handling of retries below can cause dataloss, doing
it for every lock/unlock pair sounds a bit dangerous. there's also the
additional overhead if the lock is actually contended to account for - we need
at least two loop iterations if it is, potentially a lot more?

or we go back to square one, and revisit the whole interaction here to see if
we can get rid of the per-chunk locks again in favor of something that scales
with the number of pending uploads.. the last time we tried this turned out to
be quite tricky though.

> Currently the only reason for this to fail as far as I can see is if
> external tooling prematurely removes the lock file. The retry
> mechanism could further be limited to a fixed number of retries to
> protect against infinite loops. This should also be forwards
> compatible with the previous implementation, the previous one might
> hover fail if a still ongoing job tries to lock the per-chunk file,
> looses however to the new implementation getting there first, which
> will then remove the file and the subsequent stat of the old
> implementation after getting the lock will fail.
> Such a race is however rather unlikely to happen and would not lead
> to an inconsistent state.
> 
>  pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++---
>  pbs-datastore/src/datastore.rs   |  6 +--
>  2 files changed, 66 insertions(+), 9 deletions(-)
> 
> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
> index a936f5034..020d5ff2b 100644
> --- a/pbs-datastore/src/chunk_store.rs
> +++ b/pbs-datastore/src/chunk_store.rs
> @@ -9,7 +9,6 @@ use hex::FromHex;
>  use tracing::{info, warn};
>  
>  use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus};
> -use pbs_config::BackupLockGuard;
>  use proxmox_io::ReadExt;
>  use proxmox_s3_client::S3Client;
>  use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat};
> @@ -43,6 +42,32 @@ pub struct ChunkStore {
>      locker: Option<Arc<Mutex<ProcessLocker>>>,
>  }
>  
> +pub(crate) struct ChunkLockGuard {
> +    file: std::fs::File,
> +    path: PathBuf,
> +}
> +
> +impl AsRawFd for ChunkLockGuard {
> +    fn as_raw_fd(&self) -> i32 {
> +        self.file.as_raw_fd()
> +    }
> +}
> +
> +impl Drop for ChunkLockGuard {
> +    fn drop(&mut self) {
> +        // After unlink lock can be acquired, local lock is invalid
> +        if let Err(err) = nix::unistd::unlink(&self.path) {
> +            tracing::error!(
> +                "Failed to unlink chunk lock guard {:?}: {err}",
> +                self.path,
> +            );
> +        }
> +        // implicit drop of lock guard file descriptor which unblocks flock() waiters for it,
> +        // they can however not use it and must retry from scratch (this is checked by stat after
> +        // lock).
> +    }
> +}
> +
>  // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ?
>  
>  pub fn verify_chunk_size(size: usize) -> Result<(), Error> {
> @@ -963,12 +988,44 @@ impl ChunkStore {
>          &self,
>          digest: &[u8],
>          timeout: Duration,
> -    ) -> Result<BackupLockGuard, Error> {
> +    ) -> Result<ChunkLockGuard, Error> {
>          let lock_path = self.chunk_lock_path(digest);
> -        let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| {
> -            pbs_config::open_backup_lockfile(path, Some(timeout), true)
> -        })?;
> -        Ok(guard)
> +
> +        let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone());
> +
> +        if let Some(parent) = lock_path.parent() {
> +            lock_dir = lock_dir.join(parent);
> +        };
> +
> +        std::fs::create_dir_all(&lock_dir)?;
> +
> +        let user = pbs_config::backup_user()?;
> +        let options = proxmox_sys::fs::CreateOptions::new()
> +            .perm(nix::sys::stat::Mode::from_bits_truncate(0o660))
> +            .owner(user.uid)
> +            .group(user.gid);
> +
> +        loop {
> +            let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?;
> +
> +            // if stat fails here, file might have been tampered with from unrelated third party or
> +            // due to a bug, always fail with error.
> +            let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino;
> +
> +            match nix::sys::stat::stat(&lock_path) {
> +                Ok(stat) => if inode == stat.st_ino {
> +                    return Ok(ChunkLockGuard {
> +                        file,
> +                        path: lock_path.clone(),
> +                    });
> +                }
> +                Err(err) => if err != nix::errno::Errno::ENOENT {
> +                    bail!("failed to stat chunk lock file {lock_path:?}: {err}");
> +                }
> +            }
> +            // neither matching inode after lock, nor unrelated error on stat
> +            // spin retry
> +        }
>      }
>  
>      /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index e2d1ae67c..e930a3b95 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData;
>  use crate::backup_info::{
>      BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME,
>  };
> -use crate::chunk_store::ChunkStore;
> +use crate::chunk_store::{ChunkStore, ChunkLockGuard};
>  use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter};
>  use crate::fixed_index::{FixedIndexReader, FixedIndexWriter};
>  use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive};
> @@ -3497,7 +3497,7 @@ impl DataStore {
>  
>  /// Track S3 object keys to be deleted by garbage collection while holding their file lock.
>  struct S3DeleteList {
> -    list: Vec<(S3ObjectKey, BackupLockGuard)>,
> +    list: Vec<(S3ObjectKey, ChunkLockGuard)>,
>      first_entry_added: SystemTime,
>      age_threshold: Duration,
>      capacity_threshold: usize,
> @@ -3516,7 +3516,7 @@ impl S3DeleteList {
>  
>      /// Pushes the current key and backup lock guard to the list, updating the delete list age if
>      /// the list was empty before the insert.
> -    fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) {
> +    fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) {
>          // set age based on first insertion
>          if self.list.is_empty() {
>              self.first_entry_added = SystemTime::now();
> -- 
> 2.47.3
> 
> 
> 
> 
>




  parent reply	other threads:[~2026-06-05 13:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 14:09 [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup Christian Ebner
2026-06-05 12:11 ` Robert Obkircher
2026-06-05 12:31   ` Christian Ebner
2026-06-05 16:21     ` Robert Obkircher
2026-06-06  8:42       ` Christian Ebner
2026-06-05 13:39 ` Fabian Grünbichler [this message]
2026-06-05 15:06   ` Christian Ebner
2026-06-05 15:17     ` Christian Ebner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=178066679899.993050.14588796821627975859@yuna.proxmox.com \
    --to=f.gruenbichler@proxmox.com \
    --cc=c.ebner@proxmox.com \
    --cc=pbs-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal