[RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup

public inbox for pbs-devel@lists.proxmox.com
 help / color / mirror / Atom feed

* [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
@ 2026-06-04 14:09 Christian Ebner
  2026-06-05 12:11 ` Robert Obkircher
  2026-06-05 13:39 ` Fabian Grünbichler
  0 siblings, 2 replies; 8+ messages in thread
From: Christian Ebner @ 2026-06-04 14:09 UTC (permalink / raw)
  To: pbs-devel

Per-chunk file locks are located on a tmpfs but never cleaned up to
avoid TOCTOU race conditions. Therefore lock files can accumulate
over time, the memory required to store the inodes finally lead to
OOM conditions if the system is not rebooted for a long time or a
high number of different chunks is written to the s3 backed
datastore.

To fix this, use the double stating strategy already implemented by
pbs_datastore::backup_info::lock_helper(), but adapt it so that the
lock file is cleaned up before unlocking. Since after file removal
the lock can be acquired by a different thread/process, the file lock
must also be dropped immediately without performing any other
critical operation. To assure this, a ChunkLockGuard is implemented
which removes the file and drops the file descriptor by implementing
the Drop trait.

After each flock() call, which is performed as part of
proxmox_sys::fs::open_file_locked(), stating the file and comparing
locked files inode from the open file handle to the one currently
present on the filesystem is performed. By this possible races are
detected, resulting in missing or newly create lock files. In that
case locking must be retried.

Note that this locking mechanism is not fair, the first caller might
end up being the last to actually acquire the lock. This is however
not problematic for the intended use case of per-chunk file locking,
with limited lock contention.

Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Sending this as RFC in case there are ideas for a different solution
to the problem at hand which might be preferable.

Currently the only reason for this to fail as far as I can see is if
external tooling prematurely removes the lock file. The retry
mechanism could further be limited to a fixed number of retries to
protect against infinite loops. This should also be forwards
compatible with the previous implementation, the previous one might
hover fail if a still ongoing job tries to lock the per-chunk file,
looses however to the new implementation getting there first, which
will then remove the file and the subsequent stat of the old
implementation after getting the lock will fail.
Such a race is however rather unlikely to happen and would not lead
to an inconsistent state.

 pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++---
 pbs-datastore/src/datastore.rs   |  6 +--
 2 files changed, 66 insertions(+), 9 deletions(-)

diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
index a936f5034..020d5ff2b 100644
--- a/pbs-datastore/src/chunk_store.rs
+++ b/pbs-datastore/src/chunk_store.rs
@@ -9,7 +9,6 @@ use hex::FromHex;
 use tracing::{info, warn};
 
 use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus};
-use pbs_config::BackupLockGuard;
 use proxmox_io::ReadExt;
 use proxmox_s3_client::S3Client;
 use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat};
@@ -43,6 +42,32 @@ pub struct ChunkStore {
     locker: Option<Arc<Mutex<ProcessLocker>>>,
 }
 
+pub(crate) struct ChunkLockGuard {
+    file: std::fs::File,
+    path: PathBuf,
+}
+
+impl AsRawFd for ChunkLockGuard {
+    fn as_raw_fd(&self) -> i32 {
+        self.file.as_raw_fd()
+    }
+}
+
+impl Drop for ChunkLockGuard {
+    fn drop(&mut self) {
+        // After unlink lock can be acquired, local lock is invalid
+        if let Err(err) = nix::unistd::unlink(&self.path) {
+            tracing::error!(
+                "Failed to unlink chunk lock guard {:?}: {err}",
+                self.path,
+            );
+        }
+        // implicit drop of lock guard file descriptor which unblocks flock() waiters for it,
+        // they can however not use it and must retry from scratch (this is checked by stat after
+        // lock).
+    }
+}
+
 // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ?
 
 pub fn verify_chunk_size(size: usize) -> Result<(), Error> {
@@ -963,12 +988,44 @@ impl ChunkStore {
         &self,
         digest: &[u8],
         timeout: Duration,
-    ) -> Result<BackupLockGuard, Error> {
+    ) -> Result<ChunkLockGuard, Error> {
         let lock_path = self.chunk_lock_path(digest);
-        let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| {
-            pbs_config::open_backup_lockfile(path, Some(timeout), true)
-        })?;
-        Ok(guard)
+
+        let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone());
+
+        if let Some(parent) = lock_path.parent() {
+            lock_dir = lock_dir.join(parent);
+        };
+
+        std::fs::create_dir_all(&lock_dir)?;
+
+        let user = pbs_config::backup_user()?;
+        let options = proxmox_sys::fs::CreateOptions::new()
+            .perm(nix::sys::stat::Mode::from_bits_truncate(0o660))
+            .owner(user.uid)
+            .group(user.gid);
+
+        loop {
+            let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?;
+
+            // if stat fails here, file might have been tampered with from unrelated third party or
+            // due to a bug, always fail with error.
+            let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino;
+
+            match nix::sys::stat::stat(&lock_path) {
+                Ok(stat) => if inode == stat.st_ino {
+                    return Ok(ChunkLockGuard {
+                        file,
+                        path: lock_path.clone(),
+                    });
+                }
+                Err(err) => if err != nix::errno::Errno::ENOENT {
+                    bail!("failed to stat chunk lock file {lock_path:?}: {err}");
+                }
+            }
+            // neither matching inode after lock, nor unrelated error on stat
+            // spin retry
+        }
     }
 
     /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index e2d1ae67c..e930a3b95 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData;
 use crate::backup_info::{
     BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME,
 };
-use crate::chunk_store::ChunkStore;
+use crate::chunk_store::{ChunkStore, ChunkLockGuard};
 use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter};
 use crate::fixed_index::{FixedIndexReader, FixedIndexWriter};
 use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive};
@@ -3497,7 +3497,7 @@ impl DataStore {
 
 /// Track S3 object keys to be deleted by garbage collection while holding their file lock.
 struct S3DeleteList {
-    list: Vec<(S3ObjectKey, BackupLockGuard)>,
+    list: Vec<(S3ObjectKey, ChunkLockGuard)>,
     first_entry_added: SystemTime,
     age_threshold: Duration,
     capacity_threshold: usize,
@@ -3516,7 +3516,7 @@ impl S3DeleteList {
 
     /// Pushes the current key and backup lock guard to the list, updating the delete list age if
     /// the list was empty before the insert.
-    fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) {
+    fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) {
         // set age based on first insertion
         if self.list.is_empty() {
             self.first_entry_added = SystemTime::now();
-- 
2.47.3





^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-04 14:09 [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup Christian Ebner
@ 2026-06-05 12:11 ` Robert Obkircher
  2026-06-05 12:31   ` Christian Ebner
  2026-06-05 13:39 ` Fabian Grünbichler
  1 sibling, 1 reply; 8+ messages in thread
From: Robert Obkircher @ 2026-06-05 12:11 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel


On 04.06.26 16:09, Christian Ebner wrote:
> Per-chunk file locks are located on a tmpfs but never cleaned up to
> avoid TOCTOU race conditions. Therefore lock files can accumulate
> over time, the memory required to store the inodes finally lead to
> OOM conditions if the system is not rebooted for a long time or a
> high number of different chunks is written to the s3 backed
> datastore.
>
> To fix this, use the double stating strategy already implemented by
> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
> lock file is cleaned up before unlocking. Since after file removal
> the lock can be acquired by a different thread/process, the file lock
> must also be dropped immediately without performing any other
> critical operation. To assure this, a ChunkLockGuard is implemented
> which removes the file and drops the file descriptor by implementing
> the Drop trait.
One potential problem is that drop is not guaranteed to run if the
process aborts, so this could still slowly leak files over time.
>
> After each flock() call, which is performed as part of
> proxmox_sys::fs::open_file_locked(), stating the file and comparing
> locked files inode from the open file handle to the one currently
> present on the filesystem is performed. By this possible races are
> detected, resulting in missing or newly create lock files. In that
> case locking must be retried.
>
> Note that this locking mechanism is not fair, the first caller might
> end up being the last to actually acquire the lock. This is however
> not problematic for the intended use case of per-chunk file locking,
> with limited lock contention.
>
> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> Sending this as RFC in case there are ideas for a different solution
> to the problem at hand which might be preferable.
>
> Currently the only reason for this to fail as far as I can see is if
> external tooling prematurely removes the lock file. The retry
> mechanism could further be limited to a fixed number of retries to
> protect against infinite loops. This should also be forwards
> compatible with the previous implementation, the previous one might
> hover fail if a still ongoing job tries to lock the per-chunk file,
> looses however to the new implementation getting there first, which
> will then remove the file and the subsequent stat of the old
> implementation after getting the lock will fail.
> Such a race is however rather unlikely to happen and would not lead
> to an inconsistent state.
>
>  pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++---
>  pbs-datastore/src/datastore.rs   |  6 +--
>  2 files changed, 66 insertions(+), 9 deletions(-)
>
> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
> index a936f5034..020d5ff2b 100644
> --- a/pbs-datastore/src/chunk_store.rs
> +++ b/pbs-datastore/src/chunk_store.rs
> @@ -9,7 +9,6 @@ use hex::FromHex;
>  use tracing::{info, warn};
>  
>  use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus};
> -use pbs_config::BackupLockGuard;
>  use proxmox_io::ReadExt;
>  use proxmox_s3_client::S3Client;
>  use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat};
> @@ -43,6 +42,32 @@ pub struct ChunkStore {
>      locker: Option<Arc<Mutex<ProcessLocker>>>,
>  }
>  
> +pub(crate) struct ChunkLockGuard {
> +    file: std::fs::File,
> +    path: PathBuf,
> +}
> +
> +impl AsRawFd for ChunkLockGuard {
> +    fn as_raw_fd(&self) -> i32 {
> +        self.file.as_raw_fd()
> +    }
> +}
Is there an actual reason to expose this?
> +
> +impl Drop for ChunkLockGuard {
> +    fn drop(&mut self) {
> +        // After unlink lock can be acquired, local lock is invalid
> +        if let Err(err) = nix::unistd::unlink(&self.path) {
> +            tracing::error!(
> +                "Failed to unlink chunk lock guard {:?}: {err}",
> +                self.path,
> +            );
> +        }
> +        // implicit drop of lock guard file descriptor which unblocks flock() waiters for it,
> +        // they can however not use it and must retry from scratch (this is checked by stat after
> +        // lock).
> +    }
> +}
> +
>  // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ?
>  
>  pub fn verify_chunk_size(size: usize) -> Result<(), Error> {
> @@ -963,12 +988,44 @@ impl ChunkStore {
>          &self,
>          digest: &[u8],
>          timeout: Duration,
> -    ) -> Result<BackupLockGuard, Error> {
> +    ) -> Result<ChunkLockGuard, Error> {
>          let lock_path = self.chunk_lock_path(digest);
> -        let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| {
> -            pbs_config::open_backup_lockfile(path, Some(timeout), true)
> -        })?;
> -        Ok(guard)
> +
> +        let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone());
unnecessary clone
> +
> +        if let Some(parent) = lock_path.parent() {
> +            lock_dir = lock_dir.join(parent);
This only works because parent is absolute so lock_dir is ignored. If
it was relative we would potentially create the wrong directory.

I think we could just create_dir_all(parent) here.
> +        };
> +
> +        std::fs::create_dir_all(&lock_dir)?;
> +
> +        let user = pbs_config::backup_user()?;
> +        let options = proxmox_sys::fs::CreateOptions::new()
> +            .perm(nix::sys::stat::Mode::from_bits_truncate(0o660))
> +            .owner(user.uid)
> +            .group(user.gid);
> +
> +        loop {
> +            let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?;
This needs something like timeout.saturating_sub(start.elapsed())
> +
> +            // if stat fails here, file might have been tampered with from unrelated third party or
> +            // due to a bug, always fail with error.
> +            let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino;
> +
> +            match nix::sys::stat::stat(&lock_path) {
> +                Ok(stat) => if inode == stat.st_ino {
What if the inode number was reused?

Maybe also compare st_dev and st_ctime?
> +                    return Ok(ChunkLockGuard {
> +                        file,
> +                        path: lock_path.clone(),
unnecessary clone
> +                    });
> +                }
> +                Err(err) => if err != nix::errno::Errno::ENOENT {
> +                    bail!("failed to stat chunk lock file {lock_path:?}: {err}");
> +                }
> +            }
> +            // neither matching inode after lock, nor unrelated error on stat
> +            // spin retry
> +        }
>      }
>  
>      /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index e2d1ae67c..e930a3b95 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData;
>  use crate::backup_info::{
>      BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME,
>  };
> -use crate::chunk_store::ChunkStore;
> +use crate::chunk_store::{ChunkStore, ChunkLockGuard};
>  use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter};
>  use crate::fixed_index::{FixedIndexReader, FixedIndexWriter};
>  use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive};
> @@ -3497,7 +3497,7 @@ impl DataStore {
>  
>  /// Track S3 object keys to be deleted by garbage collection while holding their file lock.
>  struct S3DeleteList {
> -    list: Vec<(S3ObjectKey, BackupLockGuard)>,
> +    list: Vec<(S3ObjectKey, ChunkLockGuard)>,
>      first_entry_added: SystemTime,
>      age_threshold: Duration,
>      capacity_threshold: usize,
> @@ -3516,7 +3516,7 @@ impl S3DeleteList {
>  
>      /// Pushes the current key and backup lock guard to the list, updating the delete list age if
>      /// the list was empty before the insert.
> -    fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) {
> +    fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) {
>          // set age based on first insertion
>          if self.list.is_empty() {
>              self.first_entry_added = SystemTime::now();




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-05 12:11 ` Robert Obkircher
@ 2026-06-05 12:31   ` Christian Ebner
  2026-06-05 16:21     ` Robert Obkircher
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Ebner @ 2026-06-05 12:31 UTC (permalink / raw)
  To: Robert Obkircher, pbs-devel

On 6/5/26 2:11 PM, Robert Obkircher wrote:
> 
> On 04.06.26 16:09, Christian Ebner wrote:
>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>> over time, the memory required to store the inodes finally lead to
>> OOM conditions if the system is not rebooted for a long time or a
>> high number of different chunks is written to the s3 backed
>> datastore.
>>
>> To fix this, use the double stating strategy already implemented by
>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>> lock file is cleaned up before unlocking. Since after file removal
>> the lock can be acquired by a different thread/process, the file lock
>> must also be dropped immediately without performing any other
>> critical operation. To assure this, a ChunkLockGuard is implemented
>> which removes the file and drops the file descriptor by implementing
>> the Drop trait.
> One potential problem is that drop is not guaranteed to run if the
> process aborts, so this could still slowly leak files over time.

Good catch, maybe such files could be cleaned up by spawning a cleanup 
task triggered by chunk store instantiation. It could then iterate all 
the lock direntries and perform the same lock -> cleanup lock logic but 
without timeout, removing all lockfiles not currently being held at the 
cost of some IO for that, unfortunately. OTOH the number of lockfiles 
should be not to dramatic for that case, so IO should not be that 
dramatic and could be controlled by delays.

Other ideas?

>>
>> After each flock() call, which is performed as part of
>> proxmox_sys::fs::open_file_locked(), stating the file and comparing
>> locked files inode from the open file handle to the one currently
>> present on the filesystem is performed. By this possible races are
>> detected, resulting in missing or newly create lock files. In that
>> case locking must be retried.
>>
>> Note that this locking mechanism is not fair, the first caller might
>> end up being the last to actually acquire the lock. This is however
>> not problematic for the intended use case of per-chunk file locking,
>> with limited lock contention.
>>
>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> Sending this as RFC in case there are ideas for a different solution
>> to the problem at hand which might be preferable.
>>
>> Currently the only reason for this to fail as far as I can see is if
>> external tooling prematurely removes the lock file. The retry
>> mechanism could further be limited to a fixed number of retries to
>> protect against infinite loops. This should also be forwards
>> compatible with the previous implementation, the previous one might
>> hover fail if a still ongoing job tries to lock the per-chunk file,
>> looses however to the new implementation getting there first, which
>> will then remove the file and the subsequent stat of the old
>> implementation after getting the lock will fail.
>> Such a race is however rather unlikely to happen and would not lead
>> to an inconsistent state.
>>
>>   pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++---
>>   pbs-datastore/src/datastore.rs   |  6 +--
>>   2 files changed, 66 insertions(+), 9 deletions(-)
>>
>> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
>> index a936f5034..020d5ff2b 100644
>> --- a/pbs-datastore/src/chunk_store.rs
>> +++ b/pbs-datastore/src/chunk_store.rs
>> @@ -9,7 +9,6 @@ use hex::FromHex;
>>   use tracing::{info, warn};
>>   
>>   use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus};
>> -use pbs_config::BackupLockGuard;
>>   use proxmox_io::ReadExt;
>>   use proxmox_s3_client::S3Client;
>>   use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat};
>> @@ -43,6 +42,32 @@ pub struct ChunkStore {
>>       locker: Option<Arc<Mutex<ProcessLocker>>>,
>>   }
>>   
>> +pub(crate) struct ChunkLockGuard {
>> +    file: std::fs::File,
>> +    path: PathBuf,
>> +}
>> +
>> +impl AsRawFd for ChunkLockGuard {
>> +    fn as_raw_fd(&self) -> i32 {
>> +        self.file.as_raw_fd()
>> +    }
>> +}
> Is there an actual reason to expose this?

Actually no, leftover from initial prototyping which I forgot to drop, 
will remove for a new version.

>> +
>> +impl Drop for ChunkLockGuard {
>> +    fn drop(&mut self) {
>> +        // After unlink lock can be acquired, local lock is invalid
>> +        if let Err(err) = nix::unistd::unlink(&self.path) {
>> +            tracing::error!(
>> +                "Failed to unlink chunk lock guard {:?}: {err}",
>> +                self.path,
>> +            );
>> +        }
>> +        // implicit drop of lock guard file descriptor which unblocks flock() waiters for it,
>> +        // they can however not use it and must retry from scratch (this is checked by stat after
>> +        // lock).
>> +    }
>> +}
>> +
>>   // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ?
>>   
>>   pub fn verify_chunk_size(size: usize) -> Result<(), Error> {
>> @@ -963,12 +988,44 @@ impl ChunkStore {
>>           &self,
>>           digest: &[u8],
>>           timeout: Duration,
>> -    ) -> Result<BackupLockGuard, Error> {
>> +    ) -> Result<ChunkLockGuard, Error> {
>>           let lock_path = self.chunk_lock_path(digest);
>> -        let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| {
>> -            pbs_config::open_backup_lockfile(path, Some(timeout), true)
>> -        })?;
>> -        Ok(guard)
>> +
>> +        let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone());
> unnecessary clone
>> +
>> +        if let Some(parent) = lock_path.parent() {
>> +            lock_dir = lock_dir.join(parent);
> This only works because parent is absolute so lock_dir is ignored. If
> it was relative we would potentially create the wrong directory.
> 
> I think we could just create_dir_all(parent) here.

Ack, can optimize this as well, thanks!

>> +        };
>> +
>> +        std::fs::create_dir_all(&lock_dir)?;
>> +
>> +        let user = pbs_config::backup_user()?;
>> +        let options = proxmox_sys::fs::CreateOptions::new()
>> +            .perm(nix::sys::stat::Mode::from_bits_truncate(0o660))
>> +            .owner(user.uid)
>> +            .group(user.gid);
>> +
>> +        loop {
>> +            let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?;
> This needs something like timeout.saturating_sub(start.elapsed())

True, should honor the requested timeout independent from number of 
retries, will fix for a new version.

>> +
>> +            // if stat fails here, file might have been tampered with from unrelated third party or
>> +            // due to a bug, always fail with error.
>> +            let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino;
>> +
>> +            match nix::sys::stat::stat(&lock_path) {
>> +                Ok(stat) => if inode == stat.st_ino {
> What if the inode number was reused?
> 
> Maybe also compare st_dev and st_ctime?

Good point! Given that, it might be worth to do the same for the 
pre-existing locking helper as well, as this is not done there as well. 
Will include a patch for that and adapt the checks here.

>> +                    return Ok(ChunkLockGuard {
>> +                        file,
>> +                        path: lock_path.clone(),
> unnecessary clone
>> +                    });
>> +                }
>> +                Err(err) => if err != nix::errno::Errno::ENOENT {
>> +                    bail!("failed to stat chunk lock file {lock_path:?}: {err}");
>> +                }
>> +            }
>> +            // neither matching inode after lock, nor unrelated error on stat
>> +            // spin retry
>> +        }
>>       }
>>   
>>       /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index e2d1ae67c..e930a3b95 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData;
>>   use crate::backup_info::{
>>       BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME,
>>   };
>> -use crate::chunk_store::ChunkStore;
>> +use crate::chunk_store::{ChunkStore, ChunkLockGuard};
>>   use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter};
>>   use crate::fixed_index::{FixedIndexReader, FixedIndexWriter};
>>   use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive};
>> @@ -3497,7 +3497,7 @@ impl DataStore {
>>   
>>   /// Track S3 object keys to be deleted by garbage collection while holding their file lock.
>>   struct S3DeleteList {
>> -    list: Vec<(S3ObjectKey, BackupLockGuard)>,
>> +    list: Vec<(S3ObjectKey, ChunkLockGuard)>,
>>       first_entry_added: SystemTime,
>>       age_threshold: Duration,
>>       capacity_threshold: usize,
>> @@ -3516,7 +3516,7 @@ impl S3DeleteList {
>>   
>>       /// Pushes the current key and backup lock guard to the list, updating the delete list age if
>>       /// the list was empty before the insert.
>> -    fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) {
>> +    fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) {
>>           // set age based on first insertion
>>           if self.list.is_empty() {
>>               self.first_entry_added = SystemTime::now();





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-05 12:31   ` Christian Ebner
@ 2026-06-05 16:21     ` Robert Obkircher
  2026-06-06  8:42       ` Christian Ebner
  0 siblings, 1 reply; 8+ messages in thread
From: Robert Obkircher @ 2026-06-05 16:21 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel


On 05.06.26 14:31, Christian Ebner wrote:
> On 6/5/26 2:11 PM, Robert Obkircher wrote:
>>
>> On 04.06.26 16:09, Christian Ebner wrote:
>>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>>> over time, the memory required to store the inodes finally lead to
>>> OOM conditions if the system is not rebooted for a long time or a
>>> high number of different chunks is written to the s3 backed
>>> datastore.
>>>
>>> To fix this, use the double stating strategy already implemented by
>>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>>> lock file is cleaned up before unlocking. Since after file removal
>>> the lock can be acquired by a different thread/process, the file lock
>>> must also be dropped immediately without performing any other
>>> critical operation. To assure this, a ChunkLockGuard is implemented
>>> which removes the file and drops the file descriptor by implementing
>>> the Drop trait.
>> One potential problem is that drop is not guaranteed to run if the
>> process aborts, so this could still slowly leak files over time.
>
> Good catch, maybe such files could be cleaned up by spawning a
> cleanup task triggered by chunk store instantiation. It could then
> iterate all the lock direntries and perform the same lock -> cleanup
> lock logic but without timeout, removing all lockfiles not currently
> being held at the cost of some IO for that, unfortunately. OTOH the
> number of lockfiles should be not to dramatic for that case, so IO
> should not be that dramatic and could be controlled by delays.
>
> Other ideas?
> er ideas? 

With a regular cleanup task we could also get rid of the unlink on
drop, which would probably make repeated lock acquisition a bit cheaper.

But I'm not sure if it's worth the complexity.

>>> [...]
>>> ---
>>> Sending this as RFC in case there are ideas for a different solution
>>> to the problem at hand which might be preferable. 

Besides shared memory or a coordinator process, we could also just store a list of locked files in the active operations file.

Flock on the actual chunk file should work as well. We only need to coordinate on the local machine, so flock should be fine even if it is a remote file system, no?


Do we actually need the locks at all? Wouldn't it be sufficient if we
prevent duplicate S3 uploads on a per-process basis?

> [..]





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-05 16:21     ` Robert Obkircher
@ 2026-06-06  8:42       ` Christian Ebner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Ebner @ 2026-06-06  8:42 UTC (permalink / raw)
  To: Robert Obkircher, pbs-devel

On 6/5/26 6:20 PM, Robert Obkircher wrote:
> 
> On 05.06.26 14:31, Christian Ebner wrote:
>> On 6/5/26 2:11 PM, Robert Obkircher wrote:
>>>
>>> On 04.06.26 16:09, Christian Ebner wrote:
>>>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>>>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>>>> over time, the memory required to store the inodes finally lead to
>>>> OOM conditions if the system is not rebooted for a long time or a
>>>> high number of different chunks is written to the s3 backed
>>>> datastore.
>>>>
>>>> To fix this, use the double stating strategy already implemented by
>>>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>>>> lock file is cleaned up before unlocking. Since after file removal
>>>> the lock can be acquired by a different thread/process, the file lock
>>>> must also be dropped immediately without performing any other
>>>> critical operation. To assure this, a ChunkLockGuard is implemented
>>>> which removes the file and drops the file descriptor by implementing
>>>> the Drop trait.
>>> One potential problem is that drop is not guaranteed to run if the
>>> process aborts, so this could still slowly leak files over time.
>>
>> Good catch, maybe such files could be cleaned up by spawning a
>> cleanup task triggered by chunk store instantiation. It could then
>> iterate all the lock direntries and perform the same lock -> cleanup
>> lock logic but without timeout, removing all lockfiles not currently
>> being held at the cost of some IO for that, unfortunately. OTOH the
>> number of lockfiles should be not to dramatic for that case, so IO
>> should not be that dramatic and could be controlled by delays.
>>
>> Other ideas?
>> er ideas?
> 
> With a regular cleanup task we could also get rid of the unlink on
> drop, which would probably make repeated lock acquisition a bit cheaper.
> 
> But I'm not sure if it's worth the complexity.

Cleanup could be done as part of GC phase 3 on s3 backed datastores 
though, by that it should not introduce to much complexity. But probably 
not the best way forwards, agreed.

> 
>>>> [...]
>>>> ---
>>>> Sending this as RFC in case there are ideas for a different solution
>>>> to the problem at hand which might be preferable.
> 
> Besides shared memory or a coordinator process, we could also just store a list of locked files in the active operations file.

I do not think that will scale very well neither? Not performance wise 
nor usage wise, after all this would need to keep the list for all 
active operations on each chunk file up-to-date?

> Flock on the actual chunk file should work as well. We only need to coordinate on the local machine, so flock should be fine even if it is a remote file system, no?

There are some caveats with that IIRC, especially with respect to 
latency and proper support from e.g. older NFS implementations. Further, 
the existence of a chunk file is currently used as marker for the chunk 
to being present on the S3 backend (at least one successful and 
acknowledge upload). For this to work that would need to be reworked as 
well and is probably not backwards compatible.

> Do we actually need the locks at all? Wouldn't it be sufficient if we
> prevent duplicate S3 uploads on a per-process basis?

The locking must protect against TOCTOU races, not only with upload but 
also chunk insertion in the local cache, GC, renaming of bad chunks for 
failed verifies, ...

The initial implementation did not perform the locking but relied on 
atomic operations for local marker files only, causing quite some churn 
due to races and resulting in the current implementation [1,2].

[1] 
https://lore.proxmox.com/pbs-devel/20251114131901.441650-1-c.ebner@proxmox.com/
[2] 
https://lore.proxmox.com/pbs-devel/20251120100523.12147-1-c.ebner@proxmox.com/





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-04 14:09 [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup Christian Ebner
  2026-06-05 12:11 ` Robert Obkircher
@ 2026-06-05 13:39 ` Fabian Grünbichler
  2026-06-05 15:06   ` Christian Ebner
  1 sibling, 1 reply; 8+ messages in thread
From: Fabian Grünbichler @ 2026-06-05 13:39 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2026-06-04 16:09:19)
> Per-chunk file locks are located on a tmpfs but never cleaned up to
> avoid TOCTOU race conditions. Therefore lock files can accumulate
> over time, the memory required to store the inodes finally lead to
> OOM conditions if the system is not rebooted for a long time or a
> high number of different chunks is written to the s3 backed
> datastore.
> 
> To fix this, use the double stating strategy already implemented by
> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
> lock file is cleaned up before unlocking. Since after file removal
> the lock can be acquired by a different thread/process, the file lock
> must also be dropped immediately without performing any other
> critical operation. To assure this, a ChunkLockGuard is implemented
> which removes the file and drops the file descriptor by implementing
> the Drop trait.
> 
> After each flock() call, which is performed as part of
> proxmox_sys::fs::open_file_locked(), stating the file and comparing
> locked files inode from the open file handle to the one currently
> present on the filesystem is performed. By this possible races are
> detected, resulting in missing or newly create lock files. In that
> case locking must be retried.
> 
> Note that this locking mechanism is not fair, the first caller might
> end up being the last to actually acquire the lock. This is however
> not problematic for the intended use case of per-chunk file locking,
> with limited lock contention.
> 
> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> Sending this as RFC in case there are ideas for a different solution
> to the problem at hand which might be preferable.

to recap - when we originally introduced this, we checked for memory usage by
querying tmpfs file system usage. the 1kb per file overhead in slab was missed.

one solution to this would be to move the locks back to a regular filesystem,
and either

- clean them up when removing chunks (with corresponding retry logic
  similar to this patch, but less frequent?)

- clean them up by dropping the lock dir on reboot (e.g., via
  systemd-tmpfiles.d), which would be the equivalent of the current tmpfs
  approach but trading disk usage vs. memory usage, and probably slightly slower
  lock allocation

for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs slab
usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half the slab
usage), though in both cases we of course would need a hierarchy of prefix dirs
again to keep access times okayish? for 23M chunk locks we'd still end up with
(estimated) 5G of usage on ext4..

another approach would be to switch to a different kind of locking mechanism
entirely (though with the cross-process and multi-threaded requirements we
have, this might not be easy either ;)) or to reduce the lock granularity.

given that a mistake in the handling of retries below can cause dataloss, doing
it for every lock/unlock pair sounds a bit dangerous. there's also the
additional overhead if the lock is actually contended to account for - we need
at least two loop iterations if it is, potentially a lot more?

or we go back to square one, and revisit the whole interaction here to see if
we can get rid of the per-chunk locks again in favor of something that scales
with the number of pending uploads.. the last time we tried this turned out to
be quite tricky though.

> Currently the only reason for this to fail as far as I can see is if
> external tooling prematurely removes the lock file. The retry
> mechanism could further be limited to a fixed number of retries to
> protect against infinite loops. This should also be forwards
> compatible with the previous implementation, the previous one might
> hover fail if a still ongoing job tries to lock the per-chunk file,
> looses however to the new implementation getting there first, which
> will then remove the file and the subsequent stat of the old
> implementation after getting the lock will fail.
> Such a race is however rather unlikely to happen and would not lead
> to an inconsistent state.
> 
>  pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++---
>  pbs-datastore/src/datastore.rs   |  6 +--
>  2 files changed, 66 insertions(+), 9 deletions(-)
> 
> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs
> index a936f5034..020d5ff2b 100644
> --- a/pbs-datastore/src/chunk_store.rs
> +++ b/pbs-datastore/src/chunk_store.rs
> @@ -9,7 +9,6 @@ use hex::FromHex;
>  use tracing::{info, warn};
>  
>  use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus};
> -use pbs_config::BackupLockGuard;
>  use proxmox_io::ReadExt;
>  use proxmox_s3_client::S3Client;
>  use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat};
> @@ -43,6 +42,32 @@ pub struct ChunkStore {
>      locker: Option<Arc<Mutex<ProcessLocker>>>,
>  }
>  
> +pub(crate) struct ChunkLockGuard {
> +    file: std::fs::File,
> +    path: PathBuf,
> +}
> +
> +impl AsRawFd for ChunkLockGuard {
> +    fn as_raw_fd(&self) -> i32 {
> +        self.file.as_raw_fd()
> +    }
> +}
> +
> +impl Drop for ChunkLockGuard {
> +    fn drop(&mut self) {
> +        // After unlink lock can be acquired, local lock is invalid
> +        if let Err(err) = nix::unistd::unlink(&self.path) {
> +            tracing::error!(
> +                "Failed to unlink chunk lock guard {:?}: {err}",
> +                self.path,
> +            );
> +        }
> +        // implicit drop of lock guard file descriptor which unblocks flock() waiters for it,
> +        // they can however not use it and must retry from scratch (this is checked by stat after
> +        // lock).
> +    }
> +}
> +
>  // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ?
>  
>  pub fn verify_chunk_size(size: usize) -> Result<(), Error> {
> @@ -963,12 +988,44 @@ impl ChunkStore {
>          &self,
>          digest: &[u8],
>          timeout: Duration,
> -    ) -> Result<BackupLockGuard, Error> {
> +    ) -> Result<ChunkLockGuard, Error> {
>          let lock_path = self.chunk_lock_path(digest);
> -        let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| {
> -            pbs_config::open_backup_lockfile(path, Some(timeout), true)
> -        })?;
> -        Ok(guard)
> +
> +        let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone());
> +
> +        if let Some(parent) = lock_path.parent() {
> +            lock_dir = lock_dir.join(parent);
> +        };
> +
> +        std::fs::create_dir_all(&lock_dir)?;
> +
> +        let user = pbs_config::backup_user()?;
> +        let options = proxmox_sys::fs::CreateOptions::new()
> +            .perm(nix::sys::stat::Mode::from_bits_truncate(0o660))
> +            .owner(user.uid)
> +            .group(user.gid);
> +
> +        loop {
> +            let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?;
> +
> +            // if stat fails here, file might have been tampered with from unrelated third party or
> +            // due to a bug, always fail with error.
> +            let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino;
> +
> +            match nix::sys::stat::stat(&lock_path) {
> +                Ok(stat) => if inode == stat.st_ino {
> +                    return Ok(ChunkLockGuard {
> +                        file,
> +                        path: lock_path.clone(),
> +                    });
> +                }
> +                Err(err) => if err != nix::errno::Errno::ENOENT {
> +                    bail!("failed to stat chunk lock file {lock_path:?}: {err}");
> +                }
> +            }
> +            // neither matching inode after lock, nor unrelated error on stat
> +            // spin retry
> +        }
>      }
>  
>      /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index e2d1ae67c..e930a3b95 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData;
>  use crate::backup_info::{
>      BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME,
>  };
> -use crate::chunk_store::ChunkStore;
> +use crate::chunk_store::{ChunkStore, ChunkLockGuard};
>  use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter};
>  use crate::fixed_index::{FixedIndexReader, FixedIndexWriter};
>  use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive};
> @@ -3497,7 +3497,7 @@ impl DataStore {
>  
>  /// Track S3 object keys to be deleted by garbage collection while holding their file lock.
>  struct S3DeleteList {
> -    list: Vec<(S3ObjectKey, BackupLockGuard)>,
> +    list: Vec<(S3ObjectKey, ChunkLockGuard)>,
>      first_entry_added: SystemTime,
>      age_threshold: Duration,
>      capacity_threshold: usize,
> @@ -3516,7 +3516,7 @@ impl S3DeleteList {
>  
>      /// Pushes the current key and backup lock guard to the list, updating the delete list age if
>      /// the list was empty before the insert.
> -    fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) {
> +    fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) {
>          // set age based on first insertion
>          if self.list.is_empty() {
>              self.first_entry_added = SystemTime::now();
> -- 
> 2.47.3
> 
> 
> 
> 
>




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-05 13:39 ` Fabian Grünbichler
@ 2026-06-05 15:06   ` Christian Ebner
  2026-06-05 15:17     ` Christian Ebner
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Ebner @ 2026-06-05 15:06 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 6/5/26 3:39 PM, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2026-06-04 16:09:19)
>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>> over time, the memory required to store the inodes finally lead to
>> OOM conditions if the system is not rebooted for a long time or a
>> high number of different chunks is written to the s3 backed
>> datastore.
>>
>> To fix this, use the double stating strategy already implemented by
>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>> lock file is cleaned up before unlocking. Since after file removal
>> the lock can be acquired by a different thread/process, the file lock
>> must also be dropped immediately without performing any other
>> critical operation. To assure this, a ChunkLockGuard is implemented
>> which removes the file and drops the file descriptor by implementing
>> the Drop trait.
>>
>> After each flock() call, which is performed as part of
>> proxmox_sys::fs::open_file_locked(), stating the file and comparing
>> locked files inode from the open file handle to the one currently
>> present on the filesystem is performed. By this possible races are
>> detected, resulting in missing or newly create lock files. In that
>> case locking must be retried.
>>
>> Note that this locking mechanism is not fair, the first caller might
>> end up being the last to actually acquire the lock. This is however
>> not problematic for the intended use case of per-chunk file locking,
>> with limited lock contention.
>>
>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> Sending this as RFC in case there are ideas for a different solution
>> to the problem at hand which might be preferable.
> 
> to recap - when we originally introduced this, we checked for memory usage by
> querying tmpfs file system usage. the 1kb per file overhead in slab was missed.
> 
> one solution to this would be to move the locks back to a regular filesystem,
> and either
> 
> - clean them up when removing chunks (with corresponding retry logic
>    similar to this patch, but less frequent?)

 From the two options for this solutions, this one sounds preferable. 
Ideally the locking would be performed on the chunk file itself... But 
that has issues with non-local filesystems AFAIR, so not an option 
unfortunately.

But probably we want to go with moving to a filesystem backed lock-files 
AND reducing the granularity as suggested below.

> - clean them up by dropping the lock dir on reboot (e.g., via
>    systemd-tmpfiles.d), which would be the equivalent of the current tmpfs
>    approach but trading disk usage vs. memory usage, and probably slightly slower
>    lock allocation
> 
> for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs slab
> usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half the slab
> usage), though in both cases we of course would need a hierarchy of prefix dirs
> again to keep access times okayish? for 23M chunk locks we'd still end up with
> (estimated) 5G of usage on ext4..
> 
> another approach would be to switch to a different kind of locking mechanism
> entirely (though with the cross-process and multi-threaded requirements we
> have, this might not be easy either ;)) or to reduce the lock granularity.

Yeah, reducing the lock granularity and moving to something with 
deterministic count could be a way out, and given the usage estimates 
from above, using the 4 hex-digit prefix used also for chunk store 
directory hierarchy might be a viable candidate for defining such 
lock-files.
> given that a mistake in the handling of retries below can cause dataloss, doing
> it for every lock/unlock pair sounds a bit dangerous. there's also the
> additional overhead if the lock is actually contended to account for - we need
> at least two loop iterations if it is, potentially a lot more?

This is however not so frequently happening though? But yes, there is 
overhead by the then unavoidable retry/retries.

> or we go back to square one, and revisit the whole interaction here to see if
> we can get rid of the per-chunk locks again in favor of something that scales
> with the number of pending uploads.. the last time we tried this turned out to
> be quite tricky though.
The major downside for all solutions is unfortunately that moving 
forward can only happen with the current locking mechanism still in 
place, so a transitional step for cleanup unavoidable. But at least it 
is fine to remove the per-chunk lock files once they are being held in 
the process instance implementing the new locking logic.

Thanks for your inputs!




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup
  2026-06-05 15:06   ` Christian Ebner
@ 2026-06-05 15:17     ` Christian Ebner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Ebner @ 2026-06-05 15:17 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 6/5/26 5:06 PM, Christian Ebner wrote:
> On 6/5/26 3:39 PM, Fabian Grünbichler wrote:
>> Quoting Christian Ebner (2026-06-04 16:09:19)
>>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>>> over time, the memory required to store the inodes finally lead to
>>> OOM conditions if the system is not rebooted for a long time or a
>>> high number of different chunks is written to the s3 backed
>>> datastore.
>>>
>>> To fix this, use the double stating strategy already implemented by
>>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>>> lock file is cleaned up before unlocking. Since after file removal
>>> the lock can be acquired by a different thread/process, the file lock
>>> must also be dropped immediately without performing any other
>>> critical operation. To assure this, a ChunkLockGuard is implemented
>>> which removes the file and drops the file descriptor by implementing
>>> the Drop trait.
>>>
>>> After each flock() call, which is performed as part of
>>> proxmox_sys::fs::open_file_locked(), stating the file and comparing
>>> locked files inode from the open file handle to the one currently
>>> present on the filesystem is performed. By this possible races are
>>> detected, resulting in missing or newly create lock files. In that
>>> case locking must be retried.
>>>
>>> Note that this locking mechanism is not fair, the first caller might
>>> end up being the last to actually acquire the lock. This is however
>>> not problematic for the intended use case of per-chunk file locking,
>>> with limited lock contention.
>>>
>>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> Sending this as RFC in case there are ideas for a different solution
>>> to the problem at hand which might be preferable.
>>
>> to recap - when we originally introduced this, we checked for memory 
>> usage by
>> querying tmpfs file system usage. the 1kb per file overhead in slab 
>> was missed.
>>
>> one solution to this would be to move the locks back to a regular 
>> filesystem,
>> and either
>>
>> - clean them up when removing chunks (with corresponding retry logic
>>    similar to this patch, but less frequent?)
> 
>  From the two options for this solutions, this one sounds preferable. 
> Ideally the locking would be performed on the chunk file itself... But 
> that has issues with non-local filesystems AFAIR, so not an option 
> unfortunately.
> 
> But probably we want to go with moving to a filesystem backed lock-files 
> AND reducing the granularity as suggested below.
> 
>> - clean them up by dropping the lock dir on reboot (e.g., via
>>    systemd-tmpfiles.d), which would be the equivalent of the current 
>> tmpfs
>>    approach but trading disk usage vs. memory usage, and probably 
>> slightly slower
>>    lock allocation
>>
>> for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs 
>> slab
>> usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half 
>> the slab
>> usage), though in both cases we of course would need a hierarchy of 
>> prefix dirs
>> again to keep access times okayish? for 23M chunk locks we'd still end 
>> up with
>> (estimated) 5G of usage on ext4..
>>
>> another approach would be to switch to a different kind of locking 
>> mechanism
>> entirely (though with the cross-process and multi-threaded 
>> requirements we
>> have, this might not be easy either ;)) or to reduce the lock 
>> granularity.
> 
> Yeah, reducing the lock granularity and moving to something with 
> deterministic count could be a way out, and given the usage estimates 
> from above, using the 4 hex-digit prefix used also for chunk store 
> directory hierarchy might be a viable candidate for defining such lock- 
> files.
>> given that a mistake in the handling of retries below can cause 
>> dataloss, doing
>> it for every lock/unlock pair sounds a bit dangerous. there's also the
>> additional overhead if the lock is actually contended to account for - 
>> we need
>> at least two loop iterations if it is, potentially a lot more?
> 
> This is however not so frequently happening though? But yes, there is 
> overhead by the then unavoidable retry/retries.
> 
>> or we go back to square one, and revisit the whole interaction here to 
>> see if
>> we can get rid of the per-chunk locks again in favor of something that 
>> scales
>> with the number of pending uploads.. the last time we tried this 
>> turned out to
>> be quite tricky though.
> The major downside for all solutions is unfortunately that moving 
> forward can only happen with the current locking mechanism still in 
> place, so a transitional step for cleanup unavoidable. But at least it 
> is fine to remove the per-chunk lock files once they are being held in 
> the process instance implementing the new locking logic.
> 
> Thanks for your inputs!

Oh, and another thing to consider as well: The backup snapshot 
lock-files are also not cleaned up AFAIKT? While way less problematic, 
they will also accumulate and add unneeded memory overhead on systems 
with high frequency backups which are not rebooted frequently.

There removing the lock-file on prune and retry for vanished lock-files 
after flock calls might be acceptable? Also performance wise.





^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-06  8:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-04 14:09 [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup Christian Ebner
2026-06-05 12:11 ` Robert Obkircher
2026-06-05 12:31   ` Christian Ebner
2026-06-05 16:21     ` Robert Obkircher
2026-06-06  8:42       ` Christian Ebner
2026-06-05 13:39 ` Fabian Grünbichler
2026-06-05 15:06   ` Christian Ebner
2026-06-05 15:17     ` Christian Ebner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal