From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id 61FE01FF142 for ; Fri, 05 Jun 2026 14:32:17 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id ADC9B189D3; Fri, 5 Jun 2026 14:32:16 +0200 (CEST) Message-ID: <706c6feb-8f7c-4f07-af09-c0df91f95469@proxmox.com> Date: Fri, 5 Jun 2026 14:31:40 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup To: Robert Obkircher , pbs-devel@lists.proxmox.com References: <20260604140919.97686-1-c.ebner@proxmox.com> Content-Language: en-US, de-DE From: Christian Ebner In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1780662661226 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.069 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: KUDOKFIZVE45PN3HNANH2MEHYAKW5VTH X-Message-ID-Hash: KUDOKFIZVE45PN3HNANH2MEHYAKW5VTH X-MailFrom: c.ebner@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox Backup Server development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 6/5/26 2:11 PM, Robert Obkircher wrote: > > On 04.06.26 16:09, Christian Ebner wrote: >> Per-chunk file locks are located on a tmpfs but never cleaned up to >> avoid TOCTOU race conditions. Therefore lock files can accumulate >> over time, the memory required to store the inodes finally lead to >> OOM conditions if the system is not rebooted for a long time or a >> high number of different chunks is written to the s3 backed >> datastore. >> >> To fix this, use the double stating strategy already implemented by >> pbs_datastore::backup_info::lock_helper(), but adapt it so that the >> lock file is cleaned up before unlocking. Since after file removal >> the lock can be acquired by a different thread/process, the file lock >> must also be dropped immediately without performing any other >> critical operation. To assure this, a ChunkLockGuard is implemented >> which removes the file and drops the file descriptor by implementing >> the Drop trait. > One potential problem is that drop is not guaranteed to run if the > process aborts, so this could still slowly leak files over time. Good catch, maybe such files could be cleaned up by spawning a cleanup task triggered by chunk store instantiation. It could then iterate all the lock direntries and perform the same lock -> cleanup lock logic but without timeout, removing all lockfiles not currently being held at the cost of some IO for that, unfortunately. OTOH the number of lockfiles should be not to dramatic for that case, so IO should not be that dramatic and could be controlled by delays. Other ideas? >> >> After each flock() call, which is performed as part of >> proxmox_sys::fs::open_file_locked(), stating the file and comparing >> locked files inode from the open file handle to the one currently >> present on the filesystem is performed. By this possible races are >> detected, resulting in missing or newly create lock files. In that >> case locking must be retried. >> >> Note that this locking mechanism is not fair, the first caller might >> end up being the last to actually acquire the lock. This is however >> not problematic for the intended use case of per-chunk file locking, >> with limited lock contention. >> >> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670 >> Signed-off-by: Christian Ebner >> --- >> Sending this as RFC in case there are ideas for a different solution >> to the problem at hand which might be preferable. >> >> Currently the only reason for this to fail as far as I can see is if >> external tooling prematurely removes the lock file. The retry >> mechanism could further be limited to a fixed number of retries to >> protect against infinite loops. This should also be forwards >> compatible with the previous implementation, the previous one might >> hover fail if a still ongoing job tries to lock the per-chunk file, >> looses however to the new implementation getting there first, which >> will then remove the file and the subsequent stat of the old >> implementation after getting the lock will fail. >> Such a race is however rather unlikely to happen and would not lead >> to an inconsistent state. >> >> pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++--- >> pbs-datastore/src/datastore.rs | 6 +-- >> 2 files changed, 66 insertions(+), 9 deletions(-) >> >> diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs >> index a936f5034..020d5ff2b 100644 >> --- a/pbs-datastore/src/chunk_store.rs >> +++ b/pbs-datastore/src/chunk_store.rs >> @@ -9,7 +9,6 @@ use hex::FromHex; >> use tracing::{info, warn}; >> >> use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus}; >> -use pbs_config::BackupLockGuard; >> use proxmox_io::ReadExt; >> use proxmox_s3_client::S3Client; >> use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat}; >> @@ -43,6 +42,32 @@ pub struct ChunkStore { >> locker: Option>>, >> } >> >> +pub(crate) struct ChunkLockGuard { >> + file: std::fs::File, >> + path: PathBuf, >> +} >> + >> +impl AsRawFd for ChunkLockGuard { >> + fn as_raw_fd(&self) -> i32 { >> + self.file.as_raw_fd() >> + } >> +} > Is there an actual reason to expose this? Actually no, leftover from initial prototyping which I forgot to drop, will remove for a new version. >> + >> +impl Drop for ChunkLockGuard { >> + fn drop(&mut self) { >> + // After unlink lock can be acquired, local lock is invalid >> + if let Err(err) = nix::unistd::unlink(&self.path) { >> + tracing::error!( >> + "Failed to unlink chunk lock guard {:?}: {err}", >> + self.path, >> + ); >> + } >> + // implicit drop of lock guard file descriptor which unblocks flock() waiters for it, >> + // they can however not use it and must retry from scratch (this is checked by stat after >> + // lock). >> + } >> +} >> + >> // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ? >> >> pub fn verify_chunk_size(size: usize) -> Result<(), Error> { >> @@ -963,12 +988,44 @@ impl ChunkStore { >> &self, >> digest: &[u8], >> timeout: Duration, >> - ) -> Result { >> + ) -> Result { >> let lock_path = self.chunk_lock_path(digest); >> - let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| { >> - pbs_config::open_backup_lockfile(path, Some(timeout), true) >> - })?; >> - Ok(guard) >> + >> + let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone()); > unnecessary clone >> + >> + if let Some(parent) = lock_path.parent() { >> + lock_dir = lock_dir.join(parent); > This only works because parent is absolute so lock_dir is ignored. If > it was relative we would potentially create the wrong directory. > > I think we could just create_dir_all(parent) here. Ack, can optimize this as well, thanks! >> + }; >> + >> + std::fs::create_dir_all(&lock_dir)?; >> + >> + let user = pbs_config::backup_user()?; >> + let options = proxmox_sys::fs::CreateOptions::new() >> + .perm(nix::sys::stat::Mode::from_bits_truncate(0o660)) >> + .owner(user.uid) >> + .group(user.gid); >> + >> + loop { >> + let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?; > This needs something like timeout.saturating_sub(start.elapsed()) True, should honor the requested timeout independent from number of retries, will fix for a new version. >> + >> + // if stat fails here, file might have been tampered with from unrelated third party or >> + // due to a bug, always fail with error. >> + let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino; >> + >> + match nix::sys::stat::stat(&lock_path) { >> + Ok(stat) => if inode == stat.st_ino { > What if the inode number was reused? > > Maybe also compare st_dev and st_ctime? Good point! Given that, it might be worth to do the same for the pre-existing locking helper as well, as this is not done there as well. Will include a patch for that and adapt the checks here. >> + return Ok(ChunkLockGuard { >> + file, >> + path: lock_path.clone(), > unnecessary clone >> + }); >> + } >> + Err(err) => if err != nix::errno::Errno::ENOENT { >> + bail!("failed to stat chunk lock file {lock_path:?}: {err}"); >> + } >> + } >> + // neither matching inode after lock, nor unrelated error on stat >> + // spin retry >> + } >> } >> >> /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad >> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs >> index e2d1ae67c..e930a3b95 100644 >> --- a/pbs-datastore/src/datastore.rs >> +++ b/pbs-datastore/src/datastore.rs >> @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData; >> use crate::backup_info::{ >> BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME, >> }; >> -use crate::chunk_store::ChunkStore; >> +use crate::chunk_store::{ChunkStore, ChunkLockGuard}; >> use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter}; >> use crate::fixed_index::{FixedIndexReader, FixedIndexWriter}; >> use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive}; >> @@ -3497,7 +3497,7 @@ impl DataStore { >> >> /// Track S3 object keys to be deleted by garbage collection while holding their file lock. >> struct S3DeleteList { >> - list: Vec<(S3ObjectKey, BackupLockGuard)>, >> + list: Vec<(S3ObjectKey, ChunkLockGuard)>, >> first_entry_added: SystemTime, >> age_threshold: Duration, >> capacity_threshold: usize, >> @@ -3516,7 +3516,7 @@ impl S3DeleteList { >> >> /// Pushes the current key and backup lock guard to the list, updating the delete list age if >> /// the list was empty before the insert. >> - fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) { >> + fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) { >> // set age based on first insertion >> if self.list.is_empty() { >> self.first_entry_added = SystemTime::now();