From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id 07D461FF142 for ; Fri, 05 Jun 2026 14:11:58 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 56FE018209; Fri, 5 Jun 2026 14:11:57 +0200 (CEST) Message-ID: Date: Fri, 5 Jun 2026 14:11:52 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup To: Christian Ebner , pbs-devel@lists.proxmox.com References: <20260604140919.97686-1-c.ebner@proxmox.com> Content-Language: en-US, de-AT From: Robert Obkircher In-Reply-To: <20260604140919.97686-1-c.ebner@proxmox.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1780661472942 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.050 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [datastore.rs] Message-ID-Hash: DAZ4M2TBGIHONEOFCDBPUVPH337NDUT3 X-Message-ID-Hash: DAZ4M2TBGIHONEOFCDBPUVPH337NDUT3 X-MailFrom: r.obkircher@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox Backup Server development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 04.06.26 16:09, Christian Ebner wrote: > Per-chunk file locks are located on a tmpfs but never cleaned up to > avoid TOCTOU race conditions. Therefore lock files can accumulate > over time, the memory required to store the inodes finally lead to > OOM conditions if the system is not rebooted for a long time or a > high number of different chunks is written to the s3 backed > datastore. > > To fix this, use the double stating strategy already implemented by > pbs_datastore::backup_info::lock_helper(), but adapt it so that the > lock file is cleaned up before unlocking. Since after file removal > the lock can be acquired by a different thread/process, the file lock > must also be dropped immediately without performing any other > critical operation. To assure this, a ChunkLockGuard is implemented > which removes the file and drops the file descriptor by implementing > the Drop trait. One potential problem is that drop is not guaranteed to run if the process aborts, so this could still slowly leak files over time. > > After each flock() call, which is performed as part of > proxmox_sys::fs::open_file_locked(), stating the file and comparing > locked files inode from the open file handle to the one currently > present on the filesystem is performed. By this possible races are > detected, resulting in missing or newly create lock files. In that > case locking must be retried. > > Note that this locking mechanism is not fair, the first caller might > end up being the last to actually acquire the lock. This is however > not problematic for the intended use case of per-chunk file locking, > with limited lock contention. > > Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670 > Signed-off-by: Christian Ebner > --- > Sending this as RFC in case there are ideas for a different solution > to the problem at hand which might be preferable. > > Currently the only reason for this to fail as far as I can see is if > external tooling prematurely removes the lock file. The retry > mechanism could further be limited to a fixed number of retries to > protect against infinite loops. This should also be forwards > compatible with the previous implementation, the previous one might > hover fail if a still ongoing job tries to lock the per-chunk file, > looses however to the new implementation getting there first, which > will then remove the file and the subsequent stat of the old > implementation after getting the lock will fail. > Such a race is however rather unlikely to happen and would not lead > to an inconsistent state. > > pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++--- > pbs-datastore/src/datastore.rs | 6 +-- > 2 files changed, 66 insertions(+), 9 deletions(-) > > diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs > index a936f5034..020d5ff2b 100644 > --- a/pbs-datastore/src/chunk_store.rs > +++ b/pbs-datastore/src/chunk_store.rs > @@ -9,7 +9,6 @@ use hex::FromHex; > use tracing::{info, warn}; > > use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus}; > -use pbs_config::BackupLockGuard; > use proxmox_io::ReadExt; > use proxmox_s3_client::S3Client; > use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_from_file_stat}; > @@ -43,6 +42,32 @@ pub struct ChunkStore { > locker: Option>>, > } > > +pub(crate) struct ChunkLockGuard { > + file: std::fs::File, > + path: PathBuf, > +} > + > +impl AsRawFd for ChunkLockGuard { > + fn as_raw_fd(&self) -> i32 { > + self.file.as_raw_fd() > + } > +} Is there an actual reason to expose this? > + > +impl Drop for ChunkLockGuard { > + fn drop(&mut self) { > + // After unlink lock can be acquired, local lock is invalid > + if let Err(err) = nix::unistd::unlink(&self.path) { > + tracing::error!( > + "Failed to unlink chunk lock guard {:?}: {err}", > + self.path, > + ); > + } > + // implicit drop of lock guard file descriptor which unblocks flock() waiters for it, > + // they can however not use it and must retry from scratch (this is checked by stat after > + // lock). > + } > +} > + > // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ? > > pub fn verify_chunk_size(size: usize) -> Result<(), Error> { > @@ -963,12 +988,44 @@ impl ChunkStore { > &self, > digest: &[u8], > timeout: Duration, > - ) -> Result { > + ) -> Result { > let lock_path = self.chunk_lock_path(digest); > - let guard = crate::backup_info::lock_helper(self.name(), &lock_path, |path| { > - pbs_config::open_backup_lockfile(path, Some(timeout), true) > - })?; > - Ok(guard) > + > + let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(self.name.clone()); unnecessary clone > + > + if let Some(parent) = lock_path.parent() { > + lock_dir = lock_dir.join(parent); This only works because parent is absolute so lock_dir is ignored. If it was relative we would potentially create the wrong directory. I think we could just create_dir_all(parent) here. > + }; > + > + std::fs::create_dir_all(&lock_dir)?; > + > + let user = pbs_config::backup_user()?; > + let options = proxmox_sys::fs::CreateOptions::new() > + .perm(nix::sys::stat::Mode::from_bits_truncate(0o660)) > + .owner(user.uid) > + .group(user.gid); > + > + loop { > + let file = proxmox_sys::fs::open_file_locked(&lock_path, timeout, true, options)?; This needs something like timeout.saturating_sub(start.elapsed()) > + > + // if stat fails here, file might have been tampered with from unrelated third party or > + // due to a bug, always fail with error. > + let inode = nix::sys::stat::fstat(file.as_raw_fd())?.st_ino; > + > + match nix::sys::stat::stat(&lock_path) { > + Ok(stat) => if inode == stat.st_ino { What if the inode number was reused? Maybe also compare st_dev and st_ctime? > + return Ok(ChunkLockGuard { > + file, > + path: lock_path.clone(), unnecessary clone > + }); > + } > + Err(err) => if err != nix::errno::Errno::ENOENT { > + bail!("failed to stat chunk lock file {lock_path:?}: {err}"); > + } > + } > + // neither matching inode after lock, nor unrelated error on stat > + // spin retry > + } > } > > /// Generate the next bad chunk file path for given digest. Returns the path as well as the bad > diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs > index e2d1ae67c..e930a3b95 100644 > --- a/pbs-datastore/src/datastore.rs > +++ b/pbs-datastore/src/datastore.rs > @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData; > use crate::backup_info::{ > BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FILENAME, > }; > -use crate::chunk_store::ChunkStore; > +use crate::chunk_store::{ChunkStore, ChunkLockGuard}; > use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter}; > use crate::fixed_index::{FixedIndexReader, FixedIndexWriter}; > use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListNamespacesRecursive}; > @@ -3497,7 +3497,7 @@ impl DataStore { > > /// Track S3 object keys to be deleted by garbage collection while holding their file lock. > struct S3DeleteList { > - list: Vec<(S3ObjectKey, BackupLockGuard)>, > + list: Vec<(S3ObjectKey, ChunkLockGuard)>, > first_entry_added: SystemTime, > age_threshold: Duration, > capacity_threshold: usize, > @@ -3516,7 +3516,7 @@ impl S3DeleteList { > > /// Pushes the current key and backup lock guard to the list, updating the delete list age if > /// the list was empty before the insert. > - fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) { > + fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) { > // set age based on first insertion > if self.list.is_empty() { > self.first_entry_added = SystemTime::now();