From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) by lore.proxmox.com (Postfix) with ESMTPS id 847971FF142 for ; Fri, 05 Jun 2026 15:40:48 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 92ADF1AC1E; Fri, 5 Jun 2026 15:40:47 +0200 (CEST) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable In-Reply-To: <20260604140919.97686-1-c.ebner@proxmox.com> References: <20260604140919.97686-1-c.ebner@proxmox.com> Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup From: Fabian =?utf-8?q?Gr=C3=BCnbichler?= To: Christian Ebner , pbs-devel@lists.proxmox.com Date: Fri, 05 Jun 2026 15:39:58 +0200 Message-ID: <178066679899.993050.14588796821627975859@yuna.proxmox.com> User-Agent: alot/0.0.0 X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1780666771967 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.054 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: QH6INJQXOCDRQC5LRM7MGINWPOVHETNS X-Message-ID-Hash: QH6INJQXOCDRQC5LRM7MGINWPOVHETNS X-MailFrom: f.gruenbichler@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox Backup Server development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Quoting Christian Ebner (2026-06-04 16:09:19) > Per-chunk file locks are located on a tmpfs but never cleaned up to > avoid TOCTOU race conditions. Therefore lock files can accumulate > over time, the memory required to store the inodes finally lead to > OOM conditions if the system is not rebooted for a long time or a > high number of different chunks is written to the s3 backed > datastore. >=20 > To fix this, use the double stating strategy already implemented by > pbs_datastore::backup_info::lock_helper(), but adapt it so that the > lock file is cleaned up before unlocking. Since after file removal > the lock can be acquired by a different thread/process, the file lock > must also be dropped immediately without performing any other > critical operation. To assure this, a ChunkLockGuard is implemented > which removes the file and drops the file descriptor by implementing > the Drop trait. >=20 > After each flock() call, which is performed as part of > proxmox_sys::fs::open_file_locked(), stating the file and comparing > locked files inode from the open file handle to the one currently > present on the filesystem is performed. By this possible races are > detected, resulting in missing or newly create lock files. In that > case locking must be retried. >=20 > Note that this locking mechanism is not fair, the first caller might > end up being the last to actually acquire the lock. This is however > not problematic for the intended use case of per-chunk file locking, > with limited lock contention. >=20 > Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=3D7670 > Signed-off-by: Christian Ebner > --- > Sending this as RFC in case there are ideas for a different solution > to the problem at hand which might be preferable. to recap - when we originally introduced this, we checked for memory usage = by querying tmpfs file system usage. the 1kb per file overhead in slab was mis= sed. one solution to this would be to move the locks back to a regular filesyste= m, and either - clean them up when removing chunks (with corresponding retry logic similar to this patch, but less frequent?) - clean them up by dropping the lock dir on reboot (e.g., via systemd-tmpfiles.d), which would be the equivalent of the current tmpfs approach but trading disk usage vs. memory usage, and probably slightly s= lower lock allocation for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs slab usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half the slab usage), though in both cases we of course would need a hierarchy of prefix = dirs again to keep access times okayish? for 23M chunk locks we'd still end up w= ith (estimated) 5G of usage on ext4.. another approach would be to switch to a different kind of locking mechanism entirely (though with the cross-process and multi-threaded requirements we have, this might not be easy either ;)) or to reduce the lock granularity. given that a mistake in the handling of retries below can cause dataloss, d= oing it for every lock/unlock pair sounds a bit dangerous. there's also the additional overhead if the lock is actually contended to account for - we n= eed at least two loop iterations if it is, potentially a lot more? or we go back to square one, and revisit the whole interaction here to see = if we can get rid of the per-chunk locks again in favor of something that scal= es with the number of pending uploads.. the last time we tried this turned out= to be quite tricky though. > Currently the only reason for this to fail as far as I can see is if > external tooling prematurely removes the lock file. The retry > mechanism could further be limited to a fixed number of retries to > protect against infinite loops. This should also be forwards > compatible with the previous implementation, the previous one might > hover fail if a still ongoing job tries to lock the per-chunk file, > looses however to the new implementation getting there first, which > will then remove the file and the subsequent stat of the old > implementation after getting the lock will fail. > Such a race is however rather unlikely to happen and would not lead > to an inconsistent state. >=20 > pbs-datastore/src/chunk_store.rs | 69 +++++++++++++++++++++++++++++--- > pbs-datastore/src/datastore.rs | 6 +-- > 2 files changed, 66 insertions(+), 9 deletions(-) >=20 > diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_s= tore.rs > index a936f5034..020d5ff2b 100644 > --- a/pbs-datastore/src/chunk_store.rs > +++ b/pbs-datastore/src/chunk_store.rs > @@ -9,7 +9,6 @@ use hex::FromHex; > use tracing::{info, warn}; > =20 > use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus}; > -use pbs_config::BackupLockGuard; > use proxmox_io::ReadExt; > use proxmox_s3_client::S3Client; > use proxmox_sys::fs::{CreateOptions, create_dir, create_path, file_type_= from_file_stat}; > @@ -43,6 +42,32 @@ pub struct ChunkStore { > locker: Option>>, > } > =20 > +pub(crate) struct ChunkLockGuard { > + file: std::fs::File, > + path: PathBuf, > +} > + > +impl AsRawFd for ChunkLockGuard { > + fn as_raw_fd(&self) -> i32 { > + self.file.as_raw_fd() > + } > +} > + > +impl Drop for ChunkLockGuard { > + fn drop(&mut self) { > + // After unlink lock can be acquired, local lock is invalid > + if let Err(err) =3D nix::unistd::unlink(&self.path) { > + tracing::error!( > + "Failed to unlink chunk lock guard {:?}: {err}", > + self.path, > + ); > + } > + // implicit drop of lock guard file descriptor which unblocks fl= ock() waiters for it, > + // they can however not use it and must retry from scratch (this= is checked by stat after > + // lock). > + } > +} > + > // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ? > =20 > pub fn verify_chunk_size(size: usize) -> Result<(), Error> { > @@ -963,12 +988,44 @@ impl ChunkStore { > &self, > digest: &[u8], > timeout: Duration, > - ) -> Result { > + ) -> Result { > let lock_path =3D self.chunk_lock_path(digest); > - let guard =3D crate::backup_info::lock_helper(self.name(), &lock= _path, |path| { > - pbs_config::open_backup_lockfile(path, Some(timeout), true) > - })?; > - Ok(guard) > + > + let mut lock_dir =3D Path::new(DATASTORE_LOCKS_DIR).join(self.na= me.clone()); > + > + if let Some(parent) =3D lock_path.parent() { > + lock_dir =3D lock_dir.join(parent); > + }; > + > + std::fs::create_dir_all(&lock_dir)?; > + > + let user =3D pbs_config::backup_user()?; > + let options =3D proxmox_sys::fs::CreateOptions::new() > + .perm(nix::sys::stat::Mode::from_bits_truncate(0o660)) > + .owner(user.uid) > + .group(user.gid); > + > + loop { > + let file =3D proxmox_sys::fs::open_file_locked(&lock_path, t= imeout, true, options)?; > + > + // if stat fails here, file might have been tampered with fr= om unrelated third party or > + // due to a bug, always fail with error. > + let inode =3D nix::sys::stat::fstat(file.as_raw_fd())?.st_in= o; > + > + match nix::sys::stat::stat(&lock_path) { > + Ok(stat) =3D> if inode =3D=3D stat.st_ino { > + return Ok(ChunkLockGuard { > + file, > + path: lock_path.clone(), > + }); > + } > + Err(err) =3D> if err !=3D nix::errno::Errno::ENOENT { > + bail!("failed to stat chunk lock file {lock_path:?}:= {err}"); > + } > + } > + // neither matching inode after lock, nor unrelated error on= stat > + // spin retry > + } > } > =20 > /// Generate the next bad chunk file path for given digest. Returns = the path as well as the bad > diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore= .rs > index e2d1ae67c..e930a3b95 100644 > --- a/pbs-datastore/src/datastore.rs > +++ b/pbs-datastore/src/datastore.rs > @@ -43,7 +43,7 @@ use proxmox_section_config::SectionConfigData; > use crate::backup_info::{ > BackupDir, BackupGroup, BackupInfo, OLD_LOCKING, PROTECTED_MARKER_FI= LENAME, > }; > -use crate::chunk_store::ChunkStore; > +use crate::chunk_store::{ChunkStore, ChunkLockGuard}; > use crate::dynamic_index::{DynamicIndexReader, DynamicIndexWriter}; > use crate::fixed_index::{FixedIndexReader, FixedIndexWriter}; > use crate::hierarchy::{ListGroups, ListGroupsType, ListNamespaces, ListN= amespacesRecursive}; > @@ -3497,7 +3497,7 @@ impl DataStore { > =20 > /// Track S3 object keys to be deleted by garbage collection while holdi= ng their file lock. > struct S3DeleteList { > - list: Vec<(S3ObjectKey, BackupLockGuard)>, > + list: Vec<(S3ObjectKey, ChunkLockGuard)>, > first_entry_added: SystemTime, > age_threshold: Duration, > capacity_threshold: usize, > @@ -3516,7 +3516,7 @@ impl S3DeleteList { > =20 > /// Pushes the current key and backup lock guard to the list, updati= ng the delete list age if > /// the list was empty before the insert. > - fn push(&mut self, key: S3ObjectKey, guard: BackupLockGuard) { > + fn push(&mut self, key: S3ObjectKey, guard: ChunkLockGuard) { > // set age based on first insertion > if self.list.is_empty() { > self.first_entry_added =3D SystemTime::now(); > --=20 > 2.47.3 >=20 >=20 >=20 >=20 >