Date: Tue, 14 Oct 2025 15:46:25 +0200
From: Fabian Grünbichler
To: Proxmox Backup Server development discussion
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2 10/12] api: chunk upload: fix race between chunk backend upload and insert
Message-Id: <1760448465.h8v6n18v5i.astroid@yuna.none>
In-Reply-To: <20251008152125.849216-11-c.ebner@proxmox.com>
References: <20251008152125.849216-1-c.ebner@proxmox.com> <20251008152125.849216-11-c.ebner@proxmox.com>

On October 8, 2025 5:21 pm, Christian Ebner wrote:
> Chunk are first uploaded to the object store for S3 backed
> datastores, only then inserted into the local datastore cache.
> By this, it is assured that the chunk is only ever considered as
> valid after successful upload. Garbage collection does however rely
> on the local marker file to be present, only then considering the chunk
> as in-use. While this marker is created if not present during phase 1
> of garbage collection, this only happens for chunks which are already
> referenced by a complete index file.
>
> Therefore, there remains a race window between garbage collection
> listing the chunks (upload completed) and lookup of the local marker
> file being present (chunk cache insert after upload). This can lead to
> chunks which just finished upload, but were not yet inserted into the
> local cache store to be removed again.
>
> To close this race window, mark chunks which are currently being
> uploaded to the backend by an additional marker file, checked by the
> garbage collection as well if the regular marker is not found, and
> only removed after successful chunk insert.
>
> Concurrent chunk upload is still possible, only creating the upload
> marker file if not already present. There is still a race between the
> concurrent uploads, however since they encode for the same data
> (except compression level) both are valid, even when the winner of
> the backup upload is not the winner of the chunk insert. However, if
> one of the concurrent uploads fails, the other one must fail chunk
> insertion and the backup job fail as well, as then there is no
> guarantee that garbage collection could not have cleaned up the chunk
> in the mean time, even if such a constellation is very unlikely.

I still think this is brittle, and an approach that doesn't require
keeping the marker file in sync with the upload state in all the edge
cases would be better/easier.

Basically, what we want to achieve is to make GC aware that a chunk
existing on S3, but not locally, is caused by an upload being in
progress. This is almost 1:1 the same semantics we have for (regular)
chunks being in the chunk store, but not yet referenced by an index
(because a backup writer session is in progress). We can simply use the
same mechanism with a slight adaptation:

- protect marker file insertion/updates/removal with the chunk store
  mutex (just like we do for regular chunk files)
- when we start a chunk upload, touch (or create) the marker file for
  that chunk
- in GC, check the atime of the marker file: if it is after the GC
  cutoff, we know this chunk is valid despite missing locally, and we
  must not remove it. If the atime is before the cutoff, then by
  definition the marker file is cruft from a failed upload/backup
  writer, and we are allowed to remove the chunk and marker file.
- in the "happy path", remove the marker file after insertion into the
  chunk store, to avoid accumulating marker files

A rough sketch of what I mean is included further down. I'll also give
you an example below where the current approach fails..

> Avoid this overhead for regular datastores by passing and checking
> the backend type for the chunk store.
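
Roughly something like the following (untested sketch, std-only, all
names such as MarkerStore, touch_upload_marker, upload_in_progress and
remove_upload_marker are made up; the real thing would live in
ChunkStore, derive the marker path from the digest like chunk_path()
does, and bump the atime the same way cond_touch_chunk() does for
chunk files):

use std::fs::{FileTimes, OpenOptions};
use std::io::ErrorKind;
use std::path::PathBuf;
use std::sync::Mutex;
use std::time::SystemTime;

const BACKEND_UPLOAD_MARKER_EXT: &str = "backend-upload";

/// stand-in for the chunk store, just enough to show the marker handling
struct MarkerStore {
    base: PathBuf,
    mutex: Mutex<()>,
}

impl MarkerStore {
    /// marker lives next to where the chunk file would be
    fn marker_path(&self, digest_str: &str) -> PathBuf {
        self.base
            .join(digest_str)
            .with_extension(BACKEND_UPLOAD_MARKER_EXT)
    }

    /// called when an upload to the backend starts: create the marker if it
    /// does not exist yet, otherwise just bump its timestamps (same effect
    /// as touching a regular chunk file)
    fn touch_upload_marker(&self, digest_str: &str) -> std::io::Result<()> {
        let path = self.marker_path(digest_str);
        let _guard = self.mutex.lock().unwrap(); // same lock as chunk insertion

        let file = match OpenOptions::new().write(true).create_new(true).open(&path) {
            Ok(file) => file,
            Err(err) if err.kind() == ErrorKind::AlreadyExists => {
                OpenOptions::new().write(true).open(&path)?
            }
            Err(err) => return Err(err),
        };
        let now = SystemTime::now();
        file.set_times(FileTimes::new().set_accessed(now).set_modified(now))
    }

    /// GC side: a chunk that exists on S3 but not locally is only protected
    /// if its upload marker was touched after the GC cutoff; otherwise the
    /// marker is cruft from a failed upload and chunk + marker may go away
    fn upload_in_progress(&self, digest_str: &str, gc_cutoff: SystemTime) -> std::io::Result<bool> {
        let _guard = self.mutex.lock().unwrap();
        match std::fs::metadata(self.marker_path(digest_str)) {
            Ok(stat) => Ok(stat.accessed()? >= gc_cutoff),
            Err(err) if err.kind() == ErrorKind::NotFound => Ok(false),
            Err(err) => Err(err),
        }
    }

    /// happy path: drop the marker after the chunk got inserted into the
    /// local cache, so markers don't accumulate; a missing marker is fine
    fn remove_upload_marker(&self, digest_str: &str) -> std::io::Result<()> {
        let _guard = self.mutex.lock().unwrap();
        match std::fs::remove_file(self.marker_path(digest_str)) {
            Err(err) if err.kind() != ErrorKind::NotFound => Err(err),
            _ => Ok(()),
        }
    }
}

With that, the "Not found, mark for deletion" branch in the S3 listing
below would check the marker's atime against the phase 1 cutoff
(upload_in_progress in the sketch) instead of just checking whether a
marker file exists.
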
> > Signed-off-by: Christian Ebner > --- > pbs-datastore/src/chunk_store.rs | 136 +++++++++++++++++- > pbs-datastore/src/datastore.rs | 32 +++++ > .../src/local_datastore_lru_cache.rs | 3 +- > src/api2/backup/upload_chunk.rs | 19 ++- > src/api2/config/datastore.rs | 2 + > 5 files changed, 179 insertions(+), 13 deletions(-) > > diff --git a/pbs-datastore/src/chunk_store.rs b/pbs-datastore/src/chunk_store.rs > index 65d74d68e..5b1f397bd 100644 > --- a/pbs-datastore/src/chunk_store.rs > +++ b/pbs-datastore/src/chunk_store.rs > @@ -7,7 +7,7 @@ use std::time::Duration; > use anyhow::{bail, format_err, Context, Error}; > use tracing::{info, warn}; > > -use pbs_api_types::{DatastoreFSyncLevel, GarbageCollectionStatus}; > +use pbs_api_types::{DatastoreBackendType, DatastoreFSyncLevel, GarbageCollectionStatus}; > use proxmox_io::ReadExt; > use proxmox_s3_client::S3Client; > use proxmox_sys::fs::{create_dir, create_path, file_type_from_file_stat, CreateOptions}; > @@ -22,6 +22,10 @@ use crate::file_formats::{ > }; > use crate::DataBlob; > > +/// Filename extension for local datastore cache marker files indicating > +/// the chunk being uploaded to the backend > +const BACKEND_UPLOAD_MARKER_EXT: &str = "backend-upload"; > + > /// File system based chunk store > pub struct ChunkStore { > name: String, // used for error reporting > @@ -30,6 +34,7 @@ pub struct ChunkStore { > mutex: Mutex<()>, > locker: Option>>, > sync_level: DatastoreFSyncLevel, > + datastore_backend_type: DatastoreBackendType, > } > > // TODO: what about sysctl setting vm.vfs_cache_pressure (0 - 100) ? > @@ -77,6 +82,7 @@ impl ChunkStore { > mutex: Mutex::new(()), > locker: None, > sync_level: Default::default(), > + datastore_backend_type: DatastoreBackendType::default(), > } > } > > @@ -97,6 +103,7 @@ impl ChunkStore { > uid: nix::unistd::Uid, > gid: nix::unistd::Gid, > sync_level: DatastoreFSyncLevel, > + datastore_backend_type: DatastoreBackendType, > ) -> Result > where > P: Into, > @@ -151,7 +158,7 @@ impl ChunkStore { > } > } > > - Self::open(name, base, sync_level) > + Self::open(name, base, sync_level, datastore_backend_type) > } > > fn lockfile_path>(base: P) -> PathBuf { > @@ -185,6 +192,7 @@ impl ChunkStore { > name: &str, > base: P, > sync_level: DatastoreFSyncLevel, > + datastore_backend_type: DatastoreBackendType, > ) -> Result { > let base: PathBuf = base.into(); > > @@ -201,6 +209,7 @@ impl ChunkStore { > locker: Some(locker), > mutex: Mutex::new(()), > sync_level, > + datastore_backend_type, > }) > } > > @@ -557,10 +566,96 @@ impl ChunkStore { > Ok(()) > } > > + // Try to insert a new backend upload marker, signaling to garbage collection that there is an > + // in-progress upload for this chunk. > + // > + // Returns true if the marker was created or pre-existed (concurrent upload), returns false > + // if the chunk has been inserted since, the marker file not being created and the upload > + // must be avoided. > + pub(crate) fn insert_backend_upload_marker(&self, digest: &[u8; 32]) -> Result { > + if self.datastore_backend_type == DatastoreBackendType::Filesystem { > + bail!("cannot create backend upload marker, not a cache store"); > + } > + let (marker_path, digest_str) = self.chunk_backed_upload_marker_path(digest); > + > + let _lock = self.mutex.lock(); > + > + if self.cond_touch_chunk(digest, false)? 
{ > + return Ok(false); > + } > + > + if let Err(err) = std::fs::File::options() > + .write(true) > + .create_new(true) > + .open(&marker_path) > + { > + if err.kind() != std::io::ErrorKind::AlreadyExists { > + return Err(err).with_context(|| { > + format!("failed to create backend upload marker for chunk {digest_str}") > + }); > + } > + } > + Ok(true) > + } > + > + pub(crate) fn cleanup_backend_upload_marker(&self, digest: &[u8; 32]) -> Result<(), Error> { > + if self.datastore_backend_type == DatastoreBackendType::Filesystem { > + bail!("cannot cleanup backend upload marker, not a cache store"); > + } > + let (marker_path, _digest_str) = self.chunk_backed_upload_marker_path(digest); > + > + let _lock = self.mutex.lock(); > + > + if let Err(err) = std::fs::remove_file(marker_path) { > + if err.kind() != std::io::ErrorKind::NotFound { > + bail!("failed to cleanup backend upload marker: {err}"); > + } > + } > + Ok(()) > + } > + > + pub(crate) fn backend_upload_marker_exists(&self, digest: &[u8; 32]) -> Result { > + let (marker_path, _digest_str) = self.chunk_backed_upload_marker_path(digest); > + if let Err(err) = std::fs::metadata(marker_path) { > + if err.kind() != std::io::ErrorKind::NotFound { > + bail!("failed to check backend upload marker: {err}"); > + } > + return Ok(false); > + } > + Ok(true) > + } > + > pub fn insert_chunk(&self, chunk: &DataBlob, digest: &[u8; 32]) -> Result<(bool, u64), Error> { > self.insert_chunk_impl(chunk, digest, |_, _| Ok(())) > } > > + pub fn insert_chunk_and_remove_upload_marker( > + &self, > + chunk: &DataBlob, > + digest: &[u8; 32], > + ) -> Result<(bool, u64), Error> { > + self.insert_chunk_impl(chunk, digest, |digest, pre_existing| { > + if self.datastore_backend_type != DatastoreBackendType::Filesystem { > + // Must fail as if the marker is no longer present but the chunk file is not > + // present, as this indicates that a concurrent upload failed, no guarantee that > + // garbage collection did not cleanup chunks in the mean time. > + let (chunk_path, digest_str) = self.chunk_path(digest); > + if let Err(err) = > + std::fs::remove_file(chunk_path.with_extension(BACKEND_UPLOAD_MARKER_EXT)) > + { > + if !(pre_existing && err.kind() == std::io::ErrorKind::NotFound) { > + bail!( > + "concurrent upload failed on store '{}' for {digest_str} - {err}", > + self.name, > + ) > + } > + } > + } > + > + Ok(()) > + }) > + } > + > fn insert_chunk_impl( > &self, > chunk: &DataBlob, > @@ -692,6 +787,15 @@ impl ChunkStore { > Ok(()) > } > > + /// Generate file path for backend upload marker of given chunk digest. 
> + fn chunk_backed_upload_marker_path(&self, digest: &[u8; 32]) -> (PathBuf, String) { > + let (chunk_path, digest_str) = self.chunk_path(digest); > + ( > + chunk_path.with_extension(BACKEND_UPLOAD_MARKER_EXT), > + digest_str, > + ) > + } > + > pub fn relative_path(&self, path: &Path) -> PathBuf { > // unwrap: only `None` in unit tests > assert!(self.locker.is_some()); > @@ -777,14 +881,26 @@ fn test_chunk_store1() { > > if let Err(_e) = std::fs::remove_dir_all(".testdir") { /* ignore */ } > > - let chunk_store = ChunkStore::open("test", &path, DatastoreFSyncLevel::None); > + let chunk_store = ChunkStore::open( > + "test", > + &path, > + DatastoreFSyncLevel::None, > + DatastoreBackendType::Filesystem, > + ); > assert!(chunk_store.is_err()); > > let user = nix::unistd::User::from_uid(nix::unistd::Uid::current()) > .unwrap() > .unwrap(); > - let chunk_store = > - ChunkStore::create("test", &path, user.uid, user.gid, DatastoreFSyncLevel::None).unwrap(); > + let chunk_store = ChunkStore::create( > + "test", > + &path, > + user.uid, > + user.gid, > + DatastoreFSyncLevel::None, > + DatastoreBackendType::Filesystem, > + ) > + .unwrap(); > > let (chunk, digest) = crate::data_blob::DataChunkBuilder::new(&[0u8, 1u8]) > .build() > @@ -796,8 +912,14 @@ fn test_chunk_store1() { > let (exists, _) = chunk_store.insert_chunk(&chunk, &digest).unwrap(); > assert!(exists); > > - let chunk_store = > - ChunkStore::create("test", &path, user.uid, user.gid, DatastoreFSyncLevel::None); > + let chunk_store = ChunkStore::create( > + "test", > + &path, > + user.uid, > + user.gid, > + DatastoreFSyncLevel::None, > + DatastoreBackendType::Filesystem, > + ); > assert!(chunk_store.is_err()); > > if let Err(_e) = std::fs::remove_dir_all(".testdir") { /* ignore */ } > diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs > index a6b17e3c3..e40b6883b 100644 > --- a/pbs-datastore/src/datastore.rs > +++ b/pbs-datastore/src/datastore.rs > @@ -336,10 +336,14 @@ impl DataStore { > DatastoreTuning::API_SCHEMA > .parse_property_string(config.tuning.as_deref().unwrap_or(""))?, > )?; > + let backend_config: DatastoreBackendConfig = > + config.backend.as_deref().unwrap_or("").parse()?; > + let backend_type = backend_config.ty.unwrap_or_default(); > Arc::new(ChunkStore::open( > name, > config.absolute_path(), > tuning.sync_level.unwrap_or_default(), > + backend_type, > )?) > }; > > @@ -424,10 +428,16 @@ impl DataStore { > DatastoreTuning::API_SCHEMA > .parse_property_string(config.tuning.as_deref().unwrap_or(""))?, > )?; > + let backend_config: DatastoreBackendConfig = serde_json::from_value( > + DatastoreBackendConfig::API_SCHEMA > + .parse_property_string(config.backend.as_deref().unwrap_or(""))?, > + )?; > + let backend_type = backend_config.ty.unwrap_or_default(); > let chunk_store = ChunkStore::open( > &name, > config.absolute_path(), > tuning.sync_level.unwrap_or_default(), > + backend_type, > )?; > let inner = Arc::new(Self::with_store_and_config( > Arc::new(chunk_store), > @@ -1651,6 +1661,13 @@ impl DataStore { > let atime = match std::fs::metadata(&chunk_path) { > Ok(stat) => stat.accessed()?, > Err(err) if err.kind() == std::io::ErrorKind::NotFound => { > + if self > + .inner > + .chunk_store > + .backend_upload_marker_exists(&digest)? 
> + { > + continue; > + } > // File not found, delete by setting atime to unix epoch > info!("Not found, mark for deletion: {}", content.key); > SystemTime::UNIX_EPOCH > @@ -1857,6 +1874,21 @@ impl DataStore { > self.inner.chunk_store.insert_chunk(chunk, digest) > } > > + // Try to insert a new backend upload marker, signaling to garbage collection that there is an > + // in-progress upload for this chunk. > + // > + // Returns true if the marker was created or pre-existed (concurrent upload), returns false > + // if the chunk has been inserted since, the marker file not being created and the upload > + // must be avoided. > + pub fn insert_backend_upload_marker(&self, digest: &[u8; 32]) -> Result { > + self.inner.chunk_store.insert_backend_upload_marker(digest) > + } > + > + /// Remove the marker file signaling an in-progress upload to the backend > + pub fn cleanup_backend_upload_marker(&self, digest: &[u8; 32]) -> Result<(), Error> { > + self.inner.chunk_store.cleanup_backend_upload_marker(digest) > + } > + > pub fn stat_chunk(&self, digest: &[u8; 32]) -> Result { > let (chunk_path, _digest_str) = self.inner.chunk_store.chunk_path(digest); > std::fs::metadata(chunk_path).map_err(Error::from) > diff --git a/pbs-datastore/src/local_datastore_lru_cache.rs b/pbs-datastore/src/local_datastore_lru_cache.rs > index fe3b51a55..cdad77031 100644 > --- a/pbs-datastore/src/local_datastore_lru_cache.rs > +++ b/pbs-datastore/src/local_datastore_lru_cache.rs > @@ -34,7 +34,8 @@ impl LocalDatastoreLruCache { > /// > /// Fails if the chunk cannot be inserted successfully. > pub fn insert(&self, digest: &[u8; 32], chunk: &DataBlob) -> Result<(), Error> { > - self.store.insert_chunk(chunk, digest)?; > + self.store > + .insert_chunk_and_remove_upload_marker(chunk, digest)?; > self.cache > .insert(*digest, (), |digest| self.store.clear_chunk(&digest)) > } > diff --git a/src/api2/backup/upload_chunk.rs b/src/api2/backup/upload_chunk.rs > index 8dd7e4d52..7d1f863ed 100644 > --- a/src/api2/backup/upload_chunk.rs > +++ b/src/api2/backup/upload_chunk.rs > @@ -259,6 +259,7 @@ async fn upload_to_backend( > data.len() > ); > } > + let datastore = env.datastore.clone(); > > if env.no_cache { > let object_key = pbs_datastore::s3::object_key_from_digest(&digest)?; > @@ -272,11 +273,11 @@ async fn upload_to_backend( > // Avoid re-upload to S3 if the chunk is either present in the LRU cache or the chunk > // file exists on filesystem. The latter means that the chunk has been present in the > // past an was not cleaned up by garbage collection, so contained in the S3 object store. > - if env.datastore.cache_contains(&digest) { > + if datastore.cache_contains(&digest) { > tracing::info!("Skip upload of cached chunk {}", hex::encode(digest)); > return Ok((digest, size, encoded_size, true)); > } > - if let Ok(true) = env.datastore.cond_touch_chunk(&digest, false) { > + if let Ok(true) = datastore.cond_touch_chunk(&digest, false) { > tracing::info!( > "Skip upload of already encountered chunk {}", > hex::encode(digest) > @@ -286,16 +287,24 @@ async fn upload_to_backend( > > tracing::info!("Upload of new chunk {}", hex::encode(digest)); > let object_key = pbs_datastore::s3::object_key_from_digest(&digest)?; > - let is_duplicate = s3_client > + if !datastore.insert_backend_upload_marker(&digest)? 
{ > + return Ok((digest, size, encoded_size, true)); > + } > + let is_duplicate = match s3_client > .upload_no_replace_with_retry(object_key, data.clone()) what are the semantics of two concurrent calls to this, if the chunk does not exist before on the S3 side? does the second call return it's a duplicate even if the first one is still in progress? do both happen and the result doesn't tell us which attempt "won"? does one of them error out? is this implementation/backend specific, or specified? > .await > - .map_err(|err| format_err!("failed to upload chunk to s3 backend - {err:#}"))?; > + { > + Ok(is_duplicate) => is_duplicate, > + Err(err) => { > + datastore.cleanup_backend_upload_marker(&digest)?; so if - for any reason - a concurrent upload fails, we will clean up the chunk marker here, and fail other backup sessions that happened to upload the same chunk digest in parallel (because they will encounter a missing pending upload marker when attempting to insert the chunk into the chunk store..) > + bail!("failed to upload chunk to s3 backend - {err:#}"); > + } > + }; > > // Only insert the chunk into the cache after it has been successufuly uploaded. > // Although less performant than doing this in parallel, it is required for consisency > // since chunks are considered as present on the backend if the file exists in the local > // cache store. > - let datastore = env.datastore.clone(); > tracing::info!("Caching of chunk {}", hex::encode(digest)); > let _ = tokio::task::spawn_blocking(move || { > let chunk = DataBlob::from_raw(data.to_vec())?; > diff --git a/src/api2/config/datastore.rs b/src/api2/config/datastore.rs > index 3b03c0466..541bd0a04 100644 > --- a/src/api2/config/datastore.rs > +++ b/src/api2/config/datastore.rs > @@ -173,6 +173,7 @@ pub(crate) fn do_create_datastore( > &datastore.name, > &path, > tuning.sync_level.unwrap_or_default(), > + backend_type, > ) > })? > } else { > @@ -204,6 +205,7 @@ pub(crate) fn do_create_datastore( > backup_user.uid, > backup_user.gid, > tuning.sync_level.unwrap_or_default(), > + backend_type, > )? > }; > > -- > 2.47.3 > > > > _______________________________________________ > pbs-devel mailing list > pbs-devel@lists.proxmox.com > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel > > > _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel