public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction
@ 2025-10-16 13:18 Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

This patch series addresses issues with verification tasks on S3 backends.
Most importantly, it ensures that corrupt chunks are also evicted from the
in-memory LRU cache, as otherwise a new, non-corrupt chunk will not be
re-inserted into the chunk store if its digest is still cached as recently
used. It also refines the distinction between chunk fetching and chunk
parsing error handling, as the latter must flag the chunk as bad while the
former must not.

Further, avoid issues caused by concurrent chunk inserts and/or garbage
collection while renaming corrupt chunks. The chunk store mutex lock was
never acquired, independent of the datastore backend, although this is less
problematic on filesystem backends due to the atomic nature of the chunk
file rename.

Finally, make sure that the mutex lock guarding the corrupt chunk list is
not held when entering the rename helper, as that helper performs async API
calls to the object store for datastores backed by S3, which could
potentially lead to deadlocks.
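
As a rough illustration of the locking pattern this aims for (a minimal
sketch with hypothetical types, not the actual PBS code), the guard
protecting the corrupt chunk list is dropped before calling into the rename
helper, which internally blocks on async S3 requests:

    use std::collections::HashSet;
    use std::sync::{Arc, Mutex};

    // Hypothetical stand-in for the datastore used during verification.
    struct Datastore;

    impl Datastore {
        // Internally blocks on async S3 calls (copy + delete of the object).
        fn rename_corrupted_chunk(&self, _digest: &[u8; 32]) { /* ... */ }
    }

    fn flag_corrupt(
        corrupt_chunks: &Arc<Mutex<HashSet<[u8; 32]>>>,
        store: &Datastore,
        digest: [u8; 32],
    ) {
        {
            // Scope the guard so the lock is released before any blocking/async work.
            let mut list = corrupt_chunks.lock().unwrap();
            list.insert(digest);
        } // guard dropped here
        // No mutex is held while the helper awaits S3 operations.
        store.rename_corrupted_chunk(&digest);
    }

    fn main() {
        let corrupt_chunks = Arc::new(Mutex::new(HashSet::new()));
        flag_corrupt(&corrupt_chunks, &Datastore, [0u8; 32]);
    }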

proxmox-backup:

Christian Ebner (6):
  verify/datastore: make rename corrupt chunk a datastore helper method
  datastore: refactor rename_corrupted_chunk error handling
  verify: never hold mutex lock in async scope on corrupt chunk rename
  datastore: acquire chunk store mutex lock when renaming corrupt chunk
  datastore: verify: evict corrupt chunks from in-memory LRU cache
  verify: distinguish s3 object fetching and chunk loading error

 pbs-datastore/src/datastore.rs |  63 ++++++++++++++++++
 src/backup/verify.rs           | 117 ++++++++-------------------------
 2 files changed, 92 insertions(+), 88 deletions(-)


Summary over all repositories:
  2 files changed, 92 insertions(+), 88 deletions(-)

-- 
Generated by git-murpp 0.8.1



* [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

By making this a helper method of the datastore, it becomes possible to
access the inner chunk store for locking etc. within the method. That will
be required to correctly lock the store in order to avoid concurrent chunk
inserts and garbage collection operations during the rename, guaranteeing
consistency on datastores with an S3 backend.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 70 +++++++++++++++++++++++++++++++
 src/backup/verify.rs           | 75 +---------------------------------
 2 files changed, 72 insertions(+), 73 deletions(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 038306166..802a39536 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2418,4 +2418,74 @@ impl DataStore {
             .map_err(|err| format_err!("{err:#}"))?;
         Ok((backend_type, Some(s3_client)))
     }
+
+    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) {
+        let (path, digest_str) = self.chunk_path(digest);
+
+        let mut counter = 0;
+        let mut new_path = path.clone();
+        loop {
+            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+            if new_path.exists() && counter < 9 {
+                counter += 1;
+            } else {
+                break;
+            }
+        }
+
+        let backend = match self.backend() {
+            Ok(backend) => backend,
+            Err(err) => {
+                info!(
+                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
+                );
+                return;
+            }
+        };
+
+        if let DatastoreBackend::S3(s3_client) = backend {
+            let suffix = format!(".{}.bad", counter);
+            let target_key = match crate::s3::object_key_from_digest_with_suffix(digest, &suffix) {
+                Ok(target_key) => target_key,
+                Err(err) => {
+                    info!("could not generate target key for corrupted chunk {path:?} - {err}");
+                    return;
+                }
+            };
+            let object_key = match crate::s3::object_key_from_digest(digest) {
+                Ok(object_key) => object_key,
+                Err(err) => {
+                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
+                    return;
+                }
+            };
+            if proxmox_async::runtime::block_on(
+                s3_client.copy_object(object_key.clone(), target_key),
+            )
+            .is_ok()
+            {
+                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
+                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
+                }
+            } else {
+                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
+                // Early return to leave the potentially locally cached chunk in the same state as
+                // on the object store. Verification might have failed because of connection issue
+                // after all.
+                return;
+            }
+        }
+
+        match std::fs::rename(&path, &new_path) {
+            Ok(_) => {
+                info!("corrupted chunk renamed to {:?}", &new_path);
+            }
+            Err(err) => {
+                match err.kind() {
+                    std::io::ErrorKind::NotFound => { /* ignored */ }
+                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
+                }
+            }
+        };
+    }
 }
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index bdbe3148b..92d3d9c49 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -76,77 +76,6 @@ impl VerifyWorker {
         }
     }
 
-    fn rename_corrupted_chunk(datastore: Arc<DataStore>, digest: &[u8; 32]) {
-        let (path, digest_str) = datastore.chunk_path(digest);
-
-        let mut counter = 0;
-        let mut new_path = path.clone();
-        loop {
-            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
-            if new_path.exists() && counter < 9 {
-                counter += 1;
-            } else {
-                break;
-            }
-        }
-
-        let backend = match datastore.backend() {
-            Ok(backend) => backend,
-            Err(err) => {
-                info!(
-                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
-                );
-                return;
-            }
-        };
-
-        if let DatastoreBackend::S3(s3_client) = backend {
-            let suffix = format!(".{}.bad", counter);
-            let target_key =
-                match pbs_datastore::s3::object_key_from_digest_with_suffix(digest, &suffix) {
-                    Ok(target_key) => target_key,
-                    Err(err) => {
-                        info!("could not generate target key for corrupted chunk {path:?} - {err}");
-                        return;
-                    }
-                };
-            let object_key = match pbs_datastore::s3::object_key_from_digest(digest) {
-                Ok(object_key) => object_key,
-                Err(err) => {
-                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            if proxmox_async::runtime::block_on(
-                s3_client.copy_object(object_key.clone(), target_key),
-            )
-            .is_ok()
-            {
-                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
-                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
-                }
-            } else {
-                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
-                // Early return to leave the potentially locally cached chunk in the same state as
-                // on the object store. Verification might have failed because of connection issue
-                // after all.
-                return;
-            }
-        }
-
-        match std::fs::rename(&path, &new_path) {
-            Ok(_) => {
-                info!("corrupted chunk renamed to {:?}", &new_path);
-            }
-            Err(err) => {
-                match err.kind() {
-                    std::io::ErrorKind::NotFound => { /* ignored */ }
-                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
-                }
-            }
-        };
-    }
-
     fn verify_index_chunks(
         &self,
         index: Box<dyn IndexFile + Send>,
@@ -189,7 +118,7 @@ impl VerifyWorker {
                     corrupt_chunks2.lock().unwrap().insert(digest);
                     info!("{err}");
                     errors2.fetch_add(1, Ordering::SeqCst);
-                    Self::rename_corrupted_chunk(datastore2.clone(), &digest);
+                    datastore2.rename_corrupted_chunk(&digest);
                 } else {
                     verified_chunks2.lock().unwrap().insert(digest);
                 }
@@ -336,7 +265,7 @@ impl VerifyWorker {
         corrupt_chunks.insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
-        Self::rename_corrupted_chunk(self.datastore.clone(), &digest);
+        self.datastore.rename_corrupted_chunk(&digest);
     }
 
     fn verify_fixed_index(&self, backup_dir: &BackupDir, info: &FileInfo) -> Result<(), Error> {
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

As part of the verification process, the helper was not intended to return
errors on failure, but rather to just log information and errors.

Refactoring the code so that the helper method returns errors and an
optional success message makes it more concise and readable.

However, keep logging at info level at the call site for both error and
success messages, so as not to interfere with the task log.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 85 ++++++++++++++--------------------
 src/backup/verify.rs           | 12 ++++-
 2 files changed, 44 insertions(+), 53 deletions(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 802a39536..c280b82c7 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2419,13 +2419,13 @@ impl DataStore {
         Ok((backend_type, Some(s3_client)))
     }
 
-    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) {
+    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) -> Result<Option<String>, Error> {
         let (path, digest_str) = self.chunk_path(digest);
 
         let mut counter = 0;
         let mut new_path = path.clone();
         loop {
-            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+            new_path.set_file_name(format!("{digest_str}.{counter}.bad"));
             if new_path.exists() && counter < 9 {
                 counter += 1;
             } else {
@@ -2433,59 +2433,42 @@ impl DataStore {
             }
         }
 
-        let backend = match self.backend() {
-            Ok(backend) => backend,
-            Err(err) => {
-                info!(
-                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
-                );
-                return;
-            }
-        };
+        let backend = self.backend().map_err(|err| {
+            format_err!(
+                "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
+            )
+        })?;
 
         if let DatastoreBackend::S3(s3_client) = backend {
-            let suffix = format!(".{}.bad", counter);
-            let target_key = match crate::s3::object_key_from_digest_with_suffix(digest, &suffix) {
-                Ok(target_key) => target_key,
-                Err(err) => {
-                    info!("could not generate target key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            let object_key = match crate::s3::object_key_from_digest(digest) {
-                Ok(object_key) => object_key,
-                Err(err) => {
-                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            if proxmox_async::runtime::block_on(
-                s3_client.copy_object(object_key.clone(), target_key),
-            )
-            .is_ok()
-            {
-                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
-                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
-                }
-            } else {
-                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
-                // Early return to leave the potentially locally cached chunk in the same state as
-                // on the object store. Verification might have failed because of connection issue
-                // after all.
-                return;
-            }
+            let suffix = format!(".{counter}.bad");
+            let target_key = crate::s3::object_key_from_digest_with_suffix(digest, &suffix)
+                .map_err(|err| {
+                    format_err!(
+                        "could not generate target key for corrupted chunk {path:?} - {err}"
+                    )
+                })?;
+            let object_key = crate::s3::object_key_from_digest(digest).map_err(|err| {
+                format_err!("could not generate object key for corrupted chunk {path:?} - {err}")
+            })?;
+
+            proxmox_async::runtime::block_on(s3_client.copy_object(object_key.clone(), target_key))
+                .map_err(|err| {
+                    format_err!("failed to copy corrupt chunk on s3 backend: {digest_str} - {err}")
+                })?;
+
+            proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).map_err(
+                |err| {
+                    format_err!(
+                        "failed to delete corrupt chunk on s3 backend: {digest_str} - {err}"
+                    )
+                },
+            )?;
         }
 
         match std::fs::rename(&path, &new_path) {
-            Ok(_) => {
-                info!("corrupted chunk renamed to {:?}", &new_path);
-            }
-            Err(err) => {
-                match err.kind() {
-                    std::io::ErrorKind::NotFound => { /* ignored */ }
-                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
-                }
-            }
-        };
+            Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
+            Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
+            Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
+        }
     }
 }
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 92d3d9c49..39f36cd95 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -118,7 +118,11 @@ impl VerifyWorker {
                     corrupt_chunks2.lock().unwrap().insert(digest);
                     info!("{err}");
                     errors2.fetch_add(1, Ordering::SeqCst);
-                    datastore2.rename_corrupted_chunk(&digest);
+                    match datastore2.rename_corrupted_chunk(&digest) {
+                        Ok(Some(message)) => info!("{message}"),
+                        Err(err) => info!("{err}"),
+                        _ => (),
+                    }
                 } else {
                     verified_chunks2.lock().unwrap().insert(digest);
                 }
@@ -265,7 +269,11 @@ impl VerifyWorker {
         corrupt_chunks.insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
-        self.datastore.rename_corrupted_chunk(&digest);
+        match self.datastore.rename_corrupted_chunk(&digest) {
+            Ok(Some(message)) => info!("{message}"),
+            Err(err) => info!("{err}"),
+            _ => (),
+        }
     }
 
     fn verify_fixed_index(&self, backup_dir: &BackupDir, info: &FileInfo) -> Result<(), Error> {
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Holding a mutex lock across async await boundaries is prone to deadlock
[0]. Renaming a corrupt chunk, however, requires async API calls in case of
datastores backed by S3.

Fix this by simply not holding onto the mutex lock guarding the corrupt
chunk list during chunk verification tasks when calling the rename method.
If the chunk is already present in this list, no other verification task
will be operating on that exact chunk anyway.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
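
For clarity, a minimal standalone sketch of the difference (plain std
types, not the verify code): binding the MutexGuard to a variable keeps the
lock held for the rest of the scope, while using it as a temporary releases
it at the end of the statement.

    use std::collections::HashSet;
    use std::sync::Mutex;

    fn main() {
        let corrupt_chunks: Mutex<HashSet<u32>> = Mutex::new(HashSet::new());

        // Variant A: the guard lives until the end of this block, so the lock
        // would still be held during any further (possibly blocking) work.
        {
            let mut guard = corrupt_chunks.lock().unwrap();
            guard.insert(1);
            // lock still held here ...
        } // ... and only released at the closing brace

        // Variant B: the guard is a temporary dropped at the end of the
        // statement, so no lock is held during whatever follows.
        corrupt_chunks.lock().unwrap().insert(2);
        // safe to perform blocking or async-bridged work from here on
    }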

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/backup/verify.rs | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 39f36cd95..b1066f6f5 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -265,8 +265,7 @@ impl VerifyWorker {
 
     fn add_corrupt_chunk(&self, digest: [u8; 32], errors: Arc<AtomicUsize>, message: &str) {
         // Panic on poisoned mutex
-        let mut corrupt_chunks = self.corrupt_chunks.lock().unwrap();
-        corrupt_chunks.insert(digest);
+        self.corrupt_chunks.lock().unwrap().insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
         match self.datastore.rename_corrupted_chunk(&digest) {
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (2 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

While the rename itself is an atomic operation, it must be assured that no
other task, such as garbage collection or a backup chunk insert, expects to
hold exclusive access to the chunk store at the same time.
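
For context, a small standalone sketch of the guard binding used here
(plain std types, not the datastore code): binding the guard to `_lock`
keeps the chunk store locked across the rename, whereas `let _ = ...` or a
bare temporary would drop it immediately.

    use std::sync::Mutex;

    fn main() {
        let chunk_store_mutex = Mutex::new(());

        {
            // `_lock` keeps the guard (and thus the lock) alive until the end
            // of the scope, covering the rename that follows.
            let _lock = chunk_store_mutex.lock().unwrap();
            // std::fs::rename(&path, &new_path) would happen here, serialized
            // against garbage collection and chunk inserts.
        }

        // Pitfall: `let _ = ...` drops the guard right away and protects nothing.
        let _ = chunk_store_mutex.lock().unwrap();
        // the lock is already released again at this point
    }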

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index c280b82c7..a7ea8fd96 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2465,6 +2465,8 @@ impl DataStore {
             )?;
         }
 
+        let _lock = self.inner.chunk_store.mutex().lock().unwrap();
+
         match std::fs::rename(&path, &new_path) {
             Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
             Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (3 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Chunks detected as corrupt have been renamed on both the S3 backend and the
local datastore cache, but not evicted from the in-memory cache holding the
LRU chunk digests. This can lead to such chunks being considered as already
present if their digest is still cached, and therefore not being
re-inserted into the local store cache and the S3 backend on backup upload.

Fix this by not only renaming the local datastore's chunk marker file, but
also removing the digest from the in-memory cache while holding the chunk
store mutex lock, to exclude interference from concurrent chunk inserts.
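
As a rough sketch of the failure mode under the cache semantics assumed
here (hypothetical names, not the actual cache API): a digest still present
in the in-memory LRU makes the insert path treat the chunk as already
present and skip the upload, so the digest has to be evicted alongside the
rename.

    use std::collections::HashSet;

    // Hypothetical in-memory digest cache standing in for the real LRU cache.
    struct DigestCache {
        cached: HashSet<[u8; 32]>,
    }

    impl DigestCache {
        // Insert path as assumed here: a cached digest short-circuits the
        // upload, so a renamed-away (corrupt) chunk would never be re-uploaded
        // unless its digest is evicted from this cache as well.
        fn insert_chunk(&mut self, digest: [u8; 32], upload: impl FnOnce()) {
            if self.cached.contains(&digest) {
                return; // treated as "already present" - the bug being fixed
            }
            upload();
            self.cached.insert(digest);
        }

        // The fix: evict the digest together with renaming the marker file.
        fn evict(&mut self, digest: &[u8; 32]) {
            self.cached.remove(digest);
        }
    }

    fn main() {
        let mut cache = DigestCache { cached: HashSet::new() };
        let digest = [0u8; 32];
        cache.insert_chunk(digest, || println!("uploaded"));
        cache.evict(&digest); // after the chunk was detected as corrupt
        cache.insert_chunk(digest, || println!("re-uploaded")); // happens again
    }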

Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index a7ea8fd96..c551679e3 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2467,10 +2467,18 @@ impl DataStore {
 
         let _lock = self.inner.chunk_store.mutex().lock().unwrap();
 
-        match std::fs::rename(&path, &new_path) {
+        let result = match std::fs::rename(&path, &new_path) {
             Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
             Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
             Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
+        };
+
+        if let Some(cache) = self.cache() {
+            // Requiremets for calling the unsafe method are met, since snapshots referencing the
+            // corrupt chunk are to be considered corrupt. Ignore the error due to the missing file.
+            let _ = unsafe { cache.remove(digest) };
         }
+
+        result
     }
 }
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (4 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  5 siblings, 0 replies; 7+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Errors while fetching chunks from the object store might be caused by
transient issues and must therefore be handled such that they do not
incorrectly mark chunks as corrupt.

Errors when creating the chunk from the response data, which includes the
chunk header and validity checks, must however lead to the chunk being
flagged as bad. Adapt the code so these two error classes are correctly
distinguished.
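
Schematically, the intended split looks roughly like this (simplified
sketch, with a hypothetical parse_data_blob() standing in for
DataBlob::from_raw(); the real change is in the diff below): a failed fetch
only counts as a verification error, while a failed parse additionally
marks the chunk as corrupt.

    fn handle_chunk(fetch: Result<Vec<u8>, String>) {
        match fetch {
            Err(err) => {
                // transient/network problem: count the error, do not rename the chunk
                eprintln!("chunk fetch failed: {err}");
            }
            Ok(raw) => match parse_data_blob(&raw) {
                Ok(_blob) => { /* hand over to the decoder/verify pool */ }
                Err(err) => {
                    // invalid chunk data: flag as corrupt (rename + cache eviction)
                    eprintln!("corrupt chunk: {err}");
                }
            },
        }
    }

    // Hypothetical stand-in for the real header/validity checks.
    fn parse_data_blob(raw: &[u8]) -> Result<(), String> {
        if raw.len() < 12 {
            return Err("short chunk header".into());
        }
        Ok(())
    }

    fn main() {
        handle_chunk(Err("connection reset".into()));
        handle_chunk(Ok(vec![0u8; 4]));
        handle_chunk(Ok(vec![0u8; 64]));
    }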

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/backup/verify.rs | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index b1066f6f5..4cfd81350 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -225,19 +225,24 @@ impl VerifyWorker {
                 let object_key = pbs_datastore::s3::object_key_from_digest(&info.digest)?;
                 match proxmox_async::runtime::block_on(s3_client.get_object(object_key)) {
                     Ok(Some(response)) => {
-                        let chunk_result = proxmox_lang::try_block!({
-                            let bytes =
-                                proxmox_async::runtime::block_on(response.content.collect())?
-                                    .to_bytes();
-                            DataBlob::from_raw(bytes.to_vec())
-                        });
-
-                        match chunk_result {
-                            Ok(chunk) => {
-                                let size = info.size();
-                                *read_bytes += chunk.raw_size();
-                                decoder_pool.send((chunk, info.digest, size))?;
-                                *decoded_bytes += size;
+                        match proxmox_async::runtime::block_on(response.content.collect()) {
+                            Ok(raw_chunk) => {
+                                match DataBlob::from_raw(raw_chunk.to_bytes().to_vec()) {
+                                    Ok(chunk) => {
+                                        let size = info.size();
+                                        *read_bytes += chunk.raw_size();
+                                        decoder_pool.send((chunk, info.digest, size))?;
+                                        *decoded_bytes += size;
+                                    }
+                                    Err(err) => self.add_corrupt_chunk(
+                                        info.digest,
+                                        errors,
+                                        &format!(
+                                            "can't verify chunk with digest {} - {err}",
+                                            hex::encode(info.digest)
+                                        ),
+                                    ),
+                                }
                             }
                             Err(err) => {
                                 errors.fetch_add(1, Ordering::SeqCst);
-- 
2.47.3


