[pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction

all lists on lists.proxmox.com
 help / color / mirror / Atom feed

* [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction
@ 2025-10-16 13:18 Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

This patch series addresses issues with verification tasks on s3 backends.
Most importantly, it is assured that corrupt chunks are correctly evicted also
from the in-memory LRU cache, as otherwise new, non-corrupt chunk will not be
re-inserted into the chunk store if the digest is still cached as recently used.
Also, it refines the distinction between fetching error and chunk parsing error
handling, as the latter must flag the chunk as bad, the former not.

Further, avoid issues by concurrent chunk inserts and or garbage collection
while renaming corrupt chunks. The chunk store mutex lock was never acquired,
independent of the datastore backend, although being less problematic on
filesystem backends due to the atomic nature of the chunk file rename.

Finally, make sure that the mutex lock guarding the corrupt chunk list is not
held when entering the rename helper, as that performs async api calls to the
object store for datastores backed by S3, which could potentially lead to
deadlocks.

proxmox-backup:

Christian Ebner (6):
  verify/datastore: make rename corrupt chunk a datastore helper method
  datastore: refactor rename_corrupted_chunk error handling
  verify: never hold mutex lock in async scope on corrupt chunk rename
  datastore: acquire chunk store mutex lock when renaming corrupt chunk
  datastore: verify: evict corrupt chunks from in-memory LRU cache
  verify: distinguish s3 object fetching and chunk loading error

 pbs-datastore/src/datastore.rs |  63 ++++++++++++++++++
 src/backup/verify.rs           | 117 ++++++++-------------------------
 2 files changed, 92 insertions(+), 88 deletions(-)

Summary over all repositories:
  2 files changed, 92 insertions(+), 88 deletions(-)

-- 
Generated by git-murpp 0.8.1

_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

By making this a helper of the datastore, within this method it will
become possible to access the inner chunk store for locking ecc.
That  will be required to correctly lock the store to avoid
concurrent chunk inserts and garbage collection operations during the
rename, to guarantee consistency on datastores with s3 backend.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 70 +++++++++++++++++++++++++++++++
 src/backup/verify.rs           | 75 +---------------------------------
 2 files changed, 72 insertions(+), 73 deletions(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 038306166..802a39536 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2418,4 +2418,74 @@ impl DataStore {
             .map_err(|err| format_err!("{err:#}"))?;
         Ok((backend_type, Some(s3_client)))
     }
+
+    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) {
+        let (path, digest_str) = self.chunk_path(digest);
+
+        let mut counter = 0;
+        let mut new_path = path.clone();
+        loop {
+            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+            if new_path.exists() && counter < 9 {
+                counter += 1;
+            } else {
+                break;
+            }
+        }
+
+        let backend = match self.backend() {
+            Ok(backend) => backend,
+            Err(err) => {
+                info!(
+                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
+                );
+                return;
+            }
+        };
+
+        if let DatastoreBackend::S3(s3_client) = backend {
+            let suffix = format!(".{}.bad", counter);
+            let target_key = match crate::s3::object_key_from_digest_with_suffix(digest, &suffix) {
+                Ok(target_key) => target_key,
+                Err(err) => {
+                    info!("could not generate target key for corrupted chunk {path:?} - {err}");
+                    return;
+                }
+            };
+            let object_key = match crate::s3::object_key_from_digest(digest) {
+                Ok(object_key) => object_key,
+                Err(err) => {
+                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
+                    return;
+                }
+            };
+            if proxmox_async::runtime::block_on(
+                s3_client.copy_object(object_key.clone(), target_key),
+            )
+            .is_ok()
+            {
+                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
+                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
+                }
+            } else {
+                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
+                // Early return to leave the potentially locally cached chunk in the same state as
+                // on the object store. Verification might have failed because of connection issue
+                // after all.
+                return;
+            }
+        }
+
+        match std::fs::rename(&path, &new_path) {
+            Ok(_) => {
+                info!("corrupted chunk renamed to {:?}", &new_path);
+            }
+            Err(err) => {
+                match err.kind() {
+                    std::io::ErrorKind::NotFound => { /* ignored */ }
+                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
+                }
+            }
+        };
+    }
 }
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index bdbe3148b..92d3d9c49 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -76,77 +76,6 @@ impl VerifyWorker {
         }
     }
 
-    fn rename_corrupted_chunk(datastore: Arc<DataStore>, digest: &[u8; 32]) {
-        let (path, digest_str) = datastore.chunk_path(digest);
-
-        let mut counter = 0;
-        let mut new_path = path.clone();
-        loop {
-            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
-            if new_path.exists() && counter < 9 {
-                counter += 1;
-            } else {
-                break;
-            }
-        }
-
-        let backend = match datastore.backend() {
-            Ok(backend) => backend,
-            Err(err) => {
-                info!(
-                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
-                );
-                return;
-            }
-        };
-
-        if let DatastoreBackend::S3(s3_client) = backend {
-            let suffix = format!(".{}.bad", counter);
-            let target_key =
-                match pbs_datastore::s3::object_key_from_digest_with_suffix(digest, &suffix) {
-                    Ok(target_key) => target_key,
-                    Err(err) => {
-                        info!("could not generate target key for corrupted chunk {path:?} - {err}");
-                        return;
-                    }
-                };
-            let object_key = match pbs_datastore::s3::object_key_from_digest(digest) {
-                Ok(object_key) => object_key,
-                Err(err) => {
-                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            if proxmox_async::runtime::block_on(
-                s3_client.copy_object(object_key.clone(), target_key),
-            )
-            .is_ok()
-            {
-                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
-                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
-                }
-            } else {
-                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
-                // Early return to leave the potentially locally cached chunk in the same state as
-                // on the object store. Verification might have failed because of connection issue
-                // after all.
-                return;
-            }
-        }
-
-        match std::fs::rename(&path, &new_path) {
-            Ok(_) => {
-                info!("corrupted chunk renamed to {:?}", &new_path);
-            }
-            Err(err) => {
-                match err.kind() {
-                    std::io::ErrorKind::NotFound => { /* ignored */ }
-                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
-                }
-            }
-        };
-    }
-
     fn verify_index_chunks(
         &self,
         index: Box<dyn IndexFile + Send>,
@@ -189,7 +118,7 @@ impl VerifyWorker {
                     corrupt_chunks2.lock().unwrap().insert(digest);
                     info!("{err}");
                     errors2.fetch_add(1, Ordering::SeqCst);
-                    Self::rename_corrupted_chunk(datastore2.clone(), &digest);
+                    datastore2.rename_corrupted_chunk(&digest);
                 } else {
                     verified_chunks2.lock().unwrap().insert(digest);
                 }
@@ -336,7 +265,7 @@ impl VerifyWorker {
         corrupt_chunks.insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
-        Self::rename_corrupted_chunk(self.datastore.clone(), &digest);
+        self.datastore.rename_corrupted_chunk(&digest);
     }
 
     fn verify_fixed_index(&self, backup_dir: &BackupDir, info: &FileInfo) -> Result<(), Error> {
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

As part of the verification process, the helper was not intended to
return errors on failure but rather just log information and errors.

Refactoring the code so that the helper method returns errors and
an optional success message makes more concise and readable.

However, keep the logging as info at the callsite for both error and
success message logging to not interfere with the task log.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 85 ++++++++++++++--------------------
 src/backup/verify.rs           | 12 ++++-
 2 files changed, 44 insertions(+), 53 deletions(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 802a39536..c280b82c7 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2419,13 +2419,13 @@ impl DataStore {
         Ok((backend_type, Some(s3_client)))
     }
 
-    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) {
+    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) -> Result<Option<String>, Error> {
         let (path, digest_str) = self.chunk_path(digest);
 
         let mut counter = 0;
         let mut new_path = path.clone();
         loop {
-            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+            new_path.set_file_name(format!("{digest_str}.{counter}.bad"));
             if new_path.exists() && counter < 9 {
                 counter += 1;
             } else {
@@ -2433,59 +2433,42 @@ impl DataStore {
             }
         }
 
-        let backend = match self.backend() {
-            Ok(backend) => backend,
-            Err(err) => {
-                info!(
-                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
-                );
-                return;
-            }
-        };
+        let backend = self.backend().map_err(|err| {
+            format_err!(
+                "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
+            )
+        })?;
 
         if let DatastoreBackend::S3(s3_client) = backend {
-            let suffix = format!(".{}.bad", counter);
-            let target_key = match crate::s3::object_key_from_digest_with_suffix(digest, &suffix) {
-                Ok(target_key) => target_key,
-                Err(err) => {
-                    info!("could not generate target key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            let object_key = match crate::s3::object_key_from_digest(digest) {
-                Ok(object_key) => object_key,
-                Err(err) => {
-                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
-                    return;
-                }
-            };
-            if proxmox_async::runtime::block_on(
-                s3_client.copy_object(object_key.clone(), target_key),
-            )
-            .is_ok()
-            {
-                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
-                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
-                }
-            } else {
-                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
-                // Early return to leave the potentially locally cached chunk in the same state as
-                // on the object store. Verification might have failed because of connection issue
-                // after all.
-                return;
-            }
+            let suffix = format!(".{counter}.bad");
+            let target_key = crate::s3::object_key_from_digest_with_suffix(digest, &suffix)
+                .map_err(|err| {
+                    format_err!(
+                        "could not generate target key for corrupted chunk {path:?} - {err}"
+                    )
+                })?;
+            let object_key = crate::s3::object_key_from_digest(digest).map_err(|err| {
+                format_err!("could not generate object key for corrupted chunk {path:?} - {err}")
+            })?;
+
+            proxmox_async::runtime::block_on(s3_client.copy_object(object_key.clone(), target_key))
+                .map_err(|err| {
+                    format_err!("failed to copy corrupt chunk on s3 backend: {digest_str} - {err}")
+                })?;
+
+            proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).map_err(
+                |err| {
+                    format_err!(
+                        "failed to delete corrupt chunk on s3 backend: {digest_str} - {err}"
+                    )
+                },
+            )?;
         }
 
         match std::fs::rename(&path, &new_path) {
-            Ok(_) => {
-                info!("corrupted chunk renamed to {:?}", &new_path);
-            }
-            Err(err) => {
-                match err.kind() {
-                    std::io::ErrorKind::NotFound => { /* ignored */ }
-                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
-                }
-            }
-        };
+            Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
+            Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
+            Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
+        }
     }
 }
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 92d3d9c49..39f36cd95 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -118,7 +118,11 @@ impl VerifyWorker {
                     corrupt_chunks2.lock().unwrap().insert(digest);
                     info!("{err}");
                     errors2.fetch_add(1, Ordering::SeqCst);
-                    datastore2.rename_corrupted_chunk(&digest);
+                    match datastore2.rename_corrupted_chunk(&digest) {
+                        Ok(Some(message)) => info!("{message}"),
+                        Err(err) => info!("{err}"),
+                        _ => (),
+                    }
                 } else {
                     verified_chunks2.lock().unwrap().insert(digest);
                 }
@@ -265,7 +269,11 @@ impl VerifyWorker {
         corrupt_chunks.insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
-        self.datastore.rename_corrupted_chunk(&digest);
+        match self.datastore.rename_corrupted_chunk(&digest) {
+            Ok(Some(message)) => info!("{message}"),
+            Err(err) => info!("{err}"),
+            _ => (),
+        }
     }
 
     fn verify_fixed_index(&self, backup_dir: &BackupDir, info: &FileInfo) -> Result<(), Error> {
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
@ 2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-27 11:36     ` Christian Ebner
  0 siblings, 1 reply; 16+ messages in thread
From: Fabian Grünbichler @ 2025-10-27 10:59 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On October 16, 2025 3:18 pm, Christian Ebner wrote:
> As part of the verification process, the helper was not intended to
> return errors on failure but rather just log information and errors.
> 
> Refactoring the code so that the helper method returns errors and
> an optional success message makes more concise and readable.
> However, keep the logging as info at the callsite for both error and
> success message logging to not interfere with the task log.

following this logic, I think we should not return an info-level message
as string in this datastore interface, but regular data with meaning,
see below for some suggestions..

> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 85 ++++++++++++++--------------------
>  src/backup/verify.rs           | 12 ++++-
>  2 files changed, 44 insertions(+), 53 deletions(-)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index 802a39536..c280b82c7 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -2419,13 +2419,13 @@ impl DataStore {
>          Ok((backend_type, Some(s3_client)))
>      }
>  
> -    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) {
> +    pub fn rename_corrupted_chunk(&self, digest: &[u8; 32]) -> Result<Option<String>, Error> {
>          let (path, digest_str) = self.chunk_path(digest);
>  
>          let mut counter = 0;
>          let mut new_path = path.clone();
>          loop {
> -            new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
> +            new_path.set_file_name(format!("{digest_str}.{counter}.bad"));
>              if new_path.exists() && counter < 9 {
>                  counter += 1;
>              } else {
> @@ -2433,59 +2433,42 @@ impl DataStore {
>              }
>          }
>  
> -        let backend = match self.backend() {
> -            Ok(backend) => backend,
> -            Err(err) => {
> -                info!(
> -                    "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
> -                );
> -                return;
> -            }
> -        };
> +        let backend = self.backend().map_err(|err| {
> +            format_err!(
> +                "failed to get backend while trying to rename bad chunk: {digest_str} - {err}"
> +            )
> +        })?;
>  
>          if let DatastoreBackend::S3(s3_client) = backend {
> -            let suffix = format!(".{}.bad", counter);
> -            let target_key = match crate::s3::object_key_from_digest_with_suffix(digest, &suffix) {
> -                Ok(target_key) => target_key,
> -                Err(err) => {
> -                    info!("could not generate target key for corrupted chunk {path:?} - {err}");
> -                    return;
> -                }
> -            };
> -            let object_key = match crate::s3::object_key_from_digest(digest) {
> -                Ok(object_key) => object_key,
> -                Err(err) => {
> -                    info!("could not generate object key for corrupted chunk {path:?} - {err}");
> -                    return;
> -                }
> -            };
> -            if proxmox_async::runtime::block_on(
> -                s3_client.copy_object(object_key.clone(), target_key),
> -            )
> -            .is_ok()
> -            {
> -                if proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).is_err() {
> -                    info!("failed to delete corrupt chunk on s3 backend: {digest_str}");
> -                }
> -            } else {
> -                info!("failed to copy corrupt chunk on s3 backend: {digest_str}");
> -                // Early return to leave the potentially locally cached chunk in the same state as
> -                // on the object store. Verification might have failed because of connection issue
> -                // after all.
> -                return;
> -            }
> +            let suffix = format!(".{counter}.bad");
> +            let target_key = crate::s3::object_key_from_digest_with_suffix(digest, &suffix)
> +                .map_err(|err| {
> +                    format_err!(
> +                        "could not generate target key for corrupted chunk {path:?} - {err}"

nit: while we're at it, could we please get rid of the "corrupted" here
in favor of "corrupt", for consistency's sake? :)

> +                    )
> +                })?;
> +            let object_key = crate::s3::object_key_from_digest(digest).map_err(|err| {
> +                format_err!("could not generate object key for corrupted chunk {path:?} - {err}")

same here

> +            })?;
> +
> +            proxmox_async::runtime::block_on(s3_client.copy_object(object_key.clone(), target_key))
> +                .map_err(|err| {
> +                    format_err!("failed to copy corrupt chunk on s3 backend: {digest_str} - {err}")
> +                })?;
> +
> +            proxmox_async::runtime::block_on(s3_client.delete_object(object_key)).map_err(
> +                |err| {
> +                    format_err!(
> +                        "failed to delete corrupt chunk on s3 backend: {digest_str} - {err}"
> +                    )
> +                },
> +            )?;
>          }
>  
>          match std::fs::rename(&path, &new_path) {
> -            Ok(_) => {
> -                info!("corrupted chunk renamed to {:?}", &new_path);
> -            }
> -            Err(err) => {
> -                match err.kind() {
> -                    std::io::ErrorKind::NotFound => { /* ignored */ }
> -                    _ => info!("could not rename corrupted chunk {:?} - {err}", &path),
> -                }
> -            }
> -        };
> +            Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),

this should return one of the following:
- (true, new_path): renamed, here's the path if you need it
- (true, Some(new_path)): renamed, here's the path if you need it
- Some(new_path): new path, encoding that it got renamed by virtue of it
  being Some

> +            Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),

correspondingly, this should return one of the following:
(false, new_path) or (false, None) or None

> +            Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
> +        }
>      }
>  }
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 92d3d9c49..39f36cd95 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -118,7 +118,11 @@ impl VerifyWorker {
>                      corrupt_chunks2.lock().unwrap().insert(digest);
>                      info!("{err}");
>                      errors2.fetch_add(1, Ordering::SeqCst);
> -                    datastore2.rename_corrupted_chunk(&digest);
> +                    match datastore2.rename_corrupted_chunk(&digest) {
> +                        Ok(Some(message)) => info!("{message}"),
> +                        Err(err) => info!("{err}"),
> +                        _ => (),
> +                    }
>                  } else {
>                      verified_chunks2.lock().unwrap().insert(digest);
>                  }
> @@ -265,7 +269,11 @@ impl VerifyWorker {
>          corrupt_chunks.insert(digest);
>          error!(message);
>          errors.fetch_add(1, Ordering::SeqCst);
> -        self.datastore.rename_corrupted_chunk(&digest);
> +        match self.datastore.rename_corrupted_chunk(&digest) {
> +            Ok(Some(message)) => info!("{message}"),
> +            Err(err) => info!("{err}"),
> +            _ => (),
> +        }
>      }
>  
>      fn verify_fixed_index(&self, backup_dir: &BackupDir, info: &FileInfo) -> Result<(), Error> {
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling
  2025-10-27 10:59   ` Fabian Grünbichler
@ 2025-10-27 11:36     ` Christian Ebner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-27 11:36 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 10/27/25 11:59 AM, Fabian Grünbichler wrote:
> On October 16, 2025 3:18 pm, Christian Ebner wrote:
>> As part of the verification process, the helper was not intended to
>> return errors on failure but rather just log information and errors.
>>
>> Refactoring the code so that the helper method returns errors and
>> an optional success message makes more concise and readable.
>> However, keep the logging as info at the callsite for both error and
>> success message logging to not interfere with the task log.
> 
> following this logic, I think we should not return an info-level message
> as string in this datastore interface, but regular data with meaning,
> see below for some suggestions..

Okay, will refactor this based on your suggestions for the v2 then.


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Holding a mutex lock across async await boundaries is prone to
deadlock [0]. Renaming a corrupt chunk requires however async API
calls in case of datastores backed by S3.

Fix this by simply not hold onto the mutex lock guarding the corrupt
chunk list during chunk verification tasks when calling the rename
method. If the chunk is already present in this list, there will be
no other verification task operating on that exact chunk anyways.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/backup/verify.rs | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 39f36cd95..b1066f6f5 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -265,8 +265,7 @@ impl VerifyWorker {
 
     fn add_corrupt_chunk(&self, digest: [u8; 32], errors: Arc<AtomicUsize>, message: &str) {
         // Panic on poisoned mutex
-        let mut corrupt_chunks = self.corrupt_chunks.lock().unwrap();
-        corrupt_chunks.insert(digest);
+        self.corrupt_chunks.lock().unwrap().insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
         match self.datastore.rename_corrupted_chunk(&digest) {
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
@ 2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-27 11:37     ` Christian Ebner
  0 siblings, 1 reply; 16+ messages in thread
From: Fabian Grünbichler @ 2025-10-27 10:59 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On October 16, 2025 3:18 pm, Christian Ebner wrote:
> Holding a mutex lock across async await boundaries is prone to
> deadlock [0]. Renaming a corrupt chunk requires however async API
> calls in case of datastores backed by S3.
> 
> Fix this by simply not hold onto the mutex lock guarding the corrupt
> chunk list during chunk verification tasks when calling the rename
> method. If the chunk is already present in this list, there will be
> no other verification task operating on that exact chunk anyways.
> 
> [0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>

please send this one as patch #1, it fixes a bug and is independent from
the rest of the cleanup, AFAICT?

> ---
>  src/backup/verify.rs | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 39f36cd95..b1066f6f5 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -265,8 +265,7 @@ impl VerifyWorker {
>  
>      fn add_corrupt_chunk(&self, digest: [u8; 32], errors: Arc<AtomicUsize>, message: &str) {
>          // Panic on poisoned mutex
> -        let mut corrupt_chunks = self.corrupt_chunks.lock().unwrap();
> -        corrupt_chunks.insert(digest);
> +        self.corrupt_chunks.lock().unwrap().insert(digest);
>          error!(message);
>          errors.fetch_add(1, Ordering::SeqCst);
>          match self.datastore.rename_corrupted_chunk(&digest) {
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename
  2025-10-27 10:59   ` Fabian Grünbichler
@ 2025-10-27 11:37     ` Christian Ebner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-27 11:37 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 10/27/25 12:00 PM, Fabian Grünbichler wrote:
> On October 16, 2025 3:18 pm, Christian Ebner wrote:
>> Holding a mutex lock across async await boundaries is prone to
>> deadlock [0]. Renaming a corrupt chunk requires however async API
>> calls in case of datastores backed by S3.
>>
>> Fix this by simply not hold onto the mutex lock guarding the corrupt
>> chunk list during chunk verification tasks when calling the rename
>> method. If the chunk is already present in this list, there will be
>> no other verification task operating on that exact chunk anyways.
>>
>> [0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> 
> please send this one as patch #1, it fixes a bug and is independent from
> the rest of the cleanup, AFAICT?

Yes, will be up-front in the v2.


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (2 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
  5 siblings, 1 reply; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

While the rename itself is an atomic operation, it must be assured
that no other task such as garbage collection or backup chunk insert
are expecting to hold an exclusive access to the chunk store.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index c280b82c7..a7ea8fd96 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2465,6 +2465,8 @@ impl DataStore {
             )?;
         }
 
+        let _lock = self.inner.chunk_store.mutex().lock().unwrap();
+
         match std::fs::rename(&path, &new_path) {
             Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
             Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
@ 2025-10-27 10:59   ` Fabian Grünbichler
  0 siblings, 0 replies; 16+ messages in thread
From: Fabian Grünbichler @ 2025-10-27 10:59 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On October 16, 2025 3:18 pm, Christian Ebner wrote:
> While the rename itself is an atomic operation, it must be assured
> that no other task such as garbage collection or backup chunk insert
> are expecting to hold an exclusive access to the chunk store.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index c280b82c7..a7ea8fd96 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -2465,6 +2465,8 @@ impl DataStore {
>              )?;
>          }
>  
> +        let _lock = self.inner.chunk_store.mutex().lock().unwrap();
> +

the counter/new_path loop should move here to also be protected by the
lock - it doesn't buy as much, but it's better than nothing..

but it's also used for S3, without anything ensuring that the counters
are actually in sync between those two?

>          match std::fs::rename(&path, &new_path) {
>              Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
>              Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (3 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
  5 siblings, 1 reply; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Chunks detected as corrupt have been renamed on both, the S3 backend
and the local datastore cache, but not evicted from the in-memory
cache containing the LRU chunk digests. This can lead to the chunks
being considered as already present if their digest is still cached,
and therefore not being re-inserted in the local store cache and S3
backend on backup upload.

Fix this by not only renaming the local datastore's chunk marker
file, but also removing it from the in-memory cache while holding the
chunk store mutex lock to exclude interference from concurrent chunk
inserts.

Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 pbs-datastore/src/datastore.rs | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index a7ea8fd96..c551679e3 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -2467,10 +2467,18 @@ impl DataStore {
 
         let _lock = self.inner.chunk_store.mutex().lock().unwrap();
 
-        match std::fs::rename(&path, &new_path) {
+        let result = match std::fs::rename(&path, &new_path) {
             Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
             Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
             Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
+        };
+
+        if let Some(cache) = self.cache() {
+            // Requiremets for calling the unsafe method are met, since snapshots referencing the
+            // corrupt chunk are to be considered corrupt. Ignore the error due to the missing file.
+            let _ = unsafe { cache.remove(digest) };
         }
+
+        result
     }
 }
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
@ 2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-27 11:53     ` Christian Ebner
  0 siblings, 1 reply; 16+ messages in thread
From: Fabian Grünbichler @ 2025-10-27 10:59 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On October 16, 2025 3:18 pm, Christian Ebner wrote:
> Chunks detected as corrupt have been renamed on both, the S3 backend
> and the local datastore cache, but not evicted from the in-memory
> cache containing the LRU chunk digests. This can lead to the chunks
> being considered as already present if their digest is still cached,
> and therefore not being re-inserted in the local store cache and S3
> backend on backup upload.
> 
> Fix this by not only renaming the local datastore's chunk marker
> file, but also removing it from the in-memory cache while holding the
> chunk store mutex lock to exclude interference from concurrent chunk
> inserts.
> 
> Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index a7ea8fd96..c551679e3 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -2467,10 +2467,18 @@ impl DataStore {
>  
>          let _lock = self.inner.chunk_store.mutex().lock().unwrap();
>  
> -        match std::fs::rename(&path, &new_path) {
> +        let result = match std::fs::rename(&path, &new_path) {
>              Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
>              Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
>              Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
> +        };
> +
> +        if let Some(cache) = self.cache() {
> +            // Requiremets for calling the unsafe method are met, since snapshots referencing the
> +            // corrupt chunk are to be considered corrupt. Ignore the error due to the missing file.
> +            let _ = unsafe { cache.remove(digest) };
>          }

not exactly related to this patch, but something I noticed while
reviewing it:

how do we protect against concurrent actions on the same chunk between
- verification detects chunk as corrupt
- rename on S3 level
- lock + rename and remove on chunk store/cache level

let's say to verification tasks and a backup writer race, all involving
the same chunk:

- verification A and B detect chunk as corrupt
- verification A renames it on the S3 level
- verification A renames it on the chunk store level and removes it from
  the cache
- backup writer uploads it to S3
- verification B renames it on the S3 level

and now either:
- backup writer inserts it into the cache
- verification B renames it on the chunk store level and removes it from
  the cache

which means the backup writer creates a corrupt, but not detected as
such backup

or
- verification B attempts to rename it on the chunk store level
  (ENOTFOUND) and removes it from the cache
- backup writer inserts it into the cache

which means we just broke cache coherency - the chunk exists in the
chunk store and cache, but not on S3 (anymore)


or another racey variant:
- verification A and B detect chunk as corrupt
- verification A renames it on the S3 level
- verification B renames it on the S3 level but fails because it doesn't
  exist there??

I think we might need another synchronization mechanism for verification
involving S3..

> +
> +        result
>      }
>  }
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache
  2025-10-27 10:59   ` Fabian Grünbichler
@ 2025-10-27 11:53     ` Christian Ebner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-27 11:53 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 10/27/25 11:59 AM, Fabian Grünbichler wrote:
> On October 16, 2025 3:18 pm, Christian Ebner wrote:
>> Chunks detected as corrupt have been renamed on both, the S3 backend
>> and the local datastore cache, but not evicted from the in-memory
>> cache containing the LRU chunk digests. This can lead to the chunks
>> being considered as already present if their digest is still cached,
>> and therefore not being re-inserted in the local store cache and S3
>> backend on backup upload.
>>
>> Fix this by not only renaming the local datastore's chunk marker
>> file, but also removing it from the in-memory cache while holding the
>> chunk store mutex lock to exclude interference from concurrent chunk
>> inserts.
>>
>> Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>>   pbs-datastore/src/datastore.rs | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index a7ea8fd96..c551679e3 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -2467,10 +2467,18 @@ impl DataStore {
>>   
>>           let _lock = self.inner.chunk_store.mutex().lock().unwrap();
>>   
>> -        match std::fs::rename(&path, &new_path) {
>> +        let result = match std::fs::rename(&path, &new_path) {
>>               Ok(_) => Ok(Some(format!("corrupted chunk renamed to {new_path:?}"))),
>>               Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(None),
>>               Err(err) => bail!("could not rename corrupted chunk {path:?} - {err}"),
>> +        };
>> +
>> +        if let Some(cache) = self.cache() {
>> +            // Requiremets for calling the unsafe method are met, since snapshots referencing the
>> +            // corrupt chunk are to be considered corrupt. Ignore the error due to the missing file.
>> +            let _ = unsafe { cache.remove(digest) };
>>           }
> 
> not exactly related to this patch, but something I noticed while
> reviewing it:
> 
> how do we protect against concurrent actions on the same chunk between
> - verification detects chunk as corrupt
> - rename on S3 level
> - lock + rename and remove on chunk store/cache level
> 
> let's say to verification tasks and a backup writer race, all involving
> the same chunk:
> 
> - verification A and B detect chunk as corrupt
> - verification A renames it on the S3 level
> - verification A renames it on the chunk store level and removes it from
>    the cache
> - backup writer uploads it to S3
> - verification B renames it on the S3 level
> 
> and now either:
> - backup writer inserts it into the cache
> - verification B renames it on the chunk store level and removes it from
>    the cache
> 
> which means the backup writer creates a corrupt, but not detected as
> such backup
> 
> or
> - verification B attempts to rename it on the chunk store level
>    (ENOTFOUND) and removes it from the cache
> - backup writer inserts it into the cache
> 
> which means we just broke cache coherency - the chunk exists in the
> chunk store and cache, but not on S3 (anymore)
> 
> 
> or another racey variant:
> - verification A and B detect chunk as corrupt
> - verification A renames it on the S3 level
> - verification B renames it on the S3 level but fails because it doesn't
>    exist there??
> 
> I think we might need another synchronization mechanism for verification
> involving S3..

only option I see here is if the rename (copyObject) can only happen 
conditionally, e.g. via the x-amz-copy-source-if-unmodified-since header

Which might however not be available for all providers.


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error
  2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
                   ` (4 preceding siblings ...)
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
@ 2025-10-16 13:18 ` Christian Ebner
  2025-10-27 10:59   ` Fabian Grünbichler
  5 siblings, 1 reply; 16+ messages in thread
From: Christian Ebner @ 2025-10-16 13:18 UTC (permalink / raw)
  To: pbs-devel

Errors while loading chunks from the object store might be cause by
transient issues, and must therefore handled so they do not
incorrectly mark chunks as corrupt.

On creating the chunk from the response data, which includes the
chunk header and validity checks, errors must however lead to the
chunk being flagged as bad. Adapt the code so these errors are
correctly distinguished.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/backup/verify.rs | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index b1066f6f5..4cfd81350 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -225,19 +225,24 @@ impl VerifyWorker {
                 let object_key = pbs_datastore::s3::object_key_from_digest(&info.digest)?;
                 match proxmox_async::runtime::block_on(s3_client.get_object(object_key)) {
                     Ok(Some(response)) => {
-                        let chunk_result = proxmox_lang::try_block!({
-                            let bytes =
-                                proxmox_async::runtime::block_on(response.content.collect())?
-                                    .to_bytes();
-                            DataBlob::from_raw(bytes.to_vec())
-                        });
-
-                        match chunk_result {
-                            Ok(chunk) => {
-                                let size = info.size();
-                                *read_bytes += chunk.raw_size();
-                                decoder_pool.send((chunk, info.digest, size))?;
-                                *decoded_bytes += size;
+                        match proxmox_async::runtime::block_on(response.content.collect()) {
+                            Ok(raw_chunk) => {
+                                match DataBlob::from_raw(raw_chunk.to_bytes().to_vec()) {
+                                    Ok(chunk) => {
+                                        let size = info.size();
+                                        *read_bytes += chunk.raw_size();
+                                        decoder_pool.send((chunk, info.digest, size))?;
+                                        *decoded_bytes += size;
+                                    }
+                                    Err(err) => self.add_corrupt_chunk(
+                                        info.digest,
+                                        errors,
+                                        &format!(
+                                            "can't verify chunk with digest {} - {err}",
+                                            hex::encode(info.digest)
+                                        ),
+                                    ),
+                                }
                             }
                             Err(err) => {
                                 errors.fetch_add(1, Ordering::SeqCst);
-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error
  2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
@ 2025-10-27 10:59   ` Fabian Grünbichler
  2025-10-27 11:54     ` Christian Ebner
  0 siblings, 1 reply; 16+ messages in thread
From: Fabian Grünbichler @ 2025-10-27 10:59 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On October 16, 2025 3:18 pm, Christian Ebner wrote:
> Errors while loading chunks from the object store might be cause by
> transient issues, and must therefore handled so they do not
> incorrectly mark chunks as corrupt.
> 
> On creating the chunk from the response data, which includes the
> chunk header and validity checks, errors must however lead to the
> chunk being flagged as bad. Adapt the code so these errors are
> correctly distinguished.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---

this patch is also independent from the rest and fixes a known bug,
right? if so, please order it up front and/or say so, then it can be
applied independently from the bigger refactoring..

>  src/backup/verify.rs | 31 ++++++++++++++++++-------------
>  1 file changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index b1066f6f5..4cfd81350 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -225,19 +225,24 @@ impl VerifyWorker {
>                  let object_key = pbs_datastore::s3::object_key_from_digest(&info.digest)?;
>                  match proxmox_async::runtime::block_on(s3_client.get_object(object_key)) {
>                      Ok(Some(response)) => {
> -                        let chunk_result = proxmox_lang::try_block!({
> -                            let bytes =
> -                                proxmox_async::runtime::block_on(response.content.collect())?
> -                                    .to_bytes();
> -                            DataBlob::from_raw(bytes.to_vec())
> -                        });
> -
> -                        match chunk_result {
> -                            Ok(chunk) => {
> -                                let size = info.size();
> -                                *read_bytes += chunk.raw_size();
> -                                decoder_pool.send((chunk, info.digest, size))?;
> -                                *decoded_bytes += size;
> +                        match proxmox_async::runtime::block_on(response.content.collect()) {
> +                            Ok(raw_chunk) => {
> +                                match DataBlob::from_raw(raw_chunk.to_bytes().to_vec()) {
> +                                    Ok(chunk) => {
> +                                        let size = info.size();
> +                                        *read_bytes += chunk.raw_size();
> +                                        decoder_pool.send((chunk, info.digest, size))?;
> +                                        *decoded_bytes += size;
> +                                    }
> +                                    Err(err) => self.add_corrupt_chunk(
> +                                        info.digest,
> +                                        errors,
> +                                        &format!(
> +                                            "can't verify chunk with digest {} - {err}",
> +                                            hex::encode(info.digest)
> +                                        ),
> +                                    ),
> +                                }
>                              }
>                              Err(err) => {
>                                  errors.fetch_add(1, Ordering::SeqCst);
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error
  2025-10-27 10:59   ` Fabian Grünbichler
@ 2025-10-27 11:54     ` Christian Ebner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Ebner @ 2025-10-27 11:54 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 10/27/25 11:59 AM, Fabian Grünbichler wrote:
> On October 16, 2025 3:18 pm, Christian Ebner wrote:
>> Errors while loading chunks from the object store might be cause by
>> transient issues, and must therefore handled so they do not
>> incorrectly mark chunks as corrupt.
>>
>> On creating the chunk from the response data, which includes the
>> chunk header and validity checks, errors must however lead to the
>> chunk being flagged as bad. Adapt the code so these errors are
>> correctly distinguished.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
> 
> this patch is also independent from the rest and fixes a known bug,
> right? if so, please order it up front and/or say so, then it can be
> applied independently from the bigger refactoring..

okay, will bring this to the front as well.


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-10-27 11:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-16 13:18 [pbs-devel] [PATCH proxmox-backup 0/6] s3 store verify: fix concurrency issues and add missing in-memory cache eviction Christian Ebner
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/6] verify/datastore: make rename corrupt chunk a datastore helper method Christian Ebner
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/6] datastore: refactor rename_corrupted_chunk error handling Christian Ebner
2025-10-27 10:59   ` Fabian Grünbichler
2025-10-27 11:36     ` Christian Ebner
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/6] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
2025-10-27 10:59   ` Fabian Grünbichler
2025-10-27 11:37     ` Christian Ebner
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 4/6] datastore: acquire chunk store mutex lock when renaming corrupt chunk Christian Ebner
2025-10-27 10:59   ` Fabian Grünbichler
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 5/6] datastore: verify: evict corrupt chunks from in-memory LRU cache Christian Ebner
2025-10-27 10:59   ` Fabian Grünbichler
2025-10-27 11:53     ` Christian Ebner
2025-10-16 13:18 ` [pbs-devel] [PATCH proxmox-backup 6/6] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
2025-10-27 10:59   ` Fabian Grünbichler
2025-10-27 11:54     ` Christian Ebner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal