public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies
@ 2025-10-29 11:06 Christian Ebner
  2025-10-29 11:06 ` [pbs-devel] [PATCH proxmox-backup 1/2] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Christian Ebner @ 2025-10-29 11:06 UTC (permalink / raw)
  To: pbs-devel

These patches were pulled out of the original patch series [0], since
they are independent of the bigger series attempting to fix the
possible race between corrupt chunk renaming and chunk insert/upload,
and are better reviewed/tested independently.

Patch 1 makes sure the mutex guard synchronizing access to the corrupt
chunk list is dropped before attempting to rename a corrupt chunk,
which calls into async context on s3 stores. Otherwise a deadlock
can arise.
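
As a rough illustration of the guard lifetime involved (a minimal
sketch using only std, not the actual VerifyWorker code; `Worker` and
`rename_stand_in` are hypothetical names, and `try_lock` merely stands
in for "is the lock still held while the rename runs?"):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical stand-in for the verify worker; only the corrupt
/// chunk list is modelled here.
#[derive(Default)]
struct Worker {
    corrupt_chunks: Mutex<HashSet<[u8; 32]>>,
}

impl Worker {
    /// The guard is a temporary, so it is dropped at the end of the
    /// insert statement, before the rename stand-in is called.
    fn add_corrupt_chunk(&self, digest: [u8; 32]) -> bool {
        self.corrupt_chunks.lock().unwrap().insert(digest);
        self.rename_stand_in()
    }

    /// The guard is bound to a local variable, so it lives until the
    /// end of the function and is still held during the rename stand-in.
    fn add_corrupt_chunk_holding_guard(&self, digest: [u8; 32]) -> bool {
        let mut corrupt_chunks = self.corrupt_chunks.lock().unwrap();
        corrupt_chunks.insert(digest);
        self.rename_stand_in()
    }

    /// Stand-in for the rename call: reports whether the lock is free
    /// at this point. The real rename does not touch the list; this
    /// only makes the guard lifetime observable.
    fn rename_stand_in(&self) -> bool {
        self.corrupt_chunks.try_lock().is_ok()
    }
}
```

With the temporary guard the stand-in sees a free lock, while with the
named guard it does not, which is the situation patch 1 avoids.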

Patch 2 is a followup to the bugfix for issue #6665, which however
did not correctly distinguish between transient fetching errors and
the possible chunk DataBlob decoding error from the response body in
case of a successful response.
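
The intended distinction can be sketched roughly as follows (an
illustrative model only: `Outcome`, `decode_blob`, `classify` and the
`MAGIC` value are made up for this example and do not reflect the
real DataBlob format or the s3 client API):

```rust
/// Hypothetical result of handling one chunk during verification.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Chunk decoded fine; carries the payload length.
    Verified(usize),
    /// Fetching failed, possibly transient: count an error, but
    /// leave the chunk untouched.
    FetchFailed(String),
    /// The response body could not be decoded: flag the chunk as bad.
    Corrupt(String),
}

/// Made-up magic number, only for this sketch.
const MAGIC: [u8; 4] = *b"BLOB";

/// Stand-in for the header and validity checks done when creating a
/// chunk from raw bytes.
fn decode_blob(raw: &[u8]) -> Result<&[u8], String> {
    if raw.len() < MAGIC.len() {
        return Err(format!("blob too small ({} bytes)", raw.len()));
    }
    let (magic, payload) = raw.split_at(MAGIC.len());
    if magic != &MAGIC[..] {
        return Err("unexpected magic number".to_string());
    }
    Ok(payload)
}

/// Keep the two failure sources apart: only decoding errors mark
/// the chunk as corrupt.
fn classify(fetched: Result<Vec<u8>, String>) -> Outcome {
    match fetched {
        Err(err) => Outcome::FetchFailed(err),
        Ok(raw) => match decode_blob(&raw) {
            Ok(payload) => Outcome::Verified(payload.len()),
            Err(err) => Outcome::Corrupt(err),
        },
    }
}
```

Here a failed fetch never reaches the decoding step, so it cannot end
up in the `Corrupt` case.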

[0] https://lore.proxmox.com/pbs-devel/20251016131819.349049-6-c.ebner@proxmox.com/T/

Christian Ebner (2):
  verify: never hold mutex lock in async scope on corrupt chunk rename
  verify: distinguish s3 object fetching and chunk loading error

 src/backup/verify.rs | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

-- 
2.47.3



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 1/2] verify: never hold mutex lock in async scope on corrupt chunk rename
  2025-10-29 11:06 [pbs-devel] [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies Christian Ebner
@ 2025-10-29 11:06 ` Christian Ebner
  2025-10-29 11:06 ` [pbs-devel] [PATCH proxmox-backup 2/2] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
  2025-11-11 13:57 ` [pbs-devel] applied-series: [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies Fabian Grünbichler
  2 siblings, 0 replies; 4+ messages in thread
From: Christian Ebner @ 2025-10-29 11:06 UTC (permalink / raw)
  To: pbs-devel

Holding a mutex lock across async await boundaries is prone to
deadlock [0]. Renaming a corrupt chunk, however, requires async API
calls in case of datastores backed by S3.

Fix this by simply not holding onto the mutex lock guarding the corrupt
chunk list during chunk verification tasks when calling the rename
method. If the chunk is already present in this list, there will be
no other verification task operating on that exact chunk anyway.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Originally on list as https://lore.proxmox.com/pbs-devel/20251016131819.349049-4-c.ebner@proxmox.com/

No changes to that patch.

 src/backup/verify.rs | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index bdbe3148b..7172e81e1 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -332,8 +332,7 @@ impl VerifyWorker {
 
     fn add_corrupt_chunk(&self, digest: [u8; 32], errors: Arc<AtomicUsize>, message: &str) {
         // Panic on poisoned mutex
-        let mut corrupt_chunks = self.corrupt_chunks.lock().unwrap();
-        corrupt_chunks.insert(digest);
+        self.corrupt_chunks.lock().unwrap().insert(digest);
         error!(message);
         errors.fetch_add(1, Ordering::SeqCst);
         Self::rename_corrupted_chunk(self.datastore.clone(), &digest);
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 2/2] verify: distinguish s3 object fetching and chunk loading error
  2025-10-29 11:06 [pbs-devel] [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies Christian Ebner
  2025-10-29 11:06 ` [pbs-devel] [PATCH proxmox-backup 1/2] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
@ 2025-10-29 11:06 ` Christian Ebner
  2025-11-11 13:57 ` [pbs-devel] applied-series: [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies Fabian Grünbichler
  2 siblings, 0 replies; 4+ messages in thread
From: Christian Ebner @ 2025-10-29 11:06 UTC (permalink / raw)
  To: pbs-devel

Errors while loading chunks from the object store might be caused by
transient issues, and must therefore be handled so they do not
incorrectly mark chunks as corrupt.

Errors on creating the chunk from the response data, which includes
the chunk header and validity checks, must however lead to the
chunk being flagged as bad. Adapt the code so these errors are
correctly distinguished.

This is a followup to commit 3c350f35 ("fix #6665: never rename
chunks on s3 client fetch errors") which did not take that into
account.

Fixes: 3c350f35 ("fix #6665: never rename chunks on s3 client fetch errors")
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Originally on list as https://lore.proxmox.com/pbs-devel/20251016131819.349049-7-c.ebner@proxmox.com/

Extended the commit message to include the commit ref since the
original version of the patch.

 src/backup/verify.rs | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 7172e81e1..f01345b04 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -292,19 +292,24 @@ impl VerifyWorker {
                 let object_key = pbs_datastore::s3::object_key_from_digest(&info.digest)?;
                 match proxmox_async::runtime::block_on(s3_client.get_object(object_key)) {
                     Ok(Some(response)) => {
-                        let chunk_result = proxmox_lang::try_block!({
-                            let bytes =
-                                proxmox_async::runtime::block_on(response.content.collect())?
-                                    .to_bytes();
-                            DataBlob::from_raw(bytes.to_vec())
-                        });
-
-                        match chunk_result {
-                            Ok(chunk) => {
-                                let size = info.size();
-                                *read_bytes += chunk.raw_size();
-                                decoder_pool.send((chunk, info.digest, size))?;
-                                *decoded_bytes += size;
+                        match proxmox_async::runtime::block_on(response.content.collect()) {
+                            Ok(raw_chunk) => {
+                                match DataBlob::from_raw(raw_chunk.to_bytes().to_vec()) {
+                                    Ok(chunk) => {
+                                        let size = info.size();
+                                        *read_bytes += chunk.raw_size();
+                                        decoder_pool.send((chunk, info.digest, size))?;
+                                        *decoded_bytes += size;
+                                    }
+                                    Err(err) => self.add_corrupt_chunk(
+                                        info.digest,
+                                        errors,
+                                        &format!(
+                                            "can't verify chunk with digest {} - {err}",
+                                            hex::encode(info.digest)
+                                        ),
+                                    ),
+                                }
                             }
                             Err(err) => {
                                 errors.fetch_add(1, Ordering::SeqCst);
-- 
2.47.3




* [pbs-devel] applied-series: [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies
  2025-10-29 11:06 [pbs-devel] [PATCH proxmox-backup 0/2] fix 2 issues with s3 store verifies Christian Ebner
  2025-10-29 11:06 ` [pbs-devel] [PATCH proxmox-backup 1/2] verify: never hold mutex lock in async scope on corrupt chunk rename Christian Ebner
  2025-10-29 11:06 ` [pbs-devel] [PATCH proxmox-backup 2/2] verify: distinguish s3 object fetching and chunk loading error Christian Ebner
@ 2025-11-11 13:57 ` Fabian Grünbichler
  2 siblings, 0 replies; 4+ messages in thread
From: Fabian Grünbichler @ 2025-11-11 13:57 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

although that nested match there would benefit from some refactoring,
and I am also not 100% sure we shouldn't add some additional checks to
the flow there (e.g., heal the S3 chunk if the local one is correct,
verify the local one is correct in addition to the S3 one, ensure
renaming happens even if either side is missing the "corrupt" chunk?)

On October 29, 2025 12:06 pm, Christian Ebner wrote:
> These patches were pulled out of the original patch series [0], since
> they are independent of the bigger series attempting to fix the
> possible race between corrupt chunk renaming and chunk insert/upload,
> and are better reviewed/tested independently.
> 
> Patch 1 makes sure the mutex guard to sync up access to the corrupt
> chunk list is dropped before attempting to rename a corrupt chunk,
> which will call into async context on s3 stores. Otherwise deadlock
> can arise.
> 
> Patch 2 is a followup to the bugfix for issue #6665, which however
> did not correctly distinguish between transient fetching errors and
> the possible chunk DataBlob decoding error from the response body in
> case of a successful response.
> 
> [0] https://lore.proxmox.com/pbs-devel/20251016131819.349049-6-c.ebner@proxmox.com/T/
> 
> Christian Ebner (2):
>   verify: never hold mutex lock in async scope on corrupt chunk rename
>   verify: distinguish s3 object fetching and chunk loading error
> 
>  src/backup/verify.rs | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> -- 
> 2.47.3
> 
> 
> 

