public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [RFC proxmox-backup 0/2] fix #6750: fix possible deadlock for s3 backed datastore backups
@ 2025-09-24 14:56 Christian Ebner
  2025-09-24 14:56 ` [pbs-devel] [PATCH proxmox-backup 1/2] fix #6750: api: avoid possible deadlock on datastores with s3 backend Christian Ebner
  2025-09-24 14:56 ` [pbs-devel] [PATCH proxmox-backup 2/2] api: backup: never hold mutex guard when doing manifest update Christian Ebner
  0 siblings, 2 replies; 3+ messages in thread
From: Christian Ebner @ 2025-09-24 14:56 UTC (permalink / raw)
  To: pbs-devel

These patches aim to fix a deadlock that can occur during backup
jobs to datastores backed by an S3 backend. The deadlock is most
likely caused by holding the mutex guard for the shared backup state
while entering a tokio::task::block_in_place context and executing
async code, which can lead to deadlocks as described in [0].

Therefore, these patches avoid holding the mutex guard for the shared
backup state while performing the s3 backend operations, by dropping
it early. To avoid inconsistencies, active backend operations are
accounted for and checked when closing fixed/dynamic index files and
in the finish api calls.

Sending this as an RFC with the intent to provide the patches for a
potential test build, to verify that it resolves the issue for
affected users.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
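
The pattern described above can be sketched as follows (a minimal,
self-contained illustration; `upload` and `finish_file` are hypothetical
stand-ins, not the actual proxmox-backup API):

```rust
use std::sync::Mutex;

// Stand-in for the s3 backend call that enters
// tokio::task::block_in_place and drives async code.
fn upload() -> Result<(), String> {
    Ok(())
}

// Update the shared state, then drop the guard *before* the
// potentially blocking backend call, so no std mutex is held
// while async code runs on this executor thread.
fn finish_file(counter: &Mutex<usize>) -> Result<(), String> {
    let mut guard = counter.lock().unwrap();
    *guard += 1;
    drop(guard);
    upload()
}
```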

Link to the bugtracker issue:
https://bugzilla.proxmox.com/show_bug.cgi?id=6750

Another report in the community forum:
https://forum.proxmox.com/threads/171422/

proxmox-backup:

Christian Ebner (2):
  fix #6750: api: avoid possible deadlock on datastores with s3 backend
  api: backup: never hold mutex guard when doing manifest update

 src/api2/backup/environment.rs | 89 +++++++++++++++++++++++-----------
 1 file changed, 61 insertions(+), 28 deletions(-)


Summary over all repositories:
  1 files changed, 61 insertions(+), 28 deletions(-)

-- 
Generated by git-murpp 0.8.1


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [pbs-devel] [PATCH proxmox-backup 1/2] fix #6750: api: avoid possible deadlock on datastores with s3 backend
  2025-09-24 14:56 [pbs-devel] [RFC proxmox-backup 0/2] fix #6750: fix possible deadlock for s3 backed datastore backups Christian Ebner
@ 2025-09-24 14:56 ` Christian Ebner
  2025-09-24 14:56 ` [pbs-devel] [PATCH proxmox-backup 2/2] api: backup: never hold mutex guard when doing manifest update Christian Ebner
  1 sibling, 0 replies; 3+ messages in thread
From: Christian Ebner @ 2025-09-24 14:56 UTC (permalink / raw)
  To: pbs-devel

Closing the fixed or dynamic index files with s3 backend will call
async code, which must be avoided because of possible deadlocks [0].
Therefore, perform all changes on the shared backup state and drop the
guard before uploading the index file to the s3 backend.

Account for active backend operations and check consistency, since it
must be ensured that all active backend operations have finished
before the finish call can succeed.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
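
The accounting introduced here can be sketched as follows (a simplified,
hypothetical mirror of the patch; the field names follow the diff,
everything else is illustrative):

```rust
use std::sync::Mutex;

struct SharedBackupState {
    file_counter: usize,
    active_backend_operations: usize,
}

// Stand-in for the real (async) s3 index upload.
fn s3_upload_index(_name: &str) -> Result<(), String> {
    Ok(())
}

fn close_index(state: &Mutex<SharedBackupState>, name: &str) -> Result<(), String> {
    let mut guard = state.lock().unwrap();
    guard.file_counter += 1;
    guard.active_backend_operations += 1;
    // never hold the mutex guard during the s3 upload
    drop(guard);

    s3_upload_index(name)?;

    // re-acquire only to mark the backend operation as done
    state.lock().unwrap().active_backend_operations -= 1;
    Ok(())
}

fn finish(state: &Mutex<SharedBackupState>) -> Result<(), String> {
    let guard = state.lock().unwrap();
    if guard.active_backend_operations != 0 {
        return Err(format!(
            "backup task still has {} active operations",
            guard.active_backend_operations
        ));
    }
    Ok(())
}
```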

Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=6750
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/api2/backup/environment.rs | 77 +++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index d5e6869cd..e535891a4 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -82,6 +82,7 @@ struct SharedBackupState {
     finished: bool,
     uid_counter: usize,
     file_counter: usize, // successfully uploaded files
+    active_backend_operations: usize,
     dynamic_writers: HashMap<usize, DynamicWriterState>,
     fixed_writers: HashMap<usize, FixedWriterState>,
     known_chunks: KnownChunksMap,
@@ -135,6 +136,7 @@ impl BackupEnvironment {
             finished: false,
             uid_counter: 0,
             file_counter: 0,
+            active_backend_operations: 0,
             dynamic_writers: HashMap::new(),
             fixed_writers: HashMap::new(),
             known_chunks: HashMap::new(),
@@ -483,15 +485,10 @@ impl BackupEnvironment {
             );
         }
 
-        // For S3 backends, upload the index file to the object store after closing
-        if let DatastoreBackend::S3(s3_client) = &self.backend {
-            self.s3_upload_index(s3_client, &data.name)
-                .context("failed to upload dynamic index to s3 backend")?;
-            self.log(format!(
-                "Uploaded dynamic index file to s3 backend: {}",
-                data.name
-            ))
-        }
+        state.file_counter += 1;
+        state.backup_size += size;
+        state.backup_stat = state.backup_stat + data.upload_stat;
+        state.active_backend_operations += 1;
 
         self.log_upload_stat(
             &data.name,
@@ -502,9 +499,21 @@ impl BackupEnvironment {
             &data.upload_stat,
         );
 
-        state.file_counter += 1;
-        state.backup_size += size;
-        state.backup_stat = state.backup_stat + data.upload_stat;
+        // never hold mutex guard during s3 upload due to possible deadlocks
+        drop(state);
+
+        // For S3 backends, upload the index file to the object store after closing
+        if let DatastoreBackend::S3(s3_client) = &self.backend {
+            self.s3_upload_index(s3_client, &data.name)
+                .context("failed to upload dynamic index to s3 backend")?;
+            self.log(format!(
+                "Uploaded dynamic index file to s3 backend: {}",
+                data.name
+            ))
+        }
+
+        let mut state = self.state.lock().unwrap();
+        state.active_backend_operations -= 1;
 
         Ok(())
     }
@@ -567,15 +576,10 @@ impl BackupEnvironment {
             );
         }
 
-        // For S3 backends, upload the index file to the object store after closing
-        if let DatastoreBackend::S3(s3_client) = &self.backend {
-            self.s3_upload_index(s3_client, &data.name)
-                .context("failed to upload fixed index to s3 backend")?;
-            self.log(format!(
-                "Uploaded fixed index file to object store: {}",
-                data.name
-            ))
-        }
+        state.file_counter += 1;
+        state.backup_size += size;
+        state.backup_stat = state.backup_stat + data.upload_stat;
+        state.active_backend_operations += 1;
 
         self.log_upload_stat(
             &data.name,
@@ -586,9 +590,21 @@ impl BackupEnvironment {
             &data.upload_stat,
         );
 
-        state.file_counter += 1;
-        state.backup_size += size;
-        state.backup_stat = state.backup_stat + data.upload_stat;
+        // never hold mutex guard during s3 upload due to possible deadlocks
+        drop(state);
+
+        // For S3 backends, upload the index file to the object store after closing
+        if let DatastoreBackend::S3(s3_client) = &self.backend {
+            self.s3_upload_index(s3_client, &data.name)
+                .context("failed to upload fixed index to s3 backend")?;
+            self.log(format!(
+                "Uploaded fixed index file to object store: {}",
+                data.name
+            ))
+        }
+
+        let mut state = self.state.lock().unwrap();
+        state.active_backend_operations -= 1;
 
         Ok(())
     }
@@ -645,6 +661,13 @@ impl BackupEnvironment {
             bail!("found open index writer - unable to finish backup");
         }
 
+        if state.active_backend_operations != 0 {
+            bail!(
+                "backup task still has {} active operations.",
+                state.active_backend_operations,
+            );
+        }
+
         if state.file_counter == 0 {
             bail!("backup does not contain valid files (file count == 0)");
         }
@@ -753,6 +776,12 @@ impl BackupEnvironment {
         if !state.finished {
             bail!("backup ended but finished flag is not set.");
         }
+        if state.active_backend_operations != 0 {
+            bail!(
+                "backup ended but {} backend operations still active.",
+                state.active_backend_operations,
+            );
+        }
         Ok(())
     }
 
-- 
2.47.3




* [pbs-devel] [PATCH proxmox-backup 2/2] api: backup: never hold mutex guard when doing manifest update
  2025-09-24 14:56 [pbs-devel] [RFC proxmox-backup 0/2] fix #6750: fix possible deadlock for s3 backed datastore backups Christian Ebner
  2025-09-24 14:56 ` [pbs-devel] [PATCH proxmox-backup 1/2] fix #6750: api: avoid possible deadlock on datastores with s3 backend Christian Ebner
@ 2025-09-24 14:56 ` Christian Ebner
  1 sibling, 0 replies; 3+ messages in thread
From: Christian Ebner @ 2025-09-24 14:56 UTC (permalink / raw)
  To: pbs-devel

A manifest update with s3 backend will call async code, which must
be avoided because of possible deadlocks [0]. Therefore, perform all
changes on the shared backup state and drop the guard before
updating the manifest, which performs the backend specific update.
Dropping the guard early is fine, as the state has already been
marked as finished, so no other api call belonging to the same
backup task can perform further changes anyway.

[0] https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
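
The ordering this establishes can be sketched as follows (a hypothetical,
minimal illustration; names are not the actual proxmox-backup API):

```rust
use std::sync::Mutex;

struct State {
    finished: bool,
}

// Stand-in for the backend specific (async) manifest update.
fn update_manifest() -> Result<(), String> {
    Ok(())
}

fn finish_backup(state: &Mutex<State>) -> Result<(), String> {
    let mut guard = state.lock().unwrap();
    // set first: concurrent api calls for this backup task now see a
    // finished state and are rejected, even after the guard is dropped
    guard.finished = true;
    drop(guard);

    update_manifest()
}
```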

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
 src/api2/backup/environment.rs | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index e535891a4..073027c51 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -682,8 +682,15 @@ impl BackupEnvironment {
             }
         }
 
-        // check for valid manifest and store stats
         let stats = serde_json::to_value(state.backup_stat)?;
+
+        // marks the backup state as finished, so no other api calls can modify its state anymore
+        state.finished = true;
+
+        // never hold mutex guard during s3 upload due to possible deadlocks
+        drop(state);
+
+        // check for valid manifest and store stats
         self.backup_dir
             .update_manifest(&self.backend, |manifest| {
                 manifest.unprotected["chunk_upload_stats"] = stats;
@@ -692,9 +699,6 @@ impl BackupEnvironment {
 
         self.datastore.try_ensure_sync_level()?;
 
-        // marks the backup as successful
-        state.finished = true;
-
         Ok(())
     }
 
-- 
2.47.3


