public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [RFC proxmox-backup] fix #5710: api: backup: stat known chunks during backup
@ 2024-09-13 13:59 Christian Ebner
From: Christian Ebner @ 2024-09-13 13:59 UTC (permalink / raw)
  To: pbs-devel

Known chunks are expected to be present on the datastore a priori,
allowing clients to only re-index these chunks without uploading the
raw chunk data. The list of reusable known chunks is sent to the
client by the server, deduced from the indexed chunks of the previous
backup snapshot of the group.
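
For context, a much simplified sketch of the reuse mechanism described
above; PreviousIndex and register_known_chunks are illustrative
stand-ins, not the actual server-side types:

    use std::collections::HashMap;

    type ChunkDigest = [u8; 32];

    /// Index entries read from the previous snapshot (digest plus chunk size).
    struct PreviousIndex {
        chunks: Vec<(ChunkDigest, u32)>,
    }

    /// Register every chunk of the previous snapshot as "known", so the
    /// client may reference the digest instead of uploading the raw chunk
    /// data again.
    fn register_known_chunks(
        index: &PreviousIndex,
        known_chunks: &mut HashMap<ChunkDigest, u32>,
    ) {
        for (digest, size) in &index.chunks {
            known_chunks.insert(*digest, *size);
        }
    }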

If, however, such a known chunk has disappeared (because the previous
backup snapshot was verified before the chunk went missing, or has not
been verified yet), the backup still finishes without error, resulting
in a seemingly successful backup. Only a subsequent verification job
will detect the snapshot as corrupt.

In order to reduce the impact, stat the list of known chunks twice
during backup:
- during registration of a known chunk
- when finishing the backup

The first stat detects missing chunks early on, the second assures
that all chunks (known and newly registered) are present on the
datastore after the backup run.
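
As a minimal sketch of what stat-ing a chunk means here: check the
chunk file's metadata on disk without reading or verifying its data.
The .chunks/<prefix>/<digest> layout and the helper names below are
assumptions for illustration, not the actual datastore API:

    use std::fmt::Write as _;
    use std::fs::Metadata;
    use std::path::{Path, PathBuf};

    /// Hex-encode a chunk digest (lowercase).
    fn digest_hex(digest: &[u8; 32]) -> String {
        digest.iter().fold(String::with_capacity(64), |mut s, b| {
            let _ = write!(s, "{b:02x}");
            s
        })
    }

    /// Path of a chunk file, assuming sharding by the first four hex digits.
    fn chunk_path(datastore_base: &Path, digest: &[u8; 32]) -> PathBuf {
        let hex = digest_hex(digest);
        datastore_base.join(".chunks").join(&hex[..4]).join(&hex)
    }

    /// A plain metadata lookup (stat) notices a missing chunk file without
    /// reading or verifying its contents.
    fn stat_chunk(datastore_base: &Path, digest: &[u8; 32]) -> std::io::Result<Metadata> {
        std::fs::metadata(chunk_path(datastore_base, digest))
    }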

If a missing chunk is detected, the backup run itself will fail and
the previous backup snapshot's verify state is set to failed.
This prevents the same snapshot from being reused as base by a
subsequent backup job.
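
Roughly, the selection side could then skip such a snapshot like this;
a hedged sketch where previous_snapshot_usable is a hypothetical helper
and the lowercase "failed" string is an assumption about how
VerifyState is serialized into the manifest's unprotected section:

    use serde_json::Value;

    /// Decide whether the previous snapshot may serve as the base for
    /// known-chunk reuse: skip it once its manifest records a failed
    /// verification.
    fn previous_snapshot_usable(unprotected: &Value) -> bool {
        match unprotected["verify_state"]["state"].as_str() {
            Some("failed") => false, // flagged by verification or by this check
            _ => true,               // no verify state recorded, or "ok"
        }
    }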

link to issue in bugtracker:
https://bugzilla.proxmox.com/show_bug.cgi?id=5710

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Sending this as RFC because of the possible performance impact,
especially for users with datastores located on storages with limited
I/O.

Tests on my side did not show a significant performance difference
compared to an unpatched server.
I performed vzdump backups of a VM with a 32G disk attached and an
LXC container with a Debian install and a rootfs of ca. 400M (both
powered off, no changes to the data in between backup runs). The PBS
datastore was located on ZFS backed by an NVMe SSD, with 5 runs each
after an initial run to assure full chunk presence on the server and
a valid previous snapshot.

Here are some ballpark figures:

-----------------------------------------------------------
patched                      | unpatched
-----------------------------------------------------------
VM           | LXC           | VM           | LXC
-----------------------------------------------------------
15.3s ± 1.4s | 1.73s ± 0.02s | 14.7s ± 1.8s | 1.77s ± 0.05s
-----------------------------------------------------------

 src/api2/backup/environment.rs | 12 ++++++++++
 src/api2/backup/mod.rs         | 44 ++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index 99d885e2e..90fd91d63 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -259,6 +259,18 @@ impl BackupEnvironment {
         state.known_chunks.get(digest).copied()
     }
 
+    /// Stat all currently registered chunks
+    ///
+    pub fn stat_registered_chunks(&self) -> Result<(), Error> {
+        let state = self.state.lock().unwrap();
+        for digest in state.known_chunks.keys() {
+            self.datastore
+                .stat_chunk(&digest)
+                .map_err(|err| format_err!("stat failed on {} - {err}", hex::encode(digest)))?;
+        }
+        Ok(())
+    }
+
     /// Store the writer with an unique ID
     pub fn register_dynamic_writer(
         &self,
diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
index ea0d0292e..c308066b6 100644
--- a/src/api2/backup/mod.rs
+++ b/src/api2/backup/mod.rs
@@ -579,6 +579,8 @@ fn dynamic_append(
             .lookup_chunk(&digest)
             .ok_or_else(|| format_err!("no such chunk {}", digest_str))?;
 
+        stat_known_chunk(env, wid, &digest)?;
+
         env.dynamic_writer_append_chunk(wid, offset, size, &digest)?;
 
         env.debug(format!(
@@ -653,6 +655,8 @@ fn fixed_append(
             .lookup_chunk(&digest)
             .ok_or_else(|| format_err!("no such chunk {}", digest_str))?;
 
+        stat_known_chunk(env, wid, &digest)?;
+
         env.fixed_writer_append_chunk(wid, offset, size, &digest)?;
 
         env.debug(format!(
@@ -785,6 +789,12 @@ fn finish_backup(
 ) -> Result<Value, Error> {
     let env: &BackupEnvironment = rpcenv.as_ref();
 
+    if let Err(err) = env.stat_registered_chunks() {
+        env.debug(format!("stat registered chunks failed - {err}"));
+        update_previous_manifest_to_verify_failed(env)?;
+        bail!("stat registered chunks failed - {err}");
+    }
+
     env.finish_backup()?;
     env.log("successfully finished backup");
 
@@ -872,3 +882,37 @@ fn download_previous(
     }
     .boxed()
 }
+
+// Check if known chunk is still on disk
+//
+// If not, fail and mark previous backup snapshot as invalid so the next backup run will not use it.
+fn stat_known_chunk(env: &BackupEnvironment, wid: usize, digest: &[u8; 32]) -> Result<(), Error> {
+    if let Err(err) = env.datastore.stat_chunk(&digest) {
+        let digest_str = hex::encode(digest);
+        env.debug(format!(
+            "known chunk stat failed on {digest_str} indexed by {wid} - {err}",
+        ));
+        update_previous_manifest_to_verify_failed(env)?;
+        bail!("known chunk stat failed on {digest_str} - {err}");
+    }
+    Ok(())
+}
+
+// Update the manifest of the previous backup snapshot to be in `VerifyState::Failed`, using the
+// current backup writer tasks UPID.
+fn update_previous_manifest_to_verify_failed(env: &BackupEnvironment) -> Result<(), Error> {
+    if let Some(last) = env.last_backup.as_ref() {
+        // No need to acquire snapshot lock, already locked when starting the backup
+        let verify_state = SnapshotVerifyState {
+            state: VerifyState::Failed,
+            upid: env.worker.upid().clone(), // backup writer UPID
+        };
+        let verify_state = serde_json::to_value(verify_state)?;
+        last.backup_dir
+            .update_manifest(|manifest| {
+                manifest.unprotected["verify_state"] = verify_state;
+            })
+            .map_err(|err| format_err!("manifest update failed - {err}"))?;
+    }
+    Ok(())
+}
-- 
2.39.2


