public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job
@ 2024-10-15 13:18 Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw)
  To: pbs-devel

Add an option `resync-corrupt` that resyncs corrupt snapshots when running
a sync job. This option checks whether the local snapshot failed its last
verification and, if it did, overwrites the local snapshot with the
remote one.
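
Roughly, the decision for an already existing local snapshot boils down to
the following (a simplified, self-contained sketch with stand-in types, to
illustrate the idea only; the real types and helper live in the patches
below):

#[derive(PartialEq)]
enum VerifyState {
    Ok,
    Failed,
}

struct LocalSnapshot {
    verify_state: Option<VerifyState>,
}

/// Illustration only: should an already present local snapshot be pulled again?
fn resync_existing(resync_corrupt: bool, local: &LocalSnapshot) -> bool {
    // re-pull only if the option is enabled and the last verification failed
    resync_corrupt && local.verify_state == Some(VerifyState::Failed)
}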

This is quite useful, as we currently don't have an option to "fix" 
broken chunks/snapshots in any way, even if a healthy version is on 
another (e.g. offsite) instance.

A few important caveats: this comes with a slight performance penalty,
as all the manifests have to be looked through, and a verification job
has to be run beforehand, otherwise we do not know whether a snapshot
is healthy.

Note: This series was originally written by Shannon! I just picked it
up, rebased it, and addressed the obvious comments on the last series.

Changelog since RFC (Shannon's work):
 - rename option from deep-sync to resync-corrupt
 - rebase on latest master (and change implementation details, as a 
     lot has changed around sync-jobs)

proxmox-backup:

Gabriel Goller (3):
  fix #3786: api: add resync-corrupt option to sync jobs
  fix #3786: ui/cli: add resync-corrupt option on sync-jobs
  fix #3786: docs: add resync-corrupt option to sync-job

 docs/managing-remotes.rst         |  6 +++
 pbs-api-types/src/jobs.rs         | 11 +++++
 pbs-datastore/src/backup_info.rs  | 13 +++++-
 src/api2/config/sync.rs           |  4 ++
 src/api2/pull.rs                  |  9 +++-
 src/bin/proxmox-backup-manager.rs | 13 +++++-
 src/server/pull.rs                | 68 +++++++++++++++++++++----------
 www/window/SyncJobEdit.js         | 11 +++++
 8 files changed, 110 insertions(+), 25 deletions(-)


Summary over all repositories:
  8 files changed, 110 insertions(+), 25 deletions(-)

-- 
Generated by git-murpp 0.7.1



* [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
@ 2024-10-15 13:18 ` Gabriel Goller
  2024-10-17  6:15   ` Thomas Lamprecht
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
  2 siblings, 1 reply; 9+ messages in thread
From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw)
  To: pbs-devel

This option allows us to "fix" corrupt snapshots (and/or their chunks)
by pulling them from another remote. When traversing the remote
snapshots, we check if it exists locally, and if it does, we check if the
last verification of it failed. If the local snapshot is broken and the
`resync-corrupt` option is turned on, we pull in the remote snapshot,
overwriting the local one.

This is very useful and has been requested a lot, as there is currently
no way to "fix" corrupt chunks/snapshots even if the user has a healthy
version of it on their offsite instance.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
---
 pbs-api-types/src/jobs.rs         | 11 +++++
 pbs-datastore/src/backup_info.rs  | 13 +++++-
 src/api2/config/sync.rs           |  4 ++
 src/api2/pull.rs                  |  9 +++-
 src/bin/proxmox-backup-manager.rs |  4 +-
 src/server/pull.rs                | 68 +++++++++++++++++++++----------
 6 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
index 868702bc059e..a3cd68bf5afd 100644
--- a/pbs-api-types/src/jobs.rs
+++ b/pbs-api-types/src/jobs.rs
@@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
         .minimum(1)
         .schema();
 
+pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
+    "If the verification failed for a local snapshot, try to pull its chunks again.",
+)
+.schema();
+
 #[api(
     properties: {
         id: {
@@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
             schema: TRANSFER_LAST_SCHEMA,
             optional: true,
         },
+        "resync-corrupt": {
+            schema: RESYNC_CORRUPT_SCHEMA,
+            optional: true,
+        }
     }
 )]
 #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
@@ -585,6 +594,8 @@ pub struct SyncJobConfig {
     pub limit: RateLimitConfig,
     #[serde(skip_serializing_if = "Option::is_none")]
     pub transfer_last: Option<usize>,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub resync_corrupt: Option<bool>,
 }
 
 impl SyncJobConfig {
diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index 414ec878d01a..2d283dbe5cd9 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
 use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
 
 use pbs_api_types::{
-    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
+    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
+    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
 };
 use pbs_config::{open_backup_lockfile, BackupLockGuard};
 
@@ -583,6 +584,16 @@ impl BackupDir {
 
         Ok(())
     }
+
+    /// Load the verify state from the manifest.
+    pub fn verify_state(&self) -> Option<VerifyState> {
+        self.load_manifest().ok().and_then(|(m, _)| {
+            let verify = m.unprotected["verify_state"].clone();
+            serde_json::from_value::<SnapshotVerifyState>(verify)
+                .ok()
+                .map(|svs| svs.state)
+        })
+    }
 }
 
 impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
index 6fdc69a9e645..fa9db92f3d11 100644
--- a/src/api2/config/sync.rs
+++ b/src/api2/config/sync.rs
@@ -368,6 +368,9 @@ pub fn update_sync_job(
     if let Some(transfer_last) = update.transfer_last {
         data.transfer_last = Some(transfer_last);
     }
+    if let Some(resync_corrupt) = update.resync_corrupt {
+        data.resync_corrupt = Some(resync_corrupt);
+    }
 
     if update.limit.rate_in.is_some() {
         data.limit.rate_in = update.limit.rate_in;
@@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
         ns: None,
         owner: Some(write_auth_id.clone()),
         comment: None,
+        resync_corrupt: None,
         remove_vanished: None,
         max_depth: None,
         group_filter: None,
diff --git a/src/api2/pull.rs b/src/api2/pull.rs
index e733c9839e3a..0d4be0e2d228 100644
--- a/src/api2/pull.rs
+++ b/src/api2/pull.rs
@@ -10,7 +10,7 @@ use pbs_api_types::{
     Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
     GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
     PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
-    TRANSFER_LAST_SCHEMA,
+    RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
 };
 use pbs_config::CachedUserInfo;
 use proxmox_human_byte::HumanByte;
@@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
             sync_job.group_filter.clone(),
             sync_job.limit.clone(),
             sync_job.transfer_last,
+            sync_job.resync_corrupt,
         )
     }
 }
@@ -240,6 +241,10 @@ pub fn do_sync_job(
                 schema: TRANSFER_LAST_SCHEMA,
                 optional: true,
             },
+            "resync-corrupt": {
+                schema: RESYNC_CORRUPT_SCHEMA,
+                optional: true,
+            },
         },
     },
     access: {
@@ -264,6 +269,7 @@ async fn pull(
     group_filter: Option<Vec<GroupFilter>>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    resync_corrupt: Option<bool>,
     rpcenv: &mut dyn RpcEnvironment,
 ) -> Result<String, Error> {
     let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
@@ -301,6 +307,7 @@ async fn pull(
         group_filter,
         limit,
         transfer_last,
+        resync_corrupt,
     )?;
 
     // fixme: set to_stdout to false?
diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
index 420e96665662..38a1cf0f5881 100644
--- a/src/bin/proxmox-backup-manager.rs
+++ b/src/bin/proxmox-backup-manager.rs
@@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component;
 use pbs_api_types::{
     BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
     GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA,
-    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA,
-    VERIFICATION_OUTDATED_AFTER_SCHEMA,
+    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
+    UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA,
 };
 use pbs_client::{display_task_log, view_task_result};
 use pbs_config::sync;
diff --git a/src/server/pull.rs b/src/server/pull.rs
index 3117f7d2c960..67881f83b5e3 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -12,7 +12,8 @@ use tracing::info;
 
 use pbs_api_types::{
     print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation,
-    RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP,
+    RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT,
+    PRIV_DATASTORE_BACKUP,
 };
 use pbs_client::BackupRepository;
 use pbs_config::CachedUserInfo;
@@ -55,6 +56,8 @@ pub(crate) struct PullParameters {
     group_filter: Vec<GroupFilter>,
     /// How many snapshots should be transferred at most (taking the newest N snapshots)
     transfer_last: Option<usize>,
+    /// Whether to re-sync corrupted snapshots
+    resync_corrupt: bool,
 }
 
 impl PullParameters {
@@ -72,12 +75,14 @@ impl PullParameters {
         group_filter: Option<Vec<GroupFilter>>,
         limit: RateLimitConfig,
         transfer_last: Option<usize>,
+        resync_corrupt: Option<bool>,
     ) -> Result<Self, Error> {
         if let Some(max_depth) = max_depth {
             ns.check_max_depth(max_depth)?;
             remote_ns.check_max_depth(max_depth)?;
         };
         let remove_vanished = remove_vanished.unwrap_or(false);
+        let resync_corrupt = resync_corrupt.unwrap_or(false);
 
         let source: Arc<dyn SyncSource> = if let Some(remote) = remote {
             let (remote_config, _digest) = pbs_config::remote::config()?;
@@ -116,6 +121,7 @@ impl PullParameters {
             max_depth,
             group_filter,
             transfer_last,
+            resync_corrupt,
         })
     }
 }
@@ -175,9 +181,10 @@ async fn pull_index_chunks<I: IndexFile>(
                     target.cond_touch_chunk(&info.digest, false)
                 })?;
                 if chunk_exists {
-                    //info!("chunk {} exists {}", pos, hex::encode(digest));
+                    //info!("chunk exists {}", hex::encode(info.digest));
                     return Ok::<_, Error>(());
                 }
+
                 //info!("sync {} chunk {}", pos, hex::encode(digest));
                 let chunk = chunk_reader.read_raw_chunk(&info.digest).await?;
                 let raw_size = chunk.raw_size() as usize;
@@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+    resync_corrupt: bool,
 ) -> Result<SyncStats, Error> {
     let mut sync_stats = SyncStats::default();
     let mut manifest_name = snapshot.full_path();
@@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
         return Ok(sync_stats);
     }
 
+    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
+
     if manifest_name.exists() {
         let manifest_blob = proxmox_lang::try_block!({
             let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
@@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
             format_err!("unable to read local manifest {manifest_name:?} - {err}")
         })?;
 
-        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
+        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
+            && (!resync_corrupt && !snapshot_is_corrupt)
+        {
             if !client_log_name.exists() {
                 reader.try_download_client_log(&client_log_name).await?;
             };
@@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
         let mut path = snapshot.full_path();
         path.push(&item.filename);
 
-        if path.exists() {
+        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
             match ArchiveType::from_path(&item.filename)? {
                 ArchiveType::DynamicIndex => {
                     let index = DynamicIndexReader::open(&path)?;
@@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
             }
         }
 
+        if resync_corrupt && snapshot_is_corrupt {
+            info!("pulling snapshot {} again", snapshot.dir());
+        }
+
         let stats =
             pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
         sync_stats.add(stats);
@@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+    resync_corrupt: bool,
 ) -> Result<SyncStats, Error> {
     let (_path, is_new, _snap_lock) = snapshot
         .datastore()
@@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
     let sync_stats = if is_new {
         info!("sync snapshot {}", snapshot.dir());
 
-        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
+        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
             Err(err) => {
                 if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
                     snapshot.backup_ns(),
@@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
         }
     } else {
         info!("re-sync snapshot {}", snapshot.dir());
-        pull_snapshot(reader, snapshot, downloaded_chunks).await?
+        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
     };
 
     Ok(sync_stats)
@@ -528,21 +545,24 @@ async fn pull_group(
         .enumerate()
         .filter(|&(pos, ref dir)| {
             source_snapshots.insert(dir.time);
-            if last_sync_time > dir.time {
-                already_synced_skip_info.update(dir.time);
-                return false;
-            } else if already_synced_skip_info.count > 0 {
-                info!("{already_synced_skip_info}");
-                already_synced_skip_info.reset();
-                return true;
-            }
+            // If resync_corrupt is set, we go through all the remote snapshots
+            if !params.resync_corrupt {
+                if last_sync_time > dir.time {
+                    already_synced_skip_info.update(dir.time);
+                    return false;
+                } else if already_synced_skip_info.count > 0 {
+                    info!("{already_synced_skip_info}");
+                    already_synced_skip_info.reset();
+                    return true;
+                }
 
-            if pos < cutoff && last_sync_time != dir.time {
-                transfer_last_skip_info.update(dir.time);
-                return false;
-            } else if transfer_last_skip_info.count > 0 {
-                info!("{transfer_last_skip_info}");
-                transfer_last_skip_info.reset();
+                if pos < cutoff && last_sync_time != dir.time {
+                    transfer_last_skip_info.update(dir.time);
+                    return false;
+                } else if transfer_last_skip_info.count > 0 {
+                    info!("{transfer_last_skip_info}");
+                    transfer_last_skip_info.reset();
+                }
             }
             true
         })
@@ -566,7 +586,13 @@ async fn pull_group(
             .source
             .reader(source_namespace, &from_snapshot)
             .await?;
-        let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
+        let result = pull_snapshot_from(
+            reader,
+            &to_snapshot,
+            downloaded_chunks.clone(),
+            params.resync_corrupt,
+        )
+        .await;
 
         progress.done_snapshots = pos as u64 + 1;
         info!("percentage done: {progress}");
-- 
2.39.5




* [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs
  2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
@ 2024-10-15 13:18 ` Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
  2 siblings, 0 replies; 9+ messages in thread
From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw)
  To: pbs-devel

Add the `resync-corrupt` option to the UI and the
`proxmox-backup-manager` CLI. It is listed in the `Advanced` section,
because it slows down the sync job and has no effect if no verification
job has been run beforehand.
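
For reference, once this is applied, the flag could presumably also be passed
on a manual pull via the CLI, along these lines (remote and datastore names
are placeholders):

# proxmox-backup-manager pull <remote> <remote-store> <local-store> --resync-corrupt true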

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
---
 src/bin/proxmox-backup-manager.rs |  9 +++++++++
 www/window/SyncJobEdit.js         | 11 +++++++++++
 2 files changed, 20 insertions(+)

diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
index 38a1cf0f5881..08728e9d7250 100644
--- a/src/bin/proxmox-backup-manager.rs
+++ b/src/bin/proxmox-backup-manager.rs
@@ -339,6 +339,10 @@ fn task_mgmt_cli() -> CommandLineInterface {
                 schema: TRANSFER_LAST_SCHEMA,
                 optional: true,
             },
+            "resync-corrupt": {
+                schema: RESYNC_CORRUPT_SCHEMA,
+                optional: true,
+            },
         }
    }
 )]
@@ -355,6 +359,7 @@ async fn pull_datastore(
     group_filter: Option<Vec<GroupFilter>>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    resync_corrupt: Option<bool>,
     param: Value,
 ) -> Result<Value, Error> {
     let output_format = get_output_format(&param);
@@ -391,6 +396,10 @@ async fn pull_datastore(
         args["transfer-last"] = json!(transfer_last)
     }
 
+    if let Some(resync_corrupt) = resync_corrupt {
+        args["resync-corrupt"] = Value::from(resync_corrupt);
+    }
+
     let mut limit_json = json!(limit);
     let limit_map = limit_json
         .as_object_mut()
diff --git a/www/window/SyncJobEdit.js b/www/window/SyncJobEdit.js
index 6543995e8800..a3c497fc2185 100644
--- a/www/window/SyncJobEdit.js
+++ b/www/window/SyncJobEdit.js
@@ -321,6 +321,17 @@ Ext.define('PBS.window.SyncJobEdit', {
 			    deleteEmpty: '{!isCreate}',
 			},
 		    },
+		    {
+			fieldLabel: gettext('Resync corrupt snapshots'),
+			xtype: 'proxmoxcheckbox',
+			name: 'resync-corrupt',
+			autoEl: {
+			    tag: 'div',
+			    'data-qtip': gettext('Re-sync snapshots whose verification failed.'),
+			},
+			uncheckedValue: false,
+			value: false,
+		    },
 		],
 	    },
 	    {
-- 
2.39.5




* [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job
  2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
@ 2024-10-15 13:18 ` Gabriel Goller
  2 siblings, 0 replies; 9+ messages in thread
From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw)
  To: pbs-devel

Add a short section explaining the `resync-corrupt` option on the
sync-job.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
---
 docs/managing-remotes.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/managing-remotes.rst b/docs/managing-remotes.rst
index 38ccd5af25d5..b441daa9b0bd 100644
--- a/docs/managing-remotes.rst
+++ b/docs/managing-remotes.rst
@@ -132,6 +132,12 @@ For mixing include and exclude filter, following rules apply:
 
 .. note:: The ``protected`` flag of remote backup snapshots will not be synced.
 
+Enabling the advanced option 'resync-corrupt' will re-sync all snapshots that failed
+to verify during the last :ref:`maintenance_verification`. Hence, a verification job
+needs to be run before a sync job with 'resync-corrupt' can be carried out. Be aware
+that a sync job with 'resync-corrupt' enabled needs to check the manifests of all
+snapshots in a datastore and might therefore take much longer than regular sync jobs.
+
 Namespace Support
 ^^^^^^^^^^^^^^^^^
 
-- 
2.39.5




* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
@ 2024-10-17  6:15   ` Thomas Lamprecht
  2024-10-17 13:20     ` Gabriel Goller
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Lamprecht @ 2024-10-17  6:15 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Gabriel Goller

On 15/10/2024 at 15:18, Gabriel Goller wrote:
> This option allows us to "fix" corrupt snapshots (and/or their chunks)
> by pulling them from another remote. When traversing the remote
> snapshots, we check if it exists locally, and if it does, we check if the
> last verification of it failed. If the local snapshot is broken and the
> `resync-corrupt` option is turned on, we pull in the remote snapshot,
> overwriting the local one.
> 
> This is very useful and has been requested a lot, as there is currently
> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
> version of it on their offsite instance.
> 
> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
> Originally-by: Shannon Sterz <s.sterz@proxmox.com>

those two trailers probably should be switched to uphold causal ordering.


In general, it would be _really_ nice to get some regression testing for
sync jobs covering the selection with different matching and options.
The lack of that doesn't have to block your/shannon's patch, but IMO it
should be added rather sooner than later. Sync is becoming more and more
complex, and the selection of snapshots is a very important and fundamental
part of it.

> ---
>  pbs-api-types/src/jobs.rs         | 11 +++++
>  pbs-datastore/src/backup_info.rs  | 13 +++++-
>  src/api2/config/sync.rs           |  4 ++
>  src/api2/pull.rs                  |  9 +++-
>  src/bin/proxmox-backup-manager.rs |  4 +-
>  src/server/pull.rs                | 68 +++++++++++++++++++++----------
>  6 files changed, 84 insertions(+), 25 deletions(-)
> 
> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
> index 868702bc059e..a3cd68bf5afd 100644
> --- a/pbs-api-types/src/jobs.rs
> +++ b/pbs-api-types/src/jobs.rs
> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>          .minimum(1)
>          .schema();
>  
> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
> +    "If the verification failed for a local snapshot, try to pull its chunks again.",

not only chunks, but also its indexes (which might also be a source of a bad
verification result, I think)

> +)
> +.schema();
> +
>  #[api(
>      properties: {
>          id: {
> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>              schema: TRANSFER_LAST_SCHEMA,
>              optional: true,
>          },
> +        "resync-corrupt": {
> +            schema: RESYNC_CORRUPT_SCHEMA,
> +            optional: true,
> +        }
>      }
>  )]
>  #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>      pub limit: RateLimitConfig,
>      #[serde(skip_serializing_if = "Option::is_none")]
>      pub transfer_last: Option<usize>,
> +    #[serde(skip_serializing_if = "Option::is_none")]
> +    pub resync_corrupt: Option<bool>,
>  }
>  
>  impl SyncJobConfig {
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index 414ec878d01a..2d283dbe5cd9 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>  
>  use pbs_api_types::{
> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>  };
>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>  
> @@ -583,6 +584,16 @@ impl BackupDir {
>  
>          Ok(())
>      }
> +
> +    /// Load the verify state from the manifest.
> +    pub fn verify_state(&self) -> Option<VerifyState> {
> +        self.load_manifest().ok().and_then(|(m, _)| {

hmm, just ignoring errors here seems a bit odd, that might be a case where one
might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a bit
tedious.

> +            let verify = m.unprotected["verify_state"].clone();
> +            serde_json::from_value::<SnapshotVerifyState>(verify)
> +                .ok()
> +                .map(|svs| svs.state)
> +        })
> +    }
>  }
>  
>  impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
> index 6fdc69a9e645..fa9db92f3d11 100644
> --- a/src/api2/config/sync.rs
> +++ b/src/api2/config/sync.rs
> @@ -368,6 +368,9 @@ pub fn update_sync_job(
>      if let Some(transfer_last) = update.transfer_last {
>          data.transfer_last = Some(transfer_last);
>      }
> +    if let Some(resync_corrupt) = update.resync_corrupt {
> +        data.resync_corrupt = Some(resync_corrupt);
> +    }
>  
>      if update.limit.rate_in.is_some() {
>          data.limit.rate_in = update.limit.rate_in;
> @@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
>          ns: None,
>          owner: Some(write_auth_id.clone()),
>          comment: None,
> +        resync_corrupt: None,
>          remove_vanished: None,
>          max_depth: None,
>          group_filter: None,
> diff --git a/src/api2/pull.rs b/src/api2/pull.rs
> index e733c9839e3a..0d4be0e2d228 100644
> --- a/src/api2/pull.rs
> +++ b/src/api2/pull.rs
> @@ -10,7 +10,7 @@ use pbs_api_types::{
>      Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
>      GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
>      PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
> -    TRANSFER_LAST_SCHEMA,
> +    RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
>  };
>  use pbs_config::CachedUserInfo;
>  use proxmox_human_byte::HumanByte;
> @@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
>              sync_job.group_filter.clone(),
>              sync_job.limit.clone(),
>              sync_job.transfer_last,
> +            sync_job.resync_corrupt,
>          )
>      }
>  }
> @@ -240,6 +241,10 @@ pub fn do_sync_job(
>                  schema: TRANSFER_LAST_SCHEMA,
>                  optional: true,
>              },
> +            "resync-corrupt": {
> +                schema: RESYNC_CORRUPT_SCHEMA,
> +                optional: true,
> +            },
>          },
>      },
>      access: {
> @@ -264,6 +269,7 @@ async fn pull(
>      group_filter: Option<Vec<GroupFilter>>,
>      limit: RateLimitConfig,
>      transfer_last: Option<usize>,
> +    resync_corrupt: Option<bool>,
>      rpcenv: &mut dyn RpcEnvironment,
>  ) -> Result<String, Error> {
>      let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
> @@ -301,6 +307,7 @@ async fn pull(
>          group_filter,
>          limit,
>          transfer_last,
> +        resync_corrupt,
>      )?;
>  
>      // fixme: set to_stdout to false?
> diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
> index 420e96665662..38a1cf0f5881 100644
> --- a/src/bin/proxmox-backup-manager.rs
> +++ b/src/bin/proxmox-backup-manager.rs
> @@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component;
>  use pbs_api_types::{
>      BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
>      GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA,
> -    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA,
> -    VERIFICATION_OUTDATED_AFTER_SCHEMA,
> +    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
> +    UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA,
>  };
>  use pbs_client::{display_task_log, view_task_result};
>  use pbs_config::sync;
> diff --git a/src/server/pull.rs b/src/server/pull.rs
> index 3117f7d2c960..67881f83b5e3 100644
> --- a/src/server/pull.rs
> +++ b/src/server/pull.rs
> @@ -12,7 +12,8 @@ use tracing::info;
>  
>  use pbs_api_types::{
>      print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation,
> -    RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP,
> +    RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT,
> +    PRIV_DATASTORE_BACKUP,
>  };
>  use pbs_client::BackupRepository;
>  use pbs_config::CachedUserInfo;
> @@ -55,6 +56,8 @@ pub(crate) struct PullParameters {
>      group_filter: Vec<GroupFilter>,
>      /// How many snapshots should be transferred at most (taking the newest N snapshots)
>      transfer_last: Option<usize>,
> +    /// Whether to re-sync corrupted snapshots
> +    resync_corrupt: bool,
>  }
>  
>  impl PullParameters {
> @@ -72,12 +75,14 @@ impl PullParameters {
>          group_filter: Option<Vec<GroupFilter>>,
>          limit: RateLimitConfig,
>          transfer_last: Option<usize>,
> +        resync_corrupt: Option<bool>,
>      ) -> Result<Self, Error> {
>          if let Some(max_depth) = max_depth {
>              ns.check_max_depth(max_depth)?;
>              remote_ns.check_max_depth(max_depth)?;
>          };
>          let remove_vanished = remove_vanished.unwrap_or(false);
> +        let resync_corrupt = resync_corrupt.unwrap_or(false);
>  
>          let source: Arc<dyn SyncSource> = if let Some(remote) = remote {
>              let (remote_config, _digest) = pbs_config::remote::config()?;
> @@ -116,6 +121,7 @@ impl PullParameters {
>              max_depth,
>              group_filter,
>              transfer_last,
> +            resync_corrupt,
>          })
>      }
>  }
> @@ -175,9 +181,10 @@ async fn pull_index_chunks<I: IndexFile>(
>                      target.cond_touch_chunk(&info.digest, false)
>                  })?;
>                  if chunk_exists {
> -                    //info!("chunk {} exists {}", pos, hex::encode(digest));
> +                    //info!("chunk exists {}", hex::encode(info.digest));
>                      return Ok::<_, Error>(());
>                  }
> +
>                  //info!("sync {} chunk {}", pos, hex::encode(digest));
>                  let chunk = chunk_reader.read_raw_chunk(&info.digest).await?;
>                  let raw_size = chunk.raw_size() as usize;
> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>      reader: Arc<dyn SyncSourceReader + 'a>,
>      snapshot: &'a pbs_datastore::BackupDir,
>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
> +    resync_corrupt: bool,

would be good to update the doc comment of this method to mention the new param
in the description of steps

>  ) -> Result<SyncStats, Error> {
>      let mut sync_stats = SyncStats::default();
>      let mut manifest_name = snapshot.full_path();
> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>          return Ok(sync_stats);
>      }
>  
> +    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);

Hmm, but currently you'd only need to get the state if resync_corrupt is true?
Or at least if the local manifest already exists, if we want to allow re-syncing
on some other conditions (but that can be adapted once required).

e.g.:

let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);

That would make the checks below a bit easier to read.

And would it make sense to check the downloaded manifest's verify state, no point
in re-pulling if the other side is also corrupt, or is that possibility already
excluded somewhere else in this code path?

> +
>      if manifest_name.exists() {
>          let manifest_blob = proxmox_lang::try_block!({
>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>              format_err!("unable to read local manifest {manifest_name:?} - {err}")
>          })?;
>  
> -        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
> +        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())

unnecessary addition of parenthesis a == b && (!c && !d) would be fine too and less
noise to read, so to say.

> +            && (!resync_corrupt && !snapshot_is_corrupt)

This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it
slightly easier to read for me, and I'm slightly confused as to why it's not
`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above:

if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
    // ...
}

Or why would the expression need to be different here?

> +        {
>              if !client_log_name.exists() {
>                  reader.try_download_client_log(&client_log_name).await?;
>              };
> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>          let mut path = snapshot.full_path();
>          path.push(&item.filename);
>  
> -        if path.exists() {
> +        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>              match ArchiveType::from_path(&item.filename)? {
>                  ArchiveType::DynamicIndex => {
>                      let index = DynamicIndexReader::open(&path)?;
> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>              }
>          }
>  
> +        if resync_corrupt && snapshot_is_corrupt {
> +            info!("pulling snapshot {} again", snapshot.dir());

maybe include the reason for re-pulling this here, e.g. something like:

"re-syncing snapshot {} due to bad verification result"

And wouldn't that get logged multiple times for backups with multiple archives, like
e.g. guests with multiple disks? Makes probably more sense to log that before the for
loop


> +        }
> +
>          let stats =
>              pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>          sync_stats.add(stats);
> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>      reader: Arc<dyn SyncSourceReader + 'a>,
>      snapshot: &'a pbs_datastore::BackupDir,
>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
> +    resync_corrupt: bool,
>  ) -> Result<SyncStats, Error> {
>      let (_path, is_new, _snap_lock) = snapshot
>          .datastore()
> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>      let sync_stats = if is_new {
>          info!("sync snapshot {}", snapshot.dir());
>  
> -        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
> +        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>              Err(err) => {
>                  if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>                      snapshot.backup_ns(),
> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>          }
>      } else {
>          info!("re-sync snapshot {}", snapshot.dir());
> -        pull_snapshot(reader, snapshot, downloaded_chunks).await?
> +        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>      };
>  
>      Ok(sync_stats)
> @@ -528,21 +545,24 @@ async fn pull_group(
>          .enumerate()
>          .filter(|&(pos, ref dir)| {
>              source_snapshots.insert(dir.time);
> -            if last_sync_time > dir.time {
> -                already_synced_skip_info.update(dir.time);
> -                return false;
> -            } else if already_synced_skip_info.count > 0 {
> -                info!("{already_synced_skip_info}");
> -                already_synced_skip_info.reset();
> -                return true;
> -            }
> +            // If resync_corrupt is set, we go through all the remote snapshots
> +            if !params.resync_corrupt {

maybe reverse that and use an explicit return? i.e.:

if params.resync_corrupt {
    return true; // always check all remote snapshots when resyncing
}

// rest of the filter

would avoid an indentation level for the non-bypass path and be slightly easier
to read for me

> +                if last_sync_time > dir.time {
> +                    already_synced_skip_info.update(dir.time);
> +                    return false;
> +                } else if already_synced_skip_info.count > 0 {
> +                    info!("{already_synced_skip_info}");
> +                    already_synced_skip_info.reset();
> +                    return true;
> +                }
>  
> -            if pos < cutoff && last_sync_time != dir.time {
> -                transfer_last_skip_info.update(dir.time);
> -                return false;
> -            } else if transfer_last_skip_info.count > 0 {
> -                info!("{transfer_last_skip_info}");
> -                transfer_last_skip_info.reset();
> +                if pos < cutoff && last_sync_time != dir.time {
> +                    transfer_last_skip_info.update(dir.time);
> +                    return false;
> +                } else if transfer_last_skip_info.count > 0 {
> +                    info!("{transfer_last_skip_info}");
> +                    transfer_last_skip_info.reset();
> +                }
>              }
>              true
>          })
> @@ -566,7 +586,13 @@ async fn pull_group(
>              .source
>              .reader(source_namespace, &from_snapshot)
>              .await?;
> -        let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
> +        let result = pull_snapshot_from(
> +            reader,
> +            &to_snapshot,
> +            downloaded_chunks.clone(),
> +            params.resync_corrupt,
> +        )
> +        .await;
>  
>          progress.done_snapshots = pos as u64 + 1;
>          info!("percentage done: {progress}");




* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-17  6:15   ` Thomas Lamprecht
@ 2024-10-17 13:20     ` Gabriel Goller
  2024-10-17 14:09       ` Thomas Lamprecht
  2024-10-21  8:16       ` Lukas Wagner
  0 siblings, 2 replies; 9+ messages in thread
From: Gabriel Goller @ 2024-10-17 13:20 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox Backup Server development discussion

On 17.10.2024 08:15, Thomas Lamprecht wrote:
>On 15/10/2024 at 15:18, Gabriel Goller wrote:
>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>> by pulling them from another remote. When traversing the remote
>> snapshots, we check if it exists locally, and if it does, we check if the
>> last verification of it failed. If the local snapshot is broken and the
>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>> overwriting the local one.
>>
>> This is very useful and has been requested a lot, as there is currently
>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>> version of it on their offsite instance.
>>
>> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
>> Originally-by: Shannon Sterz <s.sterz@proxmox.com>
>
>those two trailers probably should be switched to uphold causal ordering.

Makes sense.

>In general, it would be _really_ nice to get some regression testing for
>sync jobs covering the selection with different matching and options.
>The lack of that doesn't have to block your/shannon's patch, but IMO it
>should be added rather sooner than later. Sync is becoming more and more
>complex, and the selection of snapshots is a very important and fundamental
>part of it.

Do you think we could mock ourselves out of this and unit-test it
somehow? For full regression tests we will need Lukas's CI work.

>> ---
>>  pbs-api-types/src/jobs.rs         | 11 +++++
>>  pbs-datastore/src/backup_info.rs  | 13 +++++-
>>  src/api2/config/sync.rs           |  4 ++
>>  src/api2/pull.rs                  |  9 +++-
>>  src/bin/proxmox-backup-manager.rs |  4 +-
>>  src/server/pull.rs                | 68 +++++++++++++++++++++----------
>>  6 files changed, 84 insertions(+), 25 deletions(-)
>>
>> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
>> index 868702bc059e..a3cd68bf5afd 100644
>> --- a/pbs-api-types/src/jobs.rs
>> +++ b/pbs-api-types/src/jobs.rs
>> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>          .minimum(1)
>>          .schema();
>>
>> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
>> +    "If the verification failed for a local snapshot, try to pull its chunks again.",
>
>not only chunks, but also its indexes (which might also be a source of a bad
>verification result, I think)

Yes, correct, I adjusted the description above.

>> +)
>> +.schema();
>> +
>>  #[api(
>>      properties: {
>>          id: {
>> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>              schema: TRANSFER_LAST_SCHEMA,
>>              optional: true,
>>          },
>> +        "resync-corrupt": {
>> +            schema: RESYNC_CORRUPT_SCHEMA,
>> +            optional: true,
>> +        }
>>      }
>>  )]
>>  #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
>> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>>      pub limit: RateLimitConfig,
>>      #[serde(skip_serializing_if = "Option::is_none")]
>>      pub transfer_last: Option<usize>,
>> +    #[serde(skip_serializing_if = "Option::is_none")]
>> +    pub resync_corrupt: Option<bool>,
>>  }
>>
>>  impl SyncJobConfig {
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..2d283dbe5cd9 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>>  use pbs_api_types::{
>> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>>  };
>>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>>          Ok(())
>>      }
>> +
>> +    /// Load the verify state from the manifest.
>> +    pub fn verify_state(&self) -> Option<VerifyState> {
>> +        self.load_manifest().ok().and_then(|(m, _)| {
>
>hmm, just ignoring errors here seems a bit odd, that might be a case where one
>might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a bit
>tedious.

Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now, 
but I'd be wary about treating a failed `load_manifest` the same as a
`VerifyState::Failed`. Especially because currently an unverified
snapshot simply returns `Err(null...)`. So we would first need to extend
`VerifyState` to have an `Unknown` variant.
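
For context, a rough sketch of the direction described above (not the actual
v2 code; note that with this signature a snapshot without any verify_state
entry in the manifest still surfaces as an error that the caller can log):

/// Load the verify state from the manifest, propagating load/parse errors
/// to the caller instead of collapsing them into `None`.
pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
    let (manifest, _) = self.load_manifest()?;
    let verify = manifest.unprotected["verify_state"].clone();
    let state: SnapshotVerifyState = serde_json::from_value(verify)?;
    Ok(state.state)
}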

>> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>
>would be good to update the doc comment of this method to mention the new param
>in the description of steps

done.

>>  ) -> Result<SyncStats, Error> {
>>      let mut sync_stats = SyncStats::default();
>>      let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>>          return Ok(sync_stats);
>>      }
>>
>> +    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
>
>Hmm, but currently you'd only need to get the state if resync_corrupt is true?
>Or at least if the local manifest already exists, if we want to allow re-syncing
>on some other conditions (but that can be adapted once required).
>
>e.g.:
>
>let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);
>
>That would make the checks below a bit easier to read.

Yep, done as well, this makes all the checks easier to read.

>And would it make sense to check the downloaded manifest's verify state, no point
>in re-pulling if the other side is also corrupt, or is that possibility already
>excluded somewhere else in this code path?

This should be covered, because:
  - if the index is broken, we can't even open it
  - if an unencrypted chunk is corrupted, the crc is checked when reading
    it (pbs-client/src/remote_chunk_reader.rs:55)
  - if an encrypted chunk is corrupted, the crc is checked as well 
    (pbs-datastore/src/data_blob.rs:81)

>> +
>>      if manifest_name.exists() {
>>          let manifest_blob = proxmox_lang::try_block!({
>>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>>              format_err!("unable to read local manifest {manifest_name:?} - {err}")
>>          })?;
>>
>> -        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>> +        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
>
>unnecessary addition of parenthesis a == b && (!c && !d) would be fine too and less
>noise to read, so to say.

Changed.

>> +            && (!resync_corrupt && !snapshot_is_corrupt)
>
>This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it
>slightly easier to read for me, and I'm slightly confused as to why it's not
>`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above:
>
>if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
>    // ...
>}
>
>Or why would the expression need to be different here?

Changed this as well after combining the flags.

>> +        {
>>              if !client_log_name.exists() {
>>                  reader.try_download_client_log(&client_log_name).await?;
>>              };
>> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>>          let mut path = snapshot.full_path();
>>          path.push(&item.filename);
>>
>> -        if path.exists() {
>> +        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>>              match ArchiveType::from_path(&item.filename)? {
>>                  ArchiveType::DynamicIndex => {
>>                      let index = DynamicIndexReader::open(&path)?;
>> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>>              }
>>          }
>>
>> +        if resync_corrupt && snapshot_is_corrupt {
>> +            info!("pulling snapshot {} again", snapshot.dir());
>
>maybe include the reason for re-pulling this here, e.g. something like:
>
>"re-syncing snapshot {} due to bad verification result"
>
>And wouldn't that get logged multiple times for backups with multiple archives, like
>e.g. guests with multiple disks? It probably makes more sense to log that before the for
>loop

Yep, you're right, moved it to the beginning of the loop!

>> +        }
>> +
>>          let stats =
>>              pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>>          sync_stats.add(stats);
>> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>>  ) -> Result<SyncStats, Error> {
>>      let (_path, is_new, _snap_lock) = snapshot
>>          .datastore()
>> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>>      let sync_stats = if is_new {
>>          info!("sync snapshot {}", snapshot.dir());
>>
>> -        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
>> +        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>>              Err(err) => {
>>                  if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>>                      snapshot.backup_ns(),
>> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>>          }
>>      } else {
>>          info!("re-sync snapshot {}", snapshot.dir());
>> -        pull_snapshot(reader, snapshot, downloaded_chunks).await?
>> +        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>>      };
>>
>>      Ok(sync_stats)
>> @@ -528,21 +545,24 @@ async fn pull_group(
>>          .enumerate()
>>          .filter(|&(pos, ref dir)| {
>>              source_snapshots.insert(dir.time);
>> -            if last_sync_time > dir.time {
>> -                already_synced_skip_info.update(dir.time);
>> -                return false;
>> -            } else if already_synced_skip_info.count > 0 {
>> -                info!("{already_synced_skip_info}");
>> -                already_synced_skip_info.reset();
>> -                return true;
>> -            }
>> +            // If resync_corrupt is set, we go through all the remote snapshots
>> +            if !params.resync_corrupt {
>
>maybe reverse that and use an explicit return? i.e.:
>
>if params.resync_corrupt {
>    return true; // always check all remote snapshots when resyncing
>}
>
>// rest of the filter
>
>would avoid an indentation level for the non-bypass path and be slightly easier
>to read for me

True, fixed this as well.


Thanks for the review!



* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-17 13:20     ` Gabriel Goller
@ 2024-10-17 14:09       ` Thomas Lamprecht
  2024-10-18  9:08         ` Gabriel Goller
  2024-10-21  8:16       ` Lukas Wagner
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Lamprecht @ 2024-10-17 14:09 UTC (permalink / raw)
  To: Gabriel Goller; +Cc: Proxmox Backup Server development discussion

On 17/10/2024 at 15:20, Gabriel Goller wrote:
> On 17.10.2024 08:15, Thomas Lamprecht wrote:
>> In general, it would be _really_ nice to get some regression testing for
>> sync jobs covering the selection with different matching and options.
>> The lack of that doesn't have to block your/shannon's patch, but IMO it
>> should be added rather sooner than later. Sync is becoming more and more
>> complex, and the selection of snapshots is a very important and fundamental
>> part of it.
> 
> Do you think we could mock ourselves out of this and unit-test it
> somehow? For full regression tests we will need Lukas's CI work.

I hope so, as that is mostly pure application logic. Albeit I currently
can't give you tips for what the best/simplest way for mocking this would be.
Obvious candidates would probably be to pull out the parts for getting local and
remote backup snapshot lists and info into a trait or alternatively
refactoring more logic out that gets the info passed as parameter and call
that in the tests with the test data.

I have no strong preference here and there might be even better options for
this specific cases, but it'd be great if the existing "real" code does not
need to bend backwards to support the mocking and that defining or expanding
the tests isn't too tedious.
FWIW, one option might even be to have the list of snapshots defined and to
generate two directory trees during build that mimic the actual datastores
more closely; that might require less mocking and thus give bigger code coverage.
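
As a very rough illustration of the "pass the info as parameters" route
(hypothetical helper name and simplified rules that ignore transfer-last and
the skip-info logging), the selection predicate could be pulled out of the
filter closure and unit-tested like this:

/// Hypothetical, simplified predicate: should a remote snapshot be
/// considered for pulling at all?
fn should_consider_snapshot(
    resync_corrupt: bool,
    last_sync_time: i64,
    snapshot_time: i64,
) -> bool {
    if resync_corrupt {
        // always look at all remote snapshots so corrupt local ones can be re-synced
        return true;
    }
    // otherwise skip everything older than the last successful sync
    snapshot_time >= last_sync_time
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn resync_corrupt_bypasses_time_filter() {
        assert!(should_consider_snapshot(true, 100, 50));
        assert!(!should_consider_snapshot(false, 100, 50));
        assert!(should_consider_snapshot(false, 100, 100));
    }
}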

>>> +    /// Load the verify state from the manifest.
>>> +    pub fn verify_state(&self) -> Option<VerifyState> {
>>> +        self.load_manifest().ok().and_then(|(m, _)| {
>>
>> hmm, just ignoring errors here seems a bit odd, that might be a case where one
>> might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a bit
>> tedious.
> 
> Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now, 
> but I'd be wary about treating a failed `load_manifest` the same as a
> `VerifyState::Failed`. Especially because currently an unverified
> snapshot simply returns `Err(null...)`. So we would first need to extend
>> `VerifyState` to have an `Unknown` variant.

Yeah, I agree with you not treating a failure to load/parse the manifest
the same as a failed verification state, but the caller can log the error
and thus at least make the user aware of something unexpected happening.

>> And would it make sense to check the downloaded manifest's verify state, no point
>> in re-pulling if the other side is also corrupt, or is that possibility already
>> excluded somewhere else in this code path?
> 
> This should be covered, because:
>   - if the index is broken, we can't even open it
>   - if an unencrypted chunk is corrupted, the crc is checked when reading
>     it (pbs-client/src/remote_chunk_reader.rs:55)
>   - if an encrypted chunk is corrupted, the crc is checked as well 
>     (pbs-datastore/src/data_blob.rs:81)

ok, thanks for confirming this.



* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-17 14:09       ` Thomas Lamprecht
@ 2024-10-18  9:08         ` Gabriel Goller
  0 siblings, 0 replies; 9+ messages in thread
From: Gabriel Goller @ 2024-10-18  9:08 UTC (permalink / raw)
  To: Thomas Lamprecht; +Cc: Proxmox Backup Server development discussion

On 17.10.2024 16:09, Thomas Lamprecht wrote:
>On 17/10/2024 at 15:20, Gabriel Goller wrote:
>> On 17.10.2024 08:15, Thomas Lamprecht wrote:
>>> In general, it would be _really_ nice to get some regression testing for
>>> sync jobs covering the selection with different matching and options.
>>> The lack of that doesn't have to block your/shannon's patch, but IMO it
>>> should be added rather sooner than later. Sync is becoming more and more
>>> complex, and the selection of snapshots is a very important and fundamental
>>> part of it.
>>
>> Do you think we could mock ourselves out of this and unit-test it
>> somehow? For full regression tests we will need Lukas's CI work.
>
>I hope so, as that is mostly pure application logic. Albeit I currently
>can't give you tips on what the best/simplest way of mocking this would be.
>Obvious candidates would probably be to pull the parts that gather the local
>and remote backup snapshot lists and info out into a trait, or alternatively
>to factor out more of the logic so that it gets the info passed as a parameter
>and can be called in the tests with the test data.
>
>I have no strong preference here and there might be even better options for
>this specific case, but it'd be great if the existing "real" code does not
>need to bend over backwards to support the mocking and if defining or expanding
>the tests isn't too tedious.
>FWIW, one option might even be to define the list of snapshots and generate two
>directory trees during the build that mimic the actual datastores more closely;
>that might require less mocking and thus give bigger code coverage.

Interesting, I'll have a look at it when I'm done with my other stuff
:)

>>>> +    /// Load the verify state from the manifest.
>>>> +    pub fn verify_state(&self) -> Option<VerifyState> {
>>>> +        self.load_manifest().ok().and_then(|(m, _)| {
>>>
>>> hmm, just ignoring errors here seems a bit odd, that might be a case where one
>>> might want to re-sync. So I'd prefer wrapping this in a result, even if it's a bit
>>> tedious.
>>
>> Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now,
>> but I'd be wary about treating a failed `load_manifest` the same as a
>> `VerifyState::Failed`. Especially because currently an unverified
>> snapshot simply returns `Err(null...)`. So we would first need to extend
>> `VerifyState` to have an `Unknown` variant.
>
>Yeah, I agree with you on not treating a failure to load/parse the manifest
>the same as a failed verification state, but the caller can log the error
>and thus at least make the user aware that something unexpected happened.

Will do!

>>> And would it make sense to check the downloaded manifest's verify state? No point
>>> in re-pulling if the other side is also corrupt, or is that possibility already
>>> excluded somewhere else in this code path?
>>
>> This should be covered, because:
>>   - if the index is broken, we can't even open it
>>   - if an unencrypted chunk is corrupted, the crc is checked when reading
>>     it (pbs-client/src/remote_chunk_reader.rs:55)
>>   - if an encrypted chunk is corrupted, the crc is checked as well
>>     (pbs-datastore/src/data_blob.rs:81)
>
>ok, thanks for confirming this.

Thanks for the reply!
Posted a v2.


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
  2024-10-17 13:20     ` Gabriel Goller
  2024-10-17 14:09       ` Thomas Lamprecht
@ 2024-10-21  8:16       ` Lukas Wagner
  1 sibling, 0 replies; 9+ messages in thread
From: Lukas Wagner @ 2024-10-21  8:16 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Gabriel Goller,
	Thomas Lamprecht

On  2024-10-17 15:20, Gabriel Goller wrote:
> On 17.10.2024 08:15, Thomas Lamprecht wrote:
>> On 15/10/2024 at 15:18, Gabriel Goller wrote:
>>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>>> by pulling them from another remote. When traversing the remote
>>> snapshots, we check if it exists locally, and if it is, we check if the
>>> last verification of it failed. If the local snapshot is broken and the
>>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>>> overwriting the local one.
>>>
>>> This is very useful and has been requested a lot, as there is currently
>>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>>> version of it on their offsite instance.
>>>
>>> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
>>> Originally-by: Shannon Sterz <s.sterz@proxmox.com>
>>
>> those two trailers probably should be switched to uphold causal ordering.
> 
> Makes sense.
> 
>> In general, it would be _really_ nice to get some regression testing for
>> sync jobs covering the selection with different matching and options.
>> The lack of that doesn't have to block your/Shannon's patch, but IMO it
>> should be added rather sooner than later. Sync is becoming more and more
>> complex, and the selection of snapshots is a very important and fundamental
>> part of it.
> 
> Do you think we could mock ourselves out of this and unit-test it
> somehow? For full regression tests we will need Lukas's CI work.
> 

To chime in here, while it will certainly be possible to test such stuff once the
end-to-end testing tooling is finished (it's getting there), in most cases it is
preferable to test as much as possible in unit tests, as they are:
  - much faster (<<1s, compared to end-to-end tests which could take seconds or even minutes, depending on what they do)
  - easier to write and maintain (you are much closer to the code under test)
  - also serve as documentation for the code (e.g. what contracts a unit of code must uphold)
  - far shorter feedback loop / better DX (cargo test vs. having to set up/run test instance VMs or trigger some CI system)

Especially for something like the snapshot selection in sync jobs, a good test suite might cover
many different permutations and corner cases, which would be *a lot of* work to write
and also take *a long time* to execute in a full bells-and-whistles end-to-end test where you
perform an actual sync job between two real PBS installations.

That being said, it will of course be a challenge to factor out the sync logic to make it testable.
Without being familiar with the code myself, it could involve abstracting the interactions with the
rest of the system behind some trait, moving the sync logic into separate structs/functions, and
using dependency injection to give the sync module/fn a concrete implementation to use (in tests
you'd then use a fake implementation). For existing code these refactoring steps can be
quite intimidating, which is why I am a big fan of writing unit tests *while* writing the
actual implementation (similar to, but not quite, TDD; there you'd write the tests *first*),
as this ensures that the code I'm writing is actually testable.
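
Just to illustrate the dependency-injection part: the unit under test only sees
a trait, and the test injects a small hand-rolled fake. All names below are
invented for illustration and do not exist in the code base.

trait RemoteSource {
    fn list_snapshot_times(&self) -> Vec<i64>;
}

/// Hand-rolled fake used only in tests, no mocking framework needed.
struct FakeRemote(Vec<i64>);

impl RemoteSource for FakeRemote {
    fn list_snapshot_times(&self) -> Vec<i64> {
        self.0.clone()
    }
}

/// The logic under test depends on the trait, not on a concrete remote.
fn count_new_snapshots(remote: &dyn RemoteSource, local: &[i64]) -> usize {
    remote
        .list_snapshot_times()
        .into_iter()
        .filter(|t| !local.contains(t))
        .count()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn skips_already_synced_snapshots() {
        let remote = FakeRemote(vec![1, 2, 3]);
        let local = [1, 2];
        assert_eq!(count_new_snapshots(&remote, &local), 1);
    }
}

Covering more permutations (already synced, corrupt, missing snapshots) is then
just a matter of adding test cases with different fake data.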

-- 
- Lukas


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-10-21  8:16 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
2024-10-17  6:15   ` Thomas Lamprecht
2024-10-17 13:20     ` Gabriel Goller
2024-10-17 14:09       ` Thomas Lamprecht
2024-10-18  9:08         ` Gabriel Goller
2024-10-21  8:16       ` Lukas Wagner
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
