* [pbs-devel] [PATCH proxmox-backup v2 0/3] fix #3786: resync corrupt chunks in sync-job
@ 2024-10-18 9:09 Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Gabriel Goller @ 2024-10-18 9:09 UTC (permalink / raw)
To: pbs-devel
Add an option `resync-corrupt` that resyncs corrupt snapshots when running
a sync job. With this option enabled, the job checks whether the local
snapshot failed its last verification and, if it did, overwrites the local
snapshot with the remote one.
This is quite useful, as we currently don't have an option to "fix"
broken chunks/snapshots in any way, even if a healthy version is on
another (e.g. offsite) instance.
Important things to note: this has a slight performance penalty, as all
the manifests have to be inspected, and a verification job has to be run
beforehand, otherwise we do not know whether a snapshot is healthy.
Note: This series was originally written by Shannon! I just picked it
up, rebased, and fixed the obvious comments on the last series.
Changelog v2 (thanks @Thomas):
- order git trailers
- adjusted schema description to include broken indexes
- change verify_state to return a Result<_,_>
- print error if verify_state is not able to read the state
- update docs on pull_snapshot function
- simplify logic by combining flags
- move log line out of loop to only print once that we resync the snapshot
Changelog since RFC (Shannon's work):
- rename option from deep-sync to resync-corrupt
- rebase on latest master (and change implementation details, as a
lot has changed around sync-jobs)
proxmox-backup:
Gabriel Goller (3):
fix #3786: api: add resync-corrupt option to sync jobs
fix #3786: ui/cli: add resync-corrupt option on sync-jobs
fix #3786: docs: add resync-corrupt option to sync-job
docs/managing-remotes.rst | 6 ++++
pbs-api-types/src/jobs.rs | 10 ++++++
pbs-datastore/src/backup_info.rs | 13 +++++++-
src/api2/config/sync.rs | 4 +++
src/api2/pull.rs | 9 +++++-
src/bin/proxmox-backup-manager.rs | 13 ++++++--
src/server/pull.rs | 52 ++++++++++++++++++++++++++-----
www/window/SyncJobEdit.js | 11 +++++++
8 files changed, 106 insertions(+), 12 deletions(-)
Summary over all repositories:
8 files changed, 106 insertions(+), 12 deletions(-)
--
Generated by git-murpp 0.7.1
* [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
2024-10-18 9:09 [pbs-devel] [PATCH proxmox-backup v2 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
@ 2024-10-18 9:09 ` Gabriel Goller
2024-11-04 11:51 ` Fabian Grünbichler
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
2 siblings, 1 reply; 8+ messages in thread
From: Gabriel Goller @ 2024-10-18 9:09 UTC (permalink / raw)
To: pbs-devel
This option allows us to "fix" corrupt snapshots (and/or their chunks)
by pulling them from another remote. When traversing the remote
snapshots, we check whether each one exists locally and, if it does,
whether its last verification failed. If the local snapshot is broken and the
`resync-corrupt` option is turned on, we pull in the remote snapshot,
overwriting the local one.
This is very useful and has been requested a lot, as there is currently
no way to "fix" corrupt chunks/snapshots even if the user has a healthy
version of it on their offsite instance.
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
pbs-api-types/src/jobs.rs | 10 ++++++
pbs-datastore/src/backup_info.rs | 13 +++++++-
src/api2/config/sync.rs | 4 +++
src/api2/pull.rs | 9 +++++-
src/bin/proxmox-backup-manager.rs | 4 +--
src/server/pull.rs | 52 ++++++++++++++++++++++++++-----
6 files changed, 80 insertions(+), 12 deletions(-)
diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
index 868702bc059e..58f739ad00b5 100644
--- a/pbs-api-types/src/jobs.rs
+++ b/pbs-api-types/src/jobs.rs
@@ -498,6 +498,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
.minimum(1)
.schema();
+pub const RESYNC_CORRUPT_SCHEMA: Schema =
+ BooleanSchema::new("If the verification failed for a local snapshot, try to pull it again.")
+ .schema();
+
#[api(
properties: {
id: {
@@ -552,6 +556,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
schema: TRANSFER_LAST_SCHEMA,
optional: true,
},
+ "resync-corrupt": {
+ schema: RESYNC_CORRUPT_SCHEMA,
+ optional: true,
+ }
}
)]
#[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
@@ -585,6 +593,8 @@ pub struct SyncJobConfig {
pub limit: RateLimitConfig,
#[serde(skip_serializing_if = "Option::is_none")]
pub transfer_last: Option<usize>,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub resync_corrupt: Option<bool>,
}
impl SyncJobConfig {
diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index 414ec878d01a..c86fbb7568ab 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
use pbs_api_types::{
- Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
+ Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
+ BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
};
use pbs_config::{open_backup_lockfile, BackupLockGuard};
@@ -583,6 +584,16 @@ impl BackupDir {
Ok(())
}
+
+ /// Load the verify state from the manifest.
+ pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
+ self.load_manifest().and_then(|(m, _)| {
+ let verify = m.unprotected["verify_state"].clone();
+ serde_json::from_value::<SnapshotVerifyState>(verify)
+ .map(|svs| svs.state)
+ .map_err(Into::into)
+ })
+ }
}
impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
index 6fdc69a9e645..fa9db92f3d11 100644
--- a/src/api2/config/sync.rs
+++ b/src/api2/config/sync.rs
@@ -368,6 +368,9 @@ pub fn update_sync_job(
if let Some(transfer_last) = update.transfer_last {
data.transfer_last = Some(transfer_last);
}
+ if let Some(resync_corrupt) = update.resync_corrupt {
+ data.resync_corrupt = Some(resync_corrupt);
+ }
if update.limit.rate_in.is_some() {
data.limit.rate_in = update.limit.rate_in;
@@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
ns: None,
owner: Some(write_auth_id.clone()),
comment: None,
+ resync_corrupt: None,
remove_vanished: None,
max_depth: None,
group_filter: None,
diff --git a/src/api2/pull.rs b/src/api2/pull.rs
index e733c9839e3a..0d4be0e2d228 100644
--- a/src/api2/pull.rs
+++ b/src/api2/pull.rs
@@ -10,7 +10,7 @@ use pbs_api_types::{
Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
- TRANSFER_LAST_SCHEMA,
+ RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
};
use pbs_config::CachedUserInfo;
use proxmox_human_byte::HumanByte;
@@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
sync_job.group_filter.clone(),
sync_job.limit.clone(),
sync_job.transfer_last,
+ sync_job.resync_corrupt,
)
}
}
@@ -240,6 +241,10 @@ pub fn do_sync_job(
schema: TRANSFER_LAST_SCHEMA,
optional: true,
},
+ "resync-corrupt": {
+ schema: RESYNC_CORRUPT_SCHEMA,
+ optional: true,
+ },
},
},
access: {
@@ -264,6 +269,7 @@ async fn pull(
group_filter: Option<Vec<GroupFilter>>,
limit: RateLimitConfig,
transfer_last: Option<usize>,
+ resync_corrupt: Option<bool>,
rpcenv: &mut dyn RpcEnvironment,
) -> Result<String, Error> {
let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
@@ -301,6 +307,7 @@ async fn pull(
group_filter,
limit,
transfer_last,
+ resync_corrupt,
)?;
// fixme: set to_stdout to false?
diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
index 420e96665662..38a1cf0f5881 100644
--- a/src/bin/proxmox-backup-manager.rs
+++ b/src/bin/proxmox-backup-manager.rs
@@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component;
use pbs_api_types::{
BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA,
- REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA,
- VERIFICATION_OUTDATED_AFTER_SCHEMA,
+ REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
+ UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA,
};
use pbs_client::{display_task_log, view_task_result};
use pbs_config::sync;
diff --git a/src/server/pull.rs b/src/server/pull.rs
index 3117f7d2c960..b2dd15d9d6db 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
use std::time::SystemTime;
use anyhow::{bail, format_err, Error};
+use nom::combinator::verify;
use proxmox_human_byte::HumanByte;
use tracing::info;
use pbs_api_types::{
print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation,
- RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP,
+ RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT,
+ PRIV_DATASTORE_BACKUP,
};
use pbs_client::BackupRepository;
use pbs_config::CachedUserInfo;
@@ -55,6 +57,8 @@ pub(crate) struct PullParameters {
group_filter: Vec<GroupFilter>,
/// How many snapshots should be transferred at most (taking the newest N snapshots)
transfer_last: Option<usize>,
+ /// Whether to re-sync corrupted snapshots
+ resync_corrupt: bool,
}
impl PullParameters {
@@ -72,12 +76,14 @@ impl PullParameters {
group_filter: Option<Vec<GroupFilter>>,
limit: RateLimitConfig,
transfer_last: Option<usize>,
+ resync_corrupt: Option<bool>,
) -> Result<Self, Error> {
if let Some(max_depth) = max_depth {
ns.check_max_depth(max_depth)?;
remote_ns.check_max_depth(max_depth)?;
};
let remove_vanished = remove_vanished.unwrap_or(false);
+ let resync_corrupt = resync_corrupt.unwrap_or(false);
let source: Arc<dyn SyncSource> = if let Some(remote) = remote {
let (remote_config, _digest) = pbs_config::remote::config()?;
@@ -116,6 +122,7 @@ impl PullParameters {
max_depth,
group_filter,
transfer_last,
+ resync_corrupt,
})
}
}
@@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
target.cond_touch_chunk(&info.digest, false)
})?;
if chunk_exists {
- //info!("chunk {} exists {}", pos, hex::encode(digest));
+ //info!("chunk exists {}", hex::encode(info.digest));
return Ok::<_, Error>(());
}
+
//info!("sync {} chunk {}", pos, hex::encode(digest));
let chunk = chunk_reader.read_raw_chunk(&info.digest).await?;
let raw_size = chunk.raw_size() as usize;
@@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
/// - (Re)download the manifest
/// -- if it matches, only download log and treat snapshot as already synced
/// - Iterate over referenced files
-/// -- if file already exists, verify contents
+/// -- if file already exists, verify contents or pull again if last
+/// verification failed and `resync_corrupt` is true
/// -- if not, pull it from the remote
/// - Download log if not already existing
async fn pull_snapshot<'a>(
reader: Arc<dyn SyncSourceReader + 'a>,
snapshot: &'a pbs_datastore::BackupDir,
downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+ resync_corrupt: bool,
) -> Result<SyncStats, Error> {
let mut sync_stats = SyncStats::default();
let mut manifest_name = snapshot.full_path();
@@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
return Ok(sync_stats);
}
+ let must_resync_existing = resync_corrupt
+ && snapshot
+ .verify_state()
+ .inspect_err(|err| {
+ tracing::error!("Failed to check verification state of snapshot: {err:?}")
+ })
+ .is_ok_and(|state| state == VerifyState::Failed);
+
if manifest_name.exists() {
let manifest_blob = proxmox_lang::try_block!({
let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
@@ -365,7 +383,7 @@ async fn pull_snapshot<'a>(
format_err!("unable to read local manifest {manifest_name:?} - {err}")
})?;
- if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
+ if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
if !client_log_name.exists() {
reader.try_download_client_log(&client_log_name).await?;
};
@@ -377,11 +395,18 @@ async fn pull_snapshot<'a>(
let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
+ if must_resync_existing {
+ info!(
+ "re-syncing snapshot {} due to bad verification result",
+ snapshot.dir()
+ );
+ }
+
for item in manifest.files() {
let mut path = snapshot.full_path();
path.push(&item.filename);
- if path.exists() {
+ if !must_resync_existing && path.exists() {
match ArchiveType::from_path(&item.filename)? {
ArchiveType::DynamicIndex => {
let index = DynamicIndexReader::open(&path)?;
@@ -443,6 +468,7 @@ async fn pull_snapshot_from<'a>(
reader: Arc<dyn SyncSourceReader + 'a>,
snapshot: &'a pbs_datastore::BackupDir,
downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+ resync_corrupt: bool,
) -> Result<SyncStats, Error> {
let (_path, is_new, _snap_lock) = snapshot
.datastore()
@@ -451,7 +477,7 @@ async fn pull_snapshot_from<'a>(
let sync_stats = if is_new {
info!("sync snapshot {}", snapshot.dir());
- match pull_snapshot(reader, snapshot, downloaded_chunks).await {
+ match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
Err(err) => {
if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
snapshot.backup_ns(),
@@ -469,7 +495,7 @@ async fn pull_snapshot_from<'a>(
}
} else {
info!("re-sync snapshot {}", snapshot.dir());
- pull_snapshot(reader, snapshot, downloaded_chunks).await?
+ pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
};
Ok(sync_stats)
@@ -528,6 +554,10 @@ async fn pull_group(
.enumerate()
.filter(|&(pos, ref dir)| {
source_snapshots.insert(dir.time);
+ // If resync_corrupt is set, we go through all the remote snapshots
+ if params.resync_corrupt {
+ return true;
+ }
if last_sync_time > dir.time {
already_synced_skip_info.update(dir.time);
return false;
@@ -566,7 +596,13 @@ async fn pull_group(
.source
.reader(source_namespace, &from_snapshot)
.await?;
- let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
+ let result = pull_snapshot_from(
+ reader,
+ &to_snapshot,
+ downloaded_chunks.clone(),
+ params.resync_corrupt,
+ )
+ .await;
progress.done_snapshots = pos as u64 + 1;
info!("percentage done: {progress}");
--
2.39.5
* [pbs-devel] [PATCH proxmox-backup v2 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs
2024-10-18 9:09 [pbs-devel] [PATCH proxmox-backup v2 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
@ 2024-10-18 9:09 ` Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
2 siblings, 0 replies; 8+ messages in thread
From: Gabriel Goller @ 2024-10-18 9:09 UTC (permalink / raw)
To: pbs-devel
Add the `resync-corrupt` option to the UI and the
`proxmox-backup-manager` CLI. It is listed in the `Advanced` section
because it slows the sync job down and has no effect if no verification
job was run beforehand.
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
src/bin/proxmox-backup-manager.rs | 9 +++++++++
www/window/SyncJobEdit.js | 11 +++++++++++
2 files changed, 20 insertions(+)
diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
index 38a1cf0f5881..08728e9d7250 100644
--- a/src/bin/proxmox-backup-manager.rs
+++ b/src/bin/proxmox-backup-manager.rs
@@ -339,6 +339,10 @@ fn task_mgmt_cli() -> CommandLineInterface {
schema: TRANSFER_LAST_SCHEMA,
optional: true,
},
+ "resync-corrupt": {
+ schema: RESYNC_CORRUPT_SCHEMA,
+ optional: true,
+ },
}
}
)]
@@ -355,6 +359,7 @@ async fn pull_datastore(
group_filter: Option<Vec<GroupFilter>>,
limit: RateLimitConfig,
transfer_last: Option<usize>,
+ resync_corrupt: Option<bool>,
param: Value,
) -> Result<Value, Error> {
let output_format = get_output_format(¶m);
@@ -391,6 +396,10 @@ async fn pull_datastore(
args["transfer-last"] = json!(transfer_last)
}
+ if let Some(resync_corrupt) = resync_corrupt {
+ args["resync-corrupt"] = Value::from(resync_corrupt);
+ }
+
let mut limit_json = json!(limit);
let limit_map = limit_json
.as_object_mut()
diff --git a/www/window/SyncJobEdit.js b/www/window/SyncJobEdit.js
index 6543995e8800..a3c497fc2185 100644
--- a/www/window/SyncJobEdit.js
+++ b/www/window/SyncJobEdit.js
@@ -321,6 +321,17 @@ Ext.define('PBS.window.SyncJobEdit', {
deleteEmpty: '{!isCreate}',
},
},
+ {
+ fieldLabel: gettext('Resync corrupt snapshots'),
+ xtype: 'proxmoxcheckbox',
+ name: 'resync-corrupt',
+ autoEl: {
+ tag: 'div',
'data-qtip': gettext('Re-sync snapshots whose verification failed.'),
+ },
+ uncheckedValue: false,
+ value: false,
+ },
],
},
{
--
2.39.5
* [pbs-devel] [PATCH proxmox-backup v2 3/3] fix #3786: docs: add resync-corrupt option to sync-job
2024-10-18 9:09 [pbs-devel] [PATCH proxmox-backup v2 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
@ 2024-10-18 9:09 ` Gabriel Goller
2 siblings, 0 replies; 8+ messages in thread
From: Gabriel Goller @ 2024-10-18 9:09 UTC (permalink / raw)
To: pbs-devel
Add a short section explaining the `resync-corrupt` option on the
sync job.
Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
docs/managing-remotes.rst | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/docs/managing-remotes.rst b/docs/managing-remotes.rst
index 38ccd5af25d5..b441daa9b0bd 100644
--- a/docs/managing-remotes.rst
+++ b/docs/managing-remotes.rst
@@ -132,6 +132,12 @@ For mixing include and exclude filter, following rules apply:
.. note:: The ``protected`` flag of remote backup snapshots will not be synced.
+Enabling the advanced option 'resync-corrupt' will re-sync all snapshots that have
+failed to verify during the last :ref:`maintenance_verification`. Hence, a verification
+job needs to be run before a sync job with 'resync-corrupt' can be carried out. Be aware
+that a sync job with 'resync-corrupt' enabled needs to check the manifests of all
+snapshots in a datastore and might take much longer than regular sync jobs.
+
Namespace Support
^^^^^^^^^^^^^^^^^
--
2.39.5
* Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
2024-10-18 9:09 ` [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
@ 2024-11-04 11:51 ` Fabian Grünbichler
2024-11-04 16:02 ` Gabriel Goller
0 siblings, 1 reply; 8+ messages in thread
From: Fabian Grünbichler @ 2024-11-04 11:51 UTC (permalink / raw)
To: Proxmox Backup Server development discussion
this doesn't really do what it says on the tin, see below.
On October 18, 2024 11:09 am, Gabriel Goller wrote:
> This option allows us to "fix" corrupt snapshots (and/or their chunks)
> by pulling them from another remote. When traversing the remote
> snapshots, we check whether each one exists locally and, if it does,
> whether its last verification failed. If the local snapshot is broken and the
> `resync-corrupt` option is turned on, we pull in the remote snapshot,
> overwriting the local one.
>
> This is very useful and has been requested a lot, as there is currently
> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
> version of it on their offsite instance.
>
> Originally-by: Shannon Sterz <s.sterz@proxmox.com>
> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
> ---
> pbs-api-types/src/jobs.rs | 10 ++++++
> pbs-datastore/src/backup_info.rs | 13 +++++++-
> src/api2/config/sync.rs | 4 +++
> src/api2/pull.rs | 9 +++++-
> src/bin/proxmox-backup-manager.rs | 4 +--
> src/server/pull.rs | 52 ++++++++++++++++++++++++++-----
> 6 files changed, 80 insertions(+), 12 deletions(-)
>
> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
> index 868702bc059e..58f739ad00b5 100644
> --- a/pbs-api-types/src/jobs.rs
> +++ b/pbs-api-types/src/jobs.rs
> @@ -498,6 +498,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
> .minimum(1)
> .schema();
>
> +pub const RESYNC_CORRUPT_SCHEMA: Schema =
> + BooleanSchema::new("If the verification failed for a local snapshot, try to pull it again.")
> + .schema();
> +
> #[api(
> properties: {
> id: {
> @@ -552,6 +556,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
> schema: TRANSFER_LAST_SCHEMA,
> optional: true,
> },
> + "resync-corrupt": {
> + schema: RESYNC_CORRUPT_SCHEMA,
> + optional: true,
> + }
> }
> )]
> #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
> @@ -585,6 +593,8 @@ pub struct SyncJobConfig {
> pub limit: RateLimitConfig,
> #[serde(skip_serializing_if = "Option::is_none")]
> pub transfer_last: Option<usize>,
> + #[serde(skip_serializing_if = "Option::is_none")]
> + pub resync_corrupt: Option<bool>,
> }
>
> impl SyncJobConfig {
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index 414ec878d01a..c86fbb7568ab 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>
> use pbs_api_types::{
> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
> };
> use pbs_config::{open_backup_lockfile, BackupLockGuard};
>
> @@ -583,6 +584,16 @@ impl BackupDir {
>
> Ok(())
> }
> +
> + /// Load the verify state from the manifest.
> + pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
> + self.load_manifest().and_then(|(m, _)| {
> + let verify = m.unprotected["verify_state"].clone();
> + serde_json::from_value::<SnapshotVerifyState>(verify)
> + .map(|svs| svs.state)
> + .map_err(Into::into)
> + })
> + }
wouldn't it make more sense to have this as a getter for an optional
SnapshotVerifyState on the BackupManifest?
then it could go into its own commit, other call sites that load the
verify state from a manifest could be adapted to it, and then this
commit can also start using it?
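e.g. something along these lines (untested sketch, assuming the verify
state keeps living in the manifest's `unprotected` part like it does now -
the Option would cover snapshots that were never verified):

impl BackupManifest {
    /// sketch only: expose the verify state stored in the unprotected
    /// part of the manifest, if there is one
    pub fn verify_state(&self) -> Result<Option<SnapshotVerifyState>, anyhow::Error> {
        let verify = self.unprotected["verify_state"].clone();
        if verify.is_null() {
            return Ok(None);
        }
        Ok(Some(serde_json::from_value::<SnapshotVerifyState>(verify)?))
    }
}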
also see the comment further below about how the current implementation
is very noisy if snapshots are newly synced as opposed to resynced..
> }
>
> impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
> index 6fdc69a9e645..fa9db92f3d11 100644
> --- a/src/api2/config/sync.rs
> +++ b/src/api2/config/sync.rs
> @@ -368,6 +368,9 @@ pub fn update_sync_job(
> if let Some(transfer_last) = update.transfer_last {
> data.transfer_last = Some(transfer_last);
> }
> + if let Some(resync_corrupt) = update.resync_corrupt {
> + data.resync_corrupt = Some(resync_corrupt);
> + }
>
> if update.limit.rate_in.is_some() {
> data.limit.rate_in = update.limit.rate_in;
> @@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
> ns: None,
> owner: Some(write_auth_id.clone()),
> comment: None,
> + resync_corrupt: None,
> remove_vanished: None,
> max_depth: None,
> group_filter: None,
> diff --git a/src/api2/pull.rs b/src/api2/pull.rs
> index e733c9839e3a..0d4be0e2d228 100644
> --- a/src/api2/pull.rs
> +++ b/src/api2/pull.rs
> @@ -10,7 +10,7 @@ use pbs_api_types::{
> Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
> GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
> PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
> - TRANSFER_LAST_SCHEMA,
> + RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
> };
> use pbs_config::CachedUserInfo;
> use proxmox_human_byte::HumanByte;
> @@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
> sync_job.group_filter.clone(),
> sync_job.limit.clone(),
> sync_job.transfer_last,
> + sync_job.resync_corrupt,
> )
> }
> }
> @@ -240,6 +241,10 @@ pub fn do_sync_job(
> schema: TRANSFER_LAST_SCHEMA,
> optional: true,
> },
> + "resync-corrupt": {
> + schema: RESYNC_CORRUPT_SCHEMA,
> + optional: true,
> + },
> },
> },
> access: {
> @@ -264,6 +269,7 @@ async fn pull(
> group_filter: Option<Vec<GroupFilter>>,
> limit: RateLimitConfig,
> transfer_last: Option<usize>,
> + resync_corrupt: Option<bool>,
> rpcenv: &mut dyn RpcEnvironment,
> ) -> Result<String, Error> {
> let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
> @@ -301,6 +307,7 @@ async fn pull(
> group_filter,
> limit,
> transfer_last,
> + resync_corrupt,
> )?;
>
> // fixme: set to_stdout to false?
> diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
> index 420e96665662..38a1cf0f5881 100644
> --- a/src/bin/proxmox-backup-manager.rs
> +++ b/src/bin/proxmox-backup-manager.rs
> @@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component;
> use pbs_api_types::{
> BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
> GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA,
> - REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA,
> - VERIFICATION_OUTDATED_AFTER_SCHEMA,
> + REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
> + UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA,
> };
> use pbs_client::{display_task_log, view_task_result};
> use pbs_config::sync;
> diff --git a/src/server/pull.rs b/src/server/pull.rs
> index 3117f7d2c960..b2dd15d9d6db 100644
> --- a/src/server/pull.rs
> +++ b/src/server/pull.rs
> @@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
> use std::time::SystemTime;
>
> use anyhow::{bail, format_err, Error};
> +use nom::combinator::verify;
I think this snuck in ;)
> use proxmox_human_byte::HumanByte;
> use tracing::info;
>
> use pbs_api_types::{
> print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation,
> - RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP,
> + RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT,
> + PRIV_DATASTORE_BACKUP,
> };
> use pbs_client::BackupRepository;
> use pbs_config::CachedUserInfo;
> @@ -55,6 +57,8 @@ pub(crate) struct PullParameters {
> group_filter: Vec<GroupFilter>,
> /// How many snapshots should be transferred at most (taking the newest N snapshots)
> transfer_last: Option<usize>,
> + /// Whether to re-sync corrupted snapshots
> + resync_corrupt: bool,
> }
>
> impl PullParameters {
> @@ -72,12 +76,14 @@ impl PullParameters {
> group_filter: Option<Vec<GroupFilter>>,
> limit: RateLimitConfig,
> transfer_last: Option<usize>,
> + resync_corrupt: Option<bool>,
> ) -> Result<Self, Error> {
> if let Some(max_depth) = max_depth {
> ns.check_max_depth(max_depth)?;
> remote_ns.check_max_depth(max_depth)?;
> };
> let remove_vanished = remove_vanished.unwrap_or(false);
> + let resync_corrupt = resync_corrupt.unwrap_or(false);
>
> let source: Arc<dyn SyncSource> = if let Some(remote) = remote {
> let (remote_config, _digest) = pbs_config::remote::config()?;
> @@ -116,6 +122,7 @@ impl PullParameters {
> max_depth,
> group_filter,
> transfer_last,
> + resync_corrupt,
> })
> }
> }
> @@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
> target.cond_touch_chunk(&info.digest, false)
> })?;
> if chunk_exists {
> - //info!("chunk {} exists {}", pos, hex::encode(digest));
> + //info!("chunk exists {}", hex::encode(info.digest));
this
> return Ok::<_, Error>(());
> }
> +
and this as well?
> //info!("sync {} chunk {}", pos, hex::encode(digest));
> let chunk = chunk_reader.read_raw_chunk(&info.digest).await?;
> let raw_size = chunk.raw_size() as usize;
> @@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
> /// - (Re)download the manifest
> /// -- if it matches, only download log and treat snapshot as already synced
> /// - Iterate over referenced files
> -/// -- if file already exists, verify contents
> +/// -- if file already exists, verify contents or pull again if last
> +/// verification failed and `resync_corrupt` is true
> /// -- if not, pull it from the remote
> /// - Download log if not already existing
> async fn pull_snapshot<'a>(
> reader: Arc<dyn SyncSourceReader + 'a>,
> snapshot: &'a pbs_datastore::BackupDir,
> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
> + resync_corrupt: bool,
> ) -> Result<SyncStats, Error> {
> let mut sync_stats = SyncStats::default();
> let mut manifest_name = snapshot.full_path();
> @@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
> return Ok(sync_stats);
> }
>
I think this part here is somewhat wrong ordering wise, or at least,
unnecessarily expensive..
if resync_corrupt is enabled, we want to (in this order!)
- check the local snapshot for corruption, if it exists
- if it is corrupt, we proceed with resyncing
- if not, we only proceed with resyncing if it is the last snapshot in
this group, and return early otherwise
that way, we avoid redownloading all the manifests.. but see further
below for another issue with the current implementation..
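e.g. something like this right after manifest_name is set up, before the
tmp manifest gets downloaded (untested sketch, re-using the helper
introduced by this patch):

// consult the local state first, nothing remote is needed for this
let must_resync_existing = resync_corrupt
    && manifest_name.exists()
    && snapshot
        .verify_state()
        .is_ok_and(|state| state == VerifyState::Failed);

an existing snapshot that turns out to be healthy (and is not the newest
one of its group) could then return Ok(sync_stats) right there, without
the remote manifest ever being fetched.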
> + let must_resync_existing = resync_corrupt
> + && snapshot
> + .verify_state()
> + .inspect_err(|err| {
> + tracing::error!("Failed to check verification state of snapshot: {err:?}")
2024-11-04T12:34:57+01:00: Failed to check verification state of snapshot: unable to load blob '"/tank/pbs/ns/foobar/ns/test/ns/another_test/vm/900/2023-04-06T14:36:00Z/index.json.blob"' - No such file or directory (os error 2)
this seems to be very noisy for newly synced snapshots, because the
helper is implemented on BackupInfo instead of on BackupManifest..
> + })
> + .is_ok_and(|state| state == VerifyState::Failed);
> +
> if manifest_name.exists() {
> let manifest_blob = proxmox_lang::try_block!({
> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
> @@ -365,7 +383,7 @@ async fn pull_snapshot<'a>(
> format_err!("unable to read local manifest {manifest_name:?} - {err}")
> })?;
>
> - if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
> + if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
> if !client_log_name.exists() {
> reader.try_download_client_log(&client_log_name).await?;
> };
> @@ -377,11 +395,18 @@ async fn pull_snapshot<'a>(
>
> let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
>
> + if must_resync_existing {
> + info!(
> + "re-syncing snapshot {} due to bad verification result",
> + snapshot.dir()
> + );
> + }
> +
> for item in manifest.files() {
> let mut path = snapshot.full_path();
> path.push(&item.filename);
>
> - if path.exists() {
> + if !must_resync_existing && path.exists() {
> match ArchiveType::from_path(&item.filename)? {
> ArchiveType::DynamicIndex => {
> let index = DynamicIndexReader::open(&path)?;
> @@ -443,6 +468,7 @@ async fn pull_snapshot_from<'a>(
> reader: Arc<dyn SyncSourceReader + 'a>,
> snapshot: &'a pbs_datastore::BackupDir,
> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
> + resync_corrupt: bool,
> ) -> Result<SyncStats, Error> {
> let (_path, is_new, _snap_lock) = snapshot
> .datastore()
> @@ -451,7 +477,7 @@ async fn pull_snapshot_from<'a>(
> let sync_stats = if is_new {
> info!("sync snapshot {}", snapshot.dir());
>
> - match pull_snapshot(reader, snapshot, downloaded_chunks).await {
> + match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
> Err(err) => {
> if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
> snapshot.backup_ns(),
> @@ -469,7 +495,7 @@ async fn pull_snapshot_from<'a>(
> }
> } else {
> info!("re-sync snapshot {}", snapshot.dir());
> - pull_snapshot(reader, snapshot, downloaded_chunks).await?
> + pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
> };
>
> Ok(sync_stats)
> @@ -528,6 +554,10 @@ async fn pull_group(
> .enumerate()
> .filter(|&(pos, ref dir)| {
> source_snapshots.insert(dir.time);
> + // If resync_corrupt is set, we go through all the remote snapshots
> + if params.resync_corrupt {
> + return true;
> + }
alternatively, we could check the local manifest here, and only include
existing snapshots with a failed verification state, the last one and
new ones? that way, we'd get more meaningful progress stats as well..
because right now, this will not only resync existing corrupt snapshots,
but also ones that have been pruned locally, but not on the source
(i.e., the other proposed "fixing" sync mode that syncs "missing"
old snapshots, not just corrupt ones).
> if last_sync_time > dir.time {
> already_synced_skip_info.update(dir.time);
> return false;
> @@ -566,7 +596,13 @@ async fn pull_group(
> .source
> .reader(source_namespace, &from_snapshot)
> .await?;
> - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
> + let result = pull_snapshot_from(
> + reader,
> + &to_snapshot,
> + downloaded_chunks.clone(),
> + params.resync_corrupt,
> + )
> + .await;
>
> progress.done_snapshots = pos as u64 + 1;
> info!("percentage done: {progress}");
> --
> 2.39.5
* Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
2024-11-04 11:51 ` Fabian Grünbichler
@ 2024-11-04 16:02 ` Gabriel Goller
2024-11-05 7:20 ` Fabian Grünbichler
0 siblings, 1 reply; 8+ messages in thread
From: Gabriel Goller @ 2024-11-04 16:02 UTC (permalink / raw)
To: Fabian Grünbichler; +Cc: Proxmox Backup Server development discussion
On 04.11.2024 12:51, Fabian Grünbichler wrote:
>this doesn't really do what it says on the tin, see below.
>
>On October 18, 2024 11:09 am, Gabriel Goller wrote:
>> [snip]
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..c86fbb7568ab 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>> use pbs_api_types::{
>> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> };
>> use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>> Ok(())
>> }
>> +
>> + /// Load the verify state from the manifest.
>> + pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
>> + self.load_manifest().and_then(|(m, _)| {
>> + let verify = m.unprotected["verify_state"].clone();
>> + serde_json::from_value::<SnapshotVerifyState>(verify)
>> + .map(|svs| svs.state)
>> + .map_err(Into::into)
>> + })
>> + }
>
>wouldn't it make more sense to have this as a getter for an optional
>SnapshotVerifyState on the BackupManifest?
>
>then it could go into its own commit, other call sites that load the
>verify state from a manifest could be adapted to it, and then this
>commit can also start using it?
You're right, the BackupManifest is actually loaded twice here. So I
could move this function to a getter on BackupManifest and then move its
construction (src/server/pull.rs:396):
let manifest = BackupManifest::try_from(tmp_manifest_blob)?;
up to (src/server/pull.rs:365). And instead of having the try_block that
reads the manifest, just call BackupManifest::try_from.
I'll see if I can make this work...
>also see the comment further below about how the current implementation
>is very noisy if snapshots are newly synced as opposed to resynced..
>
>> }
>>
>> impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
>> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
>> index 6fdc69a9e645..fa9db92f3d11 100644
>> --- a/src/api2/config/sync.rs
>> +++ b/src/api2/config/sync.rs
>> [snip]
>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>> index 3117f7d2c960..b2dd15d9d6db 100644
>> --- a/src/server/pull.rs
>> +++ b/src/server/pull.rs
>> @@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
>> use std::time::SystemTime;
>>
>> use anyhow::{bail, format_err, Error};
>> +use nom::combinator::verify;
>
>I think this snuck in ;)
Oops, my fault :)
>> [snip]
>> @@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
>> target.cond_touch_chunk(&info.digest, false)
>> })?;
>> if chunk_exists {
>> - //info!("chunk {} exists {}", pos, hex::encode(digest));
>> + //info!("chunk exists {}", hex::encode(info.digest));
>
>this
>
>> return Ok::<_, Error>(());
>> }
>> +
>
>and this as well?
Yep, removed both, thanks for the heads-up!
>> [snip]
>> @@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
>> /// - (Re)download the manifest
>> /// -- if it matches, only download log and treat snapshot as already synced
>> /// - Iterate over referenced files
>> -/// -- if file already exists, verify contents
>> +/// -- if file already exists, verify contents or pull again if last
>> +/// verification failed and `resync_corrupt` is true
>> /// -- if not, pull it from the remote
>> /// - Download log if not already existing
>> async fn pull_snapshot<'a>(
>> reader: Arc<dyn SyncSourceReader + 'a>,
>> snapshot: &'a pbs_datastore::BackupDir,
>> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> + resync_corrupt: bool,
>> ) -> Result<SyncStats, Error> {
>> let mut sync_stats = SyncStats::default();
>> let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
>> return Ok(sync_stats);
>> }
>>
>
>I think this part here is somewhat wrong ordering wise, or at least,
>unnecessarily expensive..
>
>if resync_corrupt is enabled, we want to (in this order!)
>- check the local snapshot for corruption, if it exists
>- if it is corrupt, we proceed with resyncing
>- if not, we only proceed with resyncing if it is the last snapshot in
> this group, and return early otherwise
>
>that way, we avoid redownloading all the manifests.. but see further
>below for another issue with the current implementation..
>
>> + let must_resync_existing = resync_corrupt
>> + && snapshot
>> + .verify_state()
>> + .inspect_err(|err| {
>> + tracing::error!("Failed to check verification state of snapshot: {err:?}")
>
>2024-11-04T12:34:57+01:00: Failed to check verification state of snapshot: unable to load blob '"/tank/pbs/ns/foobar/ns/test/ns/another_test/vm/900/2023-04-06T14:36:00Z/index.json.blob"' - No such file or directory (os error 2)
>
>this seems to be very noisy for newly synced snapshots, because the
>helper is implemented on BackupInfo instead of on BackupManifest..
True, but I think this is mostly because new backups don't have a
verify_state yet (and that doesn't necessarily mean == bad). Removed the
log line as it's silly anyway :)
>> + })
>> + .is_ok_and(|state| state == VerifyState::Failed);
>> +
>> if manifest_name.exists() {
>> let manifest_blob = proxmox_lang::try_block!({
>> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> [snip]
>> @@ -528,6 +554,10 @@ async fn pull_group(
>> .enumerate()
>> .filter(|&(pos, ref dir)| {
>> source_snapshots.insert(dir.time);
>> + // If resync_corrupt is set, we go through all the remote snapshots
>> + if params.resync_corrupt {
>> + return true;
>> + }
>
>alternatively, we could check the local manifest here, and only include
>existing snapshots with a failed verification state, the last one and
>new ones? that way, we'd get more meaningful progress stats as well..
That's true, that would be cleaner. The downside is that we would have
to open/parse the BackupManifest twice.
I could write something like:
if params.resync_corrupt {
let local_dir = params.target.store.backup_dir(target_ns.clone(), dir.clone());
if let Ok(dir) = local_dir {
let verify_state = dir.verify_state();
if verify_state == Some(VerifyState::Failed) {
return true;
}
}
}
>because right now, this will not only resync existing corrupt snapshots,
>but also ones that have been pruned locally, but not on the source
>(i.e., the other proposed "fixing" sync mode that syncs "missing"
>old snapshots, not just corrupt ones).
I'm too stupid to find the mail where this was mentioned/discussed, I'm
quite sure we said to just pull both, and then maybe separate them in a
future iteration/feature. But now that I think about it, the flag is named
`resync_corrupt`, so I'd expect it to only pull in the corrupt snapshots.
I actually agree with this change, it probably is also more performant
(reading backup_manifest twice is probably faster than pulling lots of
unneeded manifests from the remote).
>> if last_sync_time > dir.time {
>> already_synced_skip_info.update(dir.time);
>> return false;
>> @@ -566,7 +596,13 @@ async fn pull_group(
>> .source
>> .reader(source_namespace, &from_snapshot)
>> .await?;
>> - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
>> + let result = pull_snapshot_from(
>> + reader,
>> + &to_snapshot,
>> + downloaded_chunks.clone(),
>> + params.resync_corrupt,
>> + )
>> + .await;
>>
>> progress.done_snapshots = pos as u64 + 1;
>> info!("percentage done: {progress}");
* Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
2024-11-04 16:02 ` Gabriel Goller
@ 2024-11-05 7:20 ` Fabian Grünbichler
2024-11-05 10:39 ` Gabriel Goller
0 siblings, 1 reply; 8+ messages in thread
From: Fabian Grünbichler @ 2024-11-05 7:20 UTC (permalink / raw)
To: Gabriel Goller; +Cc: Proxmox Backup Server development discussion
> Gabriel Goller <g.goller@proxmox.com> wrote on 04.11.2024 at 17:02 CET:
>
>
> On 04.11.2024 12:51, Fabian Grünbichler wrote:
> >this doesn't really do what it says on the tin, see below.
> >
> >On October 18, 2024 11:09 am, Gabriel Goller wrote:
> >> @@ -528,6 +554,10 @@ async fn pull_group(
> >> .enumerate()
> >> .filter(|&(pos, ref dir)| {
> >> source_snapshots.insert(dir.time);
> >> + // If resync_corrupt is set, we go through all the remote snapshots
> >> + if params.resync_corrupt {
> >> + return true;
> >> + }
> >
> >alternatively, we could check the local manifest here, and only include
> >existing snapshots with a failed verification state, the last one and
> >new ones? that way, we'd get more meaningful progress stats as well..
>
> That's true, that would be cleaner. The downside is that we would have
> to open/parse the BackupManifest twice.
>
> I could write something like:
>
> if params.resync_corrupt {
> let local_dir = params.target.store.backup_dir(target_ns.clone(), dir.clone());
> if let Ok(dir) = local_dir {
> let verify_state = dir.verify_state();
> if verify_state == Some(VerifyState::Failed) {
> return true;
> }
> }
> }
>
> >because right now, this will not only resync existing corrupt snapshots,
> >but also ones that have been pruned locally, but not on the source
> >(i.e., the other proposed "fixing" sync mode that syncs "missing"
> >old snapshots, not just corrupt ones).
>
> I'm too stupid to find the mail where this was mentioned/discussed, I'm
> quite sure we said to just pull both, and then maybe separate them in a
> future iteration/feature. But now that I think about it, the flag is named
> `resync_corrupt`, so I'd expect it to only pull in the corrupt snapshots.
>
> I actually agree with this change, it probably is also more performant
> (reading backup_manifest twice is probably faster than pulling lots of
> unneeded manifests from the remote).
I think we do want both of these, but they are not a single feature, since the "sync missing snapshots" option would effectively undo any pruning you do on the target side. Their implementation is of course somewhat intertwined, as they both affect the snapshot selection logic. They might even be both enabled and combined with remove_vanished to have a sort of 1:1 repairing replication going on (with all the pros and cons/danger that comes with it).
For syncing missing snapshots we might also want to require additional privileges to prevent less-privileged users from stuffing backup groups (i.e., at least require Prune privs?).
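just to make that concrete, such a check could probably sit next to the
existing privilege checks on the pull/sync API side, roughly like this
(illustrative only - the option name is made up, and check_privs/acl_path
are assumed to behave like in the current pull code):

// sketch: gate a hypothetical "sync missing snapshots" mode behind Prune
// on the target datastore/namespace, on top of the privileges a sync
// already requires
if sync_missing_snapshots {
    let user_info = CachedUserInfo::new()?;
    user_info.check_privs(
        &auth_id,
        &ns.acl_path(&store),
        PRIV_DATASTORE_PRUNE,
        false,
    )?;
}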
* Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
2024-11-05 7:20 ` Fabian Grünbichler
@ 2024-11-05 10:39 ` Gabriel Goller
0 siblings, 0 replies; 8+ messages in thread
From: Gabriel Goller @ 2024-11-05 10:39 UTC (permalink / raw)
To: Fabian Grünbichler; +Cc: Proxmox Backup Server development discussion
I agree!
Sent a v3!