public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job
@ 2024-11-22  9:39 Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 1/4] snapshot: add helper function to retrieve verify_state Gabriel Goller
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22  9:39 UTC (permalink / raw)
  To: pbs-devel

Add an option `resync-corrupt` that resyncs corrupt snapshots when running
a sync job. This option checks whether the local snapshot failed its last
verification and, if it did, overwrites the local snapshot with the
remote one.

This is quite useful, as we currently don't have an option to "fix" 
broken chunks/snapshots in any way, even if a healthy version is on 
another (e.g. offsite) instance.

Two things to note: this has a slight performance penalty, as all the
local manifests have to be read, and a verification job has to be run
beforehand; otherwise we cannot know whether a snapshot is healthy.

Note: This series was originally written by Shannon! I just picked it
up, rebased it, and addressed the obvious review comments on the last version.

Changelog v5 (thanks @Fabian):
 - rebase
 - don't remove parsing error in verify_state helper
 - add error logs on failures

Changelog v4 (thanks @Fabian):
 - make verify_state bubble up errors
 - call verify_state helper everywhere we need the verify_state
 - resync broken manifests (so resync when load_manifest fails)

Changelog v3 (thanks @Fabian):
 - filter out snapshots earlier in the pull_group function
 - move verify_state to BackupManifest and fixed invocations
 - reverted verify_state Option -> Result state (it doesn't matter if we get an
   error; that happens quite often, e.g. for new backups)
 - removed some unnecessary log lines
 - removed some unnecessary imports and modifications
 - rebase to current master

Changelog v2 (thanks @Thomas):
 - order git trailers
 - adjusted schema description to include broken indexes
 - change verify_state to return a Result<_,_>
 - print error if verify_state is not able to read the state
 - update docs on pull_snapshot function
 - simplify logic by combining flags
 - move log line out of loop to only print once that we resync the snapshot

Changelog since RFC (Shannon's work):
 - rename option from deep-sync to resync-corrupt
 - rebase on latest master (and change implementation details, as a 
     lot has changed around sync-jobs)

proxmox-backup:

Gabriel Goller (4):
  snapshot: add helper function to retrieve verify_state
  fix #3786: api: add resync-corrupt option to sync jobs
  fix #3786: ui/cli: add resync-corrupt option on sync-jobs
  fix #3786: docs: add resync-corrupt option to sync-job

 docs/managing-remotes.rst         |  6 +++
 pbs-api-types/src/jobs.rs         | 10 +++++
 pbs-datastore/src/backup_info.rs  |  9 +++-
 pbs-datastore/src/manifest.rs     | 14 +++++-
 src/api2/admin/datastore.rs       | 16 +++----
 src/api2/backup/mod.rs            | 18 +++++---
 src/api2/config/sync.rs           |  4 ++
 src/api2/pull.rs                  |  9 +++-
 src/backup/verify.rs              | 13 +++---
 src/bin/proxmox-backup-manager.rs | 16 ++++++-
 src/server/pull.rs                | 72 ++++++++++++++++++++++++-------
 www/window/SyncJobEdit.js         | 11 +++++
 12 files changed, 155 insertions(+), 43 deletions(-)


Summary over all repositories:
  12 files changed, 155 insertions(+), 43 deletions(-)

-- 
Generated by git-murpp 0.7.1


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel



* [pbs-devel] [PATCH proxmox-backup v5 1/4] snapshot: add helper function to retrieve verify_state
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
@ 2024-11-22  9:39 ` Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 2/4] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22  9:39 UTC (permalink / raw)
  To: pbs-devel

Add helper functions to retrieve the verify_state from the manifest of a
snapshot, and replace all the manual "verify_state" parsing with the
helper function.
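To illustrate the three outcomes the helper distinguishes, here is a minimal std-only Rust sketch. It is a simplified model, not the code in this patch: the real helper deserializes a `SnapshotVerifyState` from the manifest's `unprotected["verify_state"]` JSON via serde, whereas here the entry is assumed to be a plain string. The `parse_verify_state` name and the `"ok"`/`"failed"` spellings are assumptions for illustration.

```rust
/// Simplified stand-in for `pbs_api_types::VerifyState` (assumption: two states only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyState {
    Ok,
    Failed,
}

/// Hypothetical model of the helper: `raw` is the manifest's `verify_state`
/// entry, or `None` if the key is absent (snapshot never verified).
pub fn parse_verify_state(raw: Option<&str>) -> Result<Option<VerifyState>, String> {
    match raw {
        // No entry: not verified yet, which is not an error.
        None => Ok(None),
        Some("ok") => Ok(Some(VerifyState::Ok)),
        Some("failed") => Ok(Some(VerifyState::Failed)),
        // Anything else is a parse error that the caller can log or act on.
        Some(other) => Err(format!("unable to parse verify state '{other}'")),
    }
}
```

Keeping "not verified" (`Ok(None)`) separate from "unparsable" (`Err`) lets each call site choose its own fallback, as the patch does in `list_snapshots_blocking`, `upgrade_to_backup_protocol` and `verify_filter`.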

Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 pbs-datastore/src/backup_info.rs |  9 +++++++--
 pbs-datastore/src/manifest.rs    | 14 +++++++++++++-
 src/api2/admin/datastore.rs      | 16 +++++++---------
 src/api2/backup/mod.rs           | 18 +++++++++++-------
 src/backup/verify.rs             | 13 ++++++++-----
 5 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index 62d12b1183df..a581d75757b4 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -8,8 +8,8 @@ use anyhow::{bail, format_err, Error};
 use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
 
 use pbs_api_types::{
-    Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX,
-    BACKUP_FILE_REGEX,
+    Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
+    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
 };
 use pbs_config::{open_backup_lockfile, BackupLockGuard};
 
@@ -555,6 +555,11 @@ impl BackupDir {
 
         Ok(())
     }
+
+    /// Load the verify state from the manifest.
+    pub fn verify_state(&self) -> Result<Option<VerifyState>, anyhow::Error> {
+        Ok(self.load_manifest()?.0.verify_state()?.map(|svs| svs.state))
+    }
 }
 
 impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
diff --git a/pbs-datastore/src/manifest.rs b/pbs-datastore/src/manifest.rs
index c3df014272a0..3013fab97221 100644
--- a/pbs-datastore/src/manifest.rs
+++ b/pbs-datastore/src/manifest.rs
@@ -5,7 +5,7 @@ use anyhow::{bail, format_err, Error};
 use serde::{Deserialize, Serialize};
 use serde_json::{json, Value};
 
-use pbs_api_types::{BackupType, CryptMode, Fingerprint};
+use pbs_api_types::{BackupType, CryptMode, Fingerprint, SnapshotVerifyState};
 use pbs_tools::crypt_config::CryptConfig;
 
 pub const MANIFEST_BLOB_NAME: &str = "index.json.blob";
@@ -242,6 +242,18 @@ impl BackupManifest {
         let manifest: BackupManifest = serde_json::from_value(json)?;
         Ok(manifest)
     }
+
+    /// Get the verify state of the snapshot
+    ///
+    /// Note: New snapshots, which have not been verified yet, do not have a status and this
+    /// function will return `Ok(None)`.
+    pub fn verify_state(&self) -> Result<Option<SnapshotVerifyState>, anyhow::Error> {
+        let verify = self.unprotected["verify_state"].clone();
+        if verify.is_null() {
+            return Ok(None);
+        }
+        Ok(Some(serde_json::from_value::<SnapshotVerifyState>(verify)?))
+    }
 }
 
 impl TryFrom<super::DataBlob> for BackupManifest {
diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index 99b579f02c50..3624dba41199 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -537,15 +537,13 @@ unsafe fn list_snapshots_blocking(
                     }
                 };
 
-                let verification = manifest.unprotected["verify_state"].clone();
-                let verification: Option<SnapshotVerifyState> =
-                    match serde_json::from_value(verification) {
-                        Ok(verify) => verify,
-                        Err(err) => {
-                            eprintln!("error parsing verification state : '{}'", err);
-                            None
-                        }
-                    };
+                let verification: Option<SnapshotVerifyState> = match manifest.verify_state() {
+                    Ok(verify) => verify,
+                    Err(err) => {
+                        eprintln!("error parsing verification state : '{}'", err);
+                        None
+                    }
+                };
 
                 let size = Some(files.iter().map(|x| x.size.unwrap_or(0)).sum());
 
diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
index ea0d0292ec58..a735768b0f83 100644
--- a/src/api2/backup/mod.rs
+++ b/src/api2/backup/mod.rs
@@ -8,6 +8,7 @@ use hyper::http::request::Parts;
 use hyper::{Body, Request, Response, StatusCode};
 use serde::Deserialize;
 use serde_json::{json, Value};
+use tracing::warn;
 
 use proxmox_rest_server::{H2Service, WorkerTask};
 use proxmox_router::{http_err, list_subdirs_api_method};
@@ -19,9 +20,9 @@ use proxmox_sortable_macro::sortable;
 use proxmox_sys::fs::lock_dir_noblock_shared;
 
 use pbs_api_types::{
-    Authid, BackupNamespace, BackupType, Operation, SnapshotVerifyState, VerifyState,
-    BACKUP_ARCHIVE_NAME_SCHEMA, BACKUP_ID_SCHEMA, BACKUP_NAMESPACE_SCHEMA, BACKUP_TIME_SCHEMA,
-    BACKUP_TYPE_SCHEMA, CHUNK_DIGEST_SCHEMA, DATASTORE_SCHEMA, PRIV_DATASTORE_BACKUP,
+    Authid, BackupNamespace, BackupType, Operation, VerifyState, BACKUP_ARCHIVE_NAME_SCHEMA,
+    BACKUP_ID_SCHEMA, BACKUP_NAMESPACE_SCHEMA, BACKUP_TIME_SCHEMA, BACKUP_TYPE_SCHEMA,
+    CHUNK_DIGEST_SCHEMA, DATASTORE_SCHEMA, PRIV_DATASTORE_BACKUP,
 };
 use pbs_config::CachedUserInfo;
 use pbs_datastore::index::IndexFile;
@@ -159,15 +160,18 @@ fn upgrade_to_backup_protocol(
             let info = backup_group.last_backup(true).unwrap_or(None);
             if let Some(info) = info {
                 let (manifest, _) = info.backup_dir.load_manifest()?;
-                let verify = manifest.unprotected["verify_state"].clone();
-                match serde_json::from_value::<SnapshotVerifyState>(verify) {
-                    Ok(verify) => match verify.state {
+                match manifest.verify_state() {
+                    Ok(Some(verify)) => match verify.state {
                         VerifyState::Ok => Some(info),
                         VerifyState::Failed => None,
                     },
-                    Err(_) => {
+                    Ok(None) => {
                         // no verify state found, treat as valid
                         Some(info)
+                    }
+                    Err(err) => {
+                        warn!("error parsing the snapshot manifest: {err:#}");
+                        Some(info)
                     }
                 }
             } else {
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 6ef7e8eb3ebb..c1abe69a4fde 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -5,7 +5,7 @@ use std::time::Instant;
 
 use anyhow::{bail, format_err, Error};
 use nix::dir::Dir;
-use tracing::{error, info};
+use tracing::{error, info, warn};
 
 use proxmox_sys::fs::lock_dir_noblock_shared;
 use proxmox_worker_task::WorkerTaskContext;
@@ -553,10 +553,13 @@ pub fn verify_filter(
         return true;
     }
 
-    let raw_verify_state = manifest.unprotected["verify_state"].clone();
-    match serde_json::from_value::<SnapshotVerifyState>(raw_verify_state) {
-        Err(_) => true, // no last verification, always include
-        Ok(last_verify) => {
+    match manifest.verify_state() {
+        Err(err) => {
+            warn!("error reading manifest: {err:#}");
+            true
+        }
+        Ok(None) => true, // no last verification, always include
+        Ok(Some(last_verify)) => {
             match outdated_after {
                 None => false, // never re-verify if ignored and no max age
                 Some(max_age) => {
-- 
2.39.5


* [pbs-devel] [PATCH proxmox-backup v5 2/4] fix #3786: api: add resync-corrupt option to sync jobs
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 1/4] snapshot: add helper function to retrieve verify_state Gabriel Goller
@ 2024-11-22  9:39 ` Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 3/4] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22  9:39 UTC (permalink / raw)
  To: pbs-devel

This option allows us to "fix" corrupt snapshots (and/or their chunks)
by pulling them from another remote. When traversing the remote
snapshots, we check whether each one exists locally and, if it does,
whether its last verification failed. If the local snapshot is broken and
the `resync-corrupt` option is turned on, we pull in the remote snapshot,
overwriting the local one.

This is very useful and has been requested a lot, as there is currently
no way to "fix" corrupt chunks/snapshots even if the user has a healthy
version of it on their offsite instance.
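The per-snapshot decision made in `pull_group` can be sketched as a pure function. This is a hypothetical, simplified model: `Decision` and `decide` are made-up names, `already_synced` and `outside_transfer_last` stand in for the `last_sync_time` and `transfer_last` cutoff checks, and `local_state` is `None` when no local `BackupDir` could be constructed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyState {
    Ok,
    Failed,
}

/// Outcome for one remote snapshot (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Skip,
    Pull { corrupt: bool },
}

/// Simplified model of the filter closure in `pull_group`.
pub fn decide(
    resync_corrupt: bool,
    local_state: Option<Result<Option<VerifyState>, String>>,
    already_synced: bool,
    outside_transfer_last: bool,
) -> Decision {
    if resync_corrupt {
        match local_state {
            // Last verification failed: pull again, overwriting the local copy.
            Some(Ok(Some(VerifyState::Failed))) => return Decision::Pull { corrupt: true },
            // Local manifest could not be loaded or parsed: also resync.
            Some(Err(_)) => return Decision::Pull { corrupt: true },
            // Verified ok, never verified, or no local dir: use the normal rules.
            _ => {}
        }
    }
    if already_synced || outside_transfer_last {
        return Decision::Skip;
    }
    Decision::Pull { corrupt: false }
}
```

As in the patch, a corrupt snapshot bypasses the time-based skips, and an error while reading the local manifest is treated the same way as a failed verification.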

Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 pbs-api-types/src/jobs.rs | 10 ++++++
 src/api2/config/sync.rs   |  4 +++
 src/api2/pull.rs          |  9 ++++-
 src/server/pull.rs        | 72 ++++++++++++++++++++++++++++++---------
 4 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
index e8056beb00cb..52520811b560 100644
--- a/pbs-api-types/src/jobs.rs
+++ b/pbs-api-types/src/jobs.rs
@@ -536,6 +536,10 @@ impl SyncDirection {
     }
 }
 
+pub const RESYNC_CORRUPT_SCHEMA: Schema =
+    BooleanSchema::new("If the verification failed for a local snapshot, try to pull it again.")
+        .schema();
+
 #[api(
     properties: {
         id: {
@@ -590,6 +594,10 @@ impl SyncDirection {
             schema: TRANSFER_LAST_SCHEMA,
             optional: true,
         },
+        "resync-corrupt": {
+            schema: RESYNC_CORRUPT_SCHEMA,
+            optional: true,
+        }
     }
 )]
 #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
@@ -623,6 +631,8 @@ pub struct SyncJobConfig {
     pub limit: RateLimitConfig,
     #[serde(skip_serializing_if = "Option::is_none")]
     pub transfer_last: Option<usize>,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub resync_corrupt: Option<bool>,
 }
 
 impl SyncJobConfig {
diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
index 78eb7320566b..7ff6cae029d1 100644
--- a/src/api2/config/sync.rs
+++ b/src/api2/config/sync.rs
@@ -471,6 +471,9 @@ pub fn update_sync_job(
     if let Some(transfer_last) = update.transfer_last {
         data.transfer_last = Some(transfer_last);
     }
+    if let Some(resync_corrupt) = update.resync_corrupt {
+        data.resync_corrupt = Some(resync_corrupt);
+    }
 
     if update.limit.rate_in.is_some() {
         data.limit.rate_in = update.limit.rate_in;
@@ -629,6 +632,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator
         ns: None,
         owner: Some(write_auth_id.clone()),
         comment: None,
+        resync_corrupt: None,
         remove_vanished: None,
         max_depth: None,
         group_filter: None,
diff --git a/src/api2/pull.rs b/src/api2/pull.rs
index d039dab59c65..d8ed1a7347b5 100644
--- a/src/api2/pull.rs
+++ b/src/api2/pull.rs
@@ -10,7 +10,7 @@ use pbs_api_types::{
     Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA,
     GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP,
     PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA,
-    TRANSFER_LAST_SCHEMA,
+    RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
 };
 use pbs_config::CachedUserInfo;
 use proxmox_rest_server::WorkerTask;
@@ -87,6 +87,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
             sync_job.group_filter.clone(),
             sync_job.limit.clone(),
             sync_job.transfer_last,
+            sync_job.resync_corrupt,
         )
     }
 }
@@ -132,6 +133,10 @@ impl TryFrom<&SyncJobConfig> for PullParameters {
                 schema: TRANSFER_LAST_SCHEMA,
                 optional: true,
             },
+            "resync-corrupt": {
+                schema: RESYNC_CORRUPT_SCHEMA,
+                optional: true,
+            },
         },
     },
     access: {
@@ -156,6 +161,7 @@ async fn pull(
     group_filter: Option<Vec<GroupFilter>>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    resync_corrupt: Option<bool>,
     rpcenv: &mut dyn RpcEnvironment,
 ) -> Result<String, Error> {
     let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?;
@@ -193,6 +199,7 @@ async fn pull(
         group_filter,
         limit,
         transfer_last,
+        resync_corrupt,
     )?;
 
     // fixme: set to_stdout to false?
diff --git a/src/server/pull.rs b/src/server/pull.rs
index 08b55956ce52..40d872d2487c 100644
--- a/src/server/pull.rs
+++ b/src/server/pull.rs
@@ -12,7 +12,8 @@ use tracing::info;
 
 use pbs_api_types::{
     print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation,
-    RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP,
+    RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT,
+    PRIV_DATASTORE_BACKUP,
 };
 use pbs_client::BackupRepository;
 use pbs_config::CachedUserInfo;
@@ -55,6 +56,8 @@ pub(crate) struct PullParameters {
     group_filter: Vec<GroupFilter>,
     /// How many snapshots should be transferred at most (taking the newest N snapshots)
     transfer_last: Option<usize>,
+    /// Whether to re-sync corrupted snapshots
+    resync_corrupt: bool,
 }
 
 impl PullParameters {
@@ -72,12 +75,14 @@ impl PullParameters {
         group_filter: Option<Vec<GroupFilter>>,
         limit: RateLimitConfig,
         transfer_last: Option<usize>,
+        resync_corrupt: Option<bool>,
     ) -> Result<Self, Error> {
         if let Some(max_depth) = max_depth {
             ns.check_max_depth(max_depth)?;
             remote_ns.check_max_depth(max_depth)?;
         };
         let remove_vanished = remove_vanished.unwrap_or(false);
+        let resync_corrupt = resync_corrupt.unwrap_or(false);
 
         let source: Arc<dyn SyncSource> = if let Some(remote) = remote {
             let (remote_config, _digest) = pbs_config::remote::config()?;
@@ -116,6 +121,7 @@ impl PullParameters {
             max_depth,
             group_filter,
             transfer_last,
+            resync_corrupt,
         })
     }
 }
@@ -323,7 +329,7 @@ async fn pull_single_archive<'a>(
 ///
 /// Pulling a snapshot consists of the following steps:
 /// - (Re)download the manifest
-/// -- if it matches, only download log and treat snapshot as already synced
+/// -- if it matches and is not corrupt, only download log and treat snapshot as already synced
 /// - Iterate over referenced files
 /// -- if file already exists, verify contents
 /// -- if not, pull it from the remote
@@ -332,6 +338,7 @@ async fn pull_snapshot<'a>(
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+    corrupt: bool,
 ) -> Result<SyncStats, Error> {
     let mut sync_stats = SyncStats::default();
     let mut manifest_name = snapshot.full_path();
@@ -352,7 +359,7 @@ async fn pull_snapshot<'a>(
         return Ok(sync_stats);
     }
 
-    if manifest_name.exists() {
+    if manifest_name.exists() && !corrupt {
         let manifest_blob = proxmox_lang::try_block!({
             let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
                 format_err!("unable to open local manifest {manifest_name:?} - {err}")
@@ -381,7 +388,7 @@ async fn pull_snapshot<'a>(
         let mut path = snapshot.full_path();
         path.push(&item.filename);
 
-        if path.exists() {
+        if !corrupt && path.exists() {
             match ArchiveType::from_path(&item.filename)? {
                 ArchiveType::DynamicIndex => {
                     let index = DynamicIndexReader::open(&path)?;
@@ -443,6 +450,7 @@ async fn pull_snapshot_from<'a>(
     reader: Arc<dyn SyncSourceReader + 'a>,
     snapshot: &'a pbs_datastore::BackupDir,
     downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
+    corrupt: bool,
 ) -> Result<SyncStats, Error> {
     let (_path, is_new, _snap_lock) = snapshot
         .datastore()
@@ -451,7 +459,8 @@ async fn pull_snapshot_from<'a>(
     let sync_stats = if is_new {
         info!("sync snapshot {}", snapshot.dir());
 
-        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
+        // this snapshot is new, so it can never be corrupt
+        match pull_snapshot(reader, snapshot, downloaded_chunks, false).await {
             Err(err) => {
                 if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
                     snapshot.backup_ns(),
@@ -468,8 +477,12 @@ async fn pull_snapshot_from<'a>(
             }
         }
     } else {
-        info!("re-sync snapshot {}", snapshot.dir());
-        pull_snapshot(reader, snapshot, downloaded_chunks).await?
+        if corrupt {
+            info!("re-sync snapshot {} due to corruption", snapshot.dir());
+        } else {
+            info!("re-sync snapshot {}", snapshot.dir());
+        }
+        pull_snapshot(reader, snapshot, downloaded_chunks, corrupt).await?
     };
 
     Ok(sync_stats)
@@ -523,26 +536,52 @@ async fn pull_group(
         .last_successful_backup(&target_ns, group)?
         .unwrap_or(i64::MIN);
 
-    let list: Vec<BackupDir> = raw_list
+    // Filter remote BackupDirs to include in pull
+    // Also stores if the snapshot is corrupt (verification job failed)
+    let list: Vec<(BackupDir, bool)> = raw_list
         .into_iter()
         .enumerate()
-        .filter(|&(pos, ref dir)| {
+        .filter_map(|(pos, dir)| {
             source_snapshots.insert(dir.time);
+            // If resync_corrupt is set, check if the corresponding local snapshot failed
+            // verification
+            if params.resync_corrupt {
+                let local_dir = params
+                    .target
+                    .store
+                    .backup_dir(target_ns.clone(), dir.clone());
+                if let Ok(local_dir) = local_dir {
+                    match local_dir.verify_state() {
+                        Ok(Some(state)) => {
+                            if state == VerifyState::Failed {
+                                return Some((dir, true));
+                            }
+                        }
+                        Ok(None) => {
+                            // The verify_state item was not found in the manifest, this means the
+                            // snapshot is new.
+                        }
+                        Err(_) => {
+                            // There was an error loading the manifest, probably better if we
+                            // resync.
+                            return Some((dir, true));
+                        }
+                    }
+                }
+            }
             // Note: the snapshot represented by `last_sync_time` might be missing its backup log
             // or post-backup verification state if those were not yet available during the last
             // sync run, always resync it
             if last_sync_time > dir.time {
                 already_synced_skip_info.update(dir.time);
-                return false;
+                return None;
             }
-
             if pos < cutoff && last_sync_time != dir.time {
                 transfer_last_skip_info.update(dir.time);
-                return false;
+                return None;
             }
-            true
+            Some((dir, false))
         })
-        .map(|(_, dir)| dir)
         .collect();
 
     if already_synced_skip_info.count > 0 {
@@ -561,7 +600,7 @@ async fn pull_group(
 
     let mut sync_stats = SyncStats::default();
 
-    for (pos, from_snapshot) in list.into_iter().enumerate() {
+    for (pos, (from_snapshot, corrupt)) in list.into_iter().enumerate() {
         let to_snapshot = params
             .target
             .store
@@ -571,7 +610,8 @@ async fn pull_group(
             .source
             .reader(source_namespace, &from_snapshot)
             .await?;
-        let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
+        let result =
+            pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone(), corrupt).await;
 
         progress.done_snapshots = pos as u64 + 1;
         info!("percentage done: {progress}");
-- 
2.39.5
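The effect of the new `corrupt` flag on `pull_snapshot` can be modelled as follows. This is a hypothetical, simplified sketch (`FileAction` and `plan` are made-up names): the real code compares the local and remote manifest blobs and verifies existing index files before deciding whether to download them.

```rust
/// What happens to one archive file listed in the remote manifest (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileAction {
    /// Verify the existing local file; download only if that check fails.
    VerifyExisting,
    /// Download the file from the source unconditionally.
    Download,
}

/// `files` pairs each archive name with whether it already exists locally.
/// Returns `None` when the manifest is unchanged and only the log is fetched.
pub fn plan<'a>(
    manifest_matches: bool,
    files: &[(&'a str, bool)],
    corrupt: bool,
) -> Option<Vec<(&'a str, FileAction)>> {
    // Shortcut: an unchanged manifest means "already synced", unless corrupt.
    if manifest_matches && !corrupt {
        return None;
    }
    let actions: Vec<(&'a str, FileAction)> = files
        .iter()
        .map(|&(name, exists)| {
            // A corrupt snapshot never trusts its existing local files.
            let action = if !corrupt && exists {
                FileAction::VerifyExisting
            } else {
                FileAction::Download
            };
            (name, action)
        })
        .collect();
    Some(actions)
}
```

For new snapshots, `pull_snapshot_from` always passes `corrupt = false`, since nothing exists locally yet.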


* [pbs-devel] [PATCH proxmox-backup v5 3/4] fix #3786: ui/cli: add resync-corrupt option on sync-jobs
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 1/4] snapshot: add helper function to retrieve verify_state Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 2/4] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
@ 2024-11-22  9:39 ` Gabriel Goller
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 4/4] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22  9:39 UTC (permalink / raw)
  To: pbs-devel

Add the `resync-corrupt` option to the UI and the
`proxmox-backup-manager` CLI. It is listed in the `Advanced` section
because it slows the sync job down and is useless if no verification
job was run beforehand.

Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 src/bin/proxmox-backup-manager.rs | 16 ++++++++++++++--
 www/window/SyncJobEdit.js         | 11 +++++++++++
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs
index d887dc1d50a1..02ca0d028225 100644
--- a/src/bin/proxmox-backup-manager.rs
+++ b/src/bin/proxmox-backup-manager.rs
@@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component;
 use pbs_api_types::{
     BackupNamespace, GroupFilter, RateLimitConfig, SyncDirection, SyncJobConfig, DATASTORE_SCHEMA,
     GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA,
-    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA,
-    VERIFICATION_OUTDATED_AFTER_SCHEMA,
+    REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA,
+    UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA,
 };
 use pbs_client::{display_task_log, view_task_result};
 use pbs_config::sync;
@@ -307,6 +307,7 @@ async fn sync_datastore(
     group_filter: Option<Vec<GroupFilter>>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    resync_corrupt: Option<bool>,
     param: Value,
     sync_direction: SyncDirection,
 ) -> Result<Value, Error> {
@@ -343,6 +344,10 @@ async fn sync_datastore(
         args["transfer-last"] = json!(transfer_last)
     }
 
+    if let Some(resync) = resync_corrupt {
+        args["resync-corrupt"] = Value::from(resync);
+    }
+
     let mut limit_json = json!(limit);
     let limit_map = limit_json
         .as_object_mut()
@@ -405,6 +410,10 @@ async fn sync_datastore(
                 schema: TRANSFER_LAST_SCHEMA,
                 optional: true,
             },
+            "resync-corrupt": {
+                schema: RESYNC_CORRUPT_SCHEMA,
+                optional: true,
+            },
         }
    }
 )]
@@ -421,6 +430,7 @@ async fn pull_datastore(
     group_filter: Option<Vec<GroupFilter>>,
     limit: RateLimitConfig,
     transfer_last: Option<usize>,
+    resync_corrupt: Option<bool>,
     param: Value,
 ) -> Result<Value, Error> {
     sync_datastore(
@@ -434,6 +444,7 @@ async fn pull_datastore(
         group_filter,
         limit,
         transfer_last,
+        resync_corrupt,
         param,
         SyncDirection::Pull,
     )
@@ -513,6 +524,7 @@ async fn push_datastore(
         group_filter,
         limit,
         transfer_last,
+        None,
         param,
         SyncDirection::Push,
     )
diff --git a/www/window/SyncJobEdit.js b/www/window/SyncJobEdit.js
index 0e648e7b3e50..03f61bee6494 100644
--- a/www/window/SyncJobEdit.js
+++ b/www/window/SyncJobEdit.js
@@ -358,6 +358,17 @@ Ext.define('PBS.window.SyncJobEdit', {
 			    deleteEmpty: '{!isCreate}',
 			},
 		    },
+		    {
+			fieldLabel: gettext('Resync corrupt snapshots'),
+			xtype: 'proxmoxcheckbox',
+			name: 'resync-corrupt',
+			autoEl: {
+			    tag: 'div',
+			    'data-qtip': gettext('Re-sync snapshots whose verification failed.'),
+			},
+			uncheckedValue: false,
+			value: false,
+		    },
 		],
 	    },
 	    {
-- 
2.39.5


* [pbs-devel] [PATCH proxmox-backup v5 4/4] fix #3786: docs: add resync-corrupt option to sync-job
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
                   ` (2 preceding siblings ...)
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 3/4] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
@ 2024-11-22  9:39 ` Gabriel Goller
  2024-11-22 10:37 ` [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Fabian Grünbichler
  2024-11-22 12:15 ` Gabriel Goller
  5 siblings, 0 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22  9:39 UTC (permalink / raw)
  To: pbs-devel

Add a short section explaining the `resync-corrupt` option on the
sync job.

Originally-by: Shannon Sterz <s.sterz@proxmox.com>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 docs/managing-remotes.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/managing-remotes.rst b/docs/managing-remotes.rst
index a7fd5143d236..4a78a9310fa5 100644
--- a/docs/managing-remotes.rst
+++ b/docs/managing-remotes.rst
@@ -135,6 +135,12 @@ For mixing include and exclude filter, following rules apply:
 
 .. note:: The ``protected`` flag of remote backup snapshots will not be synced.
 
+Enabling the advanced option 'resync-corrupt' will re-sync all snapshots that have
+failed to verify during the last :ref:`maintenance_verification`. Hence, a verification
+job needs to be run before a sync job with 'resync-corrupt' can be carried out. Be aware
+that a sync job with 'resync-corrupt' enabled needs to check the manifests of all
+snapshots in a datastore and might take much longer than a regular sync job.
+
 Namespace Support
 ^^^^^^^^^^^^^^^^^
 
-- 
2.39.5


* Re: [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
                   ` (3 preceding siblings ...)
  2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 4/4] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
@ 2024-11-22 10:37 ` Fabian Grünbichler
  2024-11-22 12:15 ` Gabriel Goller
  5 siblings, 0 replies; 7+ messages in thread
From: Fabian Grünbichler @ 2024-11-22 10:37 UTC (permalink / raw)
  To: Gabriel Goller, pbs-devel

w.r.t. the off-list discussion: I think resync-corrupt is okay as a standalone
option; it matches the others like remove_vanished.

But I noticed another thing that requires some more changes: we need to only
allow resync-corrupt for pull syncs, not for push ones (for now; it's not
impossible to implement it for push as well, but it requires some backend
changes and thoughts about the privilege implications).
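
A minimal sketch of the restriction described above, assuming a hypothetical `SyncDirection` enum and a hypothetical `check_resync_corrupt` validation helper (neither name comes from the series, and the error text is illustrative only): the option is accepted for pull jobs and rejected for push jobs, while push jobs without the option stay valid.

```rust
/// Hypothetical direction of a sync job.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncDirection {
    Pull,
    Push,
}

/// Hypothetical config check: reject `resync-corrupt` on push sync jobs.
pub fn check_resync_corrupt(direction: SyncDirection, resync_corrupt: bool) -> Result<(), String> {
    if resync_corrupt && direction == SyncDirection::Push {
        return Err("'resync-corrupt' is only supported for pull sync jobs".to_string());
    }
    Ok(())
}
```

Doing the check when the job config is created or updated (rather than when the job runs) would surface the error to the user immediately; where exactly it belongs is left open here.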






* Re: [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job
  2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
                   ` (4 preceding siblings ...)
  2024-11-22 10:37 ` [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Fabian Grünbichler
@ 2024-11-22 12:15 ` Gabriel Goller
  5 siblings, 0 replies; 7+ messages in thread
From: Gabriel Goller @ 2024-11-22 12:15 UTC (permalink / raw)
  To: pbs-devel

Sent a new version!





Thread overview: 7+ messages
2024-11-22  9:39 [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 1/4] snapshot: add helper function to retrieve verify_state Gabriel Goller
2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 2/4] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 3/4] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
2024-11-22  9:39 ` [pbs-devel] [PATCH proxmox-backup v5 4/4] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
2024-11-22 10:37 ` [pbs-devel] [PATCH proxmox-backup v5 0/4] fix #3786: resync corrupt chunks in sync-job Fabian Grünbichler
2024-11-22 12:15 ` Gabriel Goller
