* [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job @ 2024-10-15 13:18 Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw) To: pbs-devel Add an option `resync-corrupt` that resyncs corrupt snapshots when running sync-job. This option checks if the local snapshot failed the last verification and if it did, overwrites the local snapshot with the remote one. This is quite useful, as we currently don't have an option to "fix" broken chunks/snapshots in any way, even if a healthy version is on another (e.g. offsite) instance. Important things to note are also: this has a slight performance penalty, as all the manifests have to be looked through, and a verification job has to be run beforehand, otherwise we do not know if the snapshot is healthy. Note: This series was originally written by Shannon! I just picked it up, rebased, and fixed the obvious comments on the last series. Changelog since RFC (Shannon's work): - rename option from deep-sync to resync-corrupt - rebase on latest master (and change implementation details, as a lot has changed around sync-jobs) proxmox-backup: Gabriel Goller (3): fix #3786: api: add resync-corrupt option to sync jobs fix #3786: ui/cli: add resync-corrupt option on sync-jobs fix #3786: docs: add resync-corrupt option to sync-job docs/managing-remotes.rst | 6 +++ pbs-api-types/src/jobs.rs | 11 +++++ pbs-datastore/src/backup_info.rs | 13 +++++- src/api2/config/sync.rs | 4 ++ src/api2/pull.rs | 9 +++- src/bin/proxmox-backup-manager.rs | 13 +++++- src/server/pull.rs | 68 +++++++++++++++++++++---------- www/window/SyncJobEdit.js | 11 +++++ 8 files changed, 110 insertions(+), 25 deletions(-) Summary over all repositories: 8 files changed, 110 insertions(+), 25 deletions(-) -- Generated by git-murpp 0.7.1 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller @ 2024-10-15 13:18 ` Gabriel Goller 2024-10-17 6:15 ` Thomas Lamprecht 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller 2 siblings, 1 reply; 9+ messages in thread From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw) To: pbs-devel This option allows us to "fix" corrupt snapshots (and/or their chunks) by pulling them from another remote. When traversing the remote snapshots, we check if it exists locally, and if it is, we check if the last verification of it failed. If the local snapshot is broken and the `resync-corrupt` option is turned on, we pull in the remote snapshot, overwriting the local one. This is very useful and has been requested a lot, as there is currently no way to "fix" corrupt chunks/snapshots even if the user has a healthy version of it on their offsite instance. Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Originally-by: Shannon Sterz <s.sterz@proxmox.com> --- pbs-api-types/src/jobs.rs | 11 +++++ pbs-datastore/src/backup_info.rs | 13 +++++- src/api2/config/sync.rs | 4 ++ src/api2/pull.rs | 9 +++- src/bin/proxmox-backup-manager.rs | 4 +- src/server/pull.rs | 68 +++++++++++++++++++++---------- 6 files changed, 84 insertions(+), 25 deletions(-) diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs index 868702bc059e..a3cd68bf5afd 100644 --- a/pbs-api-types/src/jobs.rs +++ b/pbs-api-types/src/jobs.rs @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema = .minimum(1) .schema(); +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new( + "If the verification failed for a local snapshot, try to pull its chunks again.", +) +.schema(); + #[api( properties: { id: { @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema = schema: TRANSFER_LAST_SCHEMA, optional: true, }, + "resync-corrupt": { + schema: RESYNC_CORRUPT_SCHEMA, + optional: true, + } } )] #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)] @@ -585,6 +594,8 @@ pub struct SyncJobConfig { pub limit: RateLimitConfig, #[serde(skip_serializing_if = "Option::is_none")] pub transfer_last: Option<usize>, + #[serde(skip_serializing_if = "Option::is_none")] + pub resync_corrupt: Option<bool>, } impl SyncJobConfig { diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs index 414ec878d01a..2d283dbe5cd9 100644 --- a/pbs-datastore/src/backup_info.rs +++ b/pbs-datastore/src/backup_info.rs @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error}; use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions}; use pbs_api_types::{ - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX, + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState, + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX, }; use pbs_config::{open_backup_lockfile, BackupLockGuard}; @@ -583,6 +584,16 @@ impl BackupDir { Ok(()) } + + /// Load the verify state from the manifest. 
+ pub fn verify_state(&self) -> Option<VerifyState> { + self.load_manifest().ok().and_then(|(m, _)| { + let verify = m.unprotected["verify_state"].clone(); + serde_json::from_value::<SnapshotVerifyState>(verify) + .ok() + .map(|svs| svs.state) + }) + } } impl AsRef<pbs_api_types::BackupNamespace> for BackupDir { diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs index 6fdc69a9e645..fa9db92f3d11 100644 --- a/src/api2/config/sync.rs +++ b/src/api2/config/sync.rs @@ -368,6 +368,9 @@ pub fn update_sync_job( if let Some(transfer_last) = update.transfer_last { data.transfer_last = Some(transfer_last); } + if let Some(resync_corrupt) = update.resync_corrupt { + data.resync_corrupt = Some(resync_corrupt); + } if update.limit.rate_in.is_some() { data.limit.rate_in = update.limit.rate_in; @@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator ns: None, owner: Some(write_auth_id.clone()), comment: None, + resync_corrupt: None, remove_vanished: None, max_depth: None, group_filter: None, diff --git a/src/api2/pull.rs b/src/api2/pull.rs index e733c9839e3a..0d4be0e2d228 100644 --- a/src/api2/pull.rs +++ b/src/api2/pull.rs @@ -10,7 +10,7 @@ use pbs_api_types::{ Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA, GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP, PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, - TRANSFER_LAST_SCHEMA, + RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA, }; use pbs_config::CachedUserInfo; use proxmox_human_byte::HumanByte; @@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters { sync_job.group_filter.clone(), sync_job.limit.clone(), sync_job.transfer_last, + sync_job.resync_corrupt, ) } } @@ -240,6 +241,10 @@ pub fn do_sync_job( schema: TRANSFER_LAST_SCHEMA, optional: true, }, + "resync-corrupt": { + schema: RESYNC_CORRUPT_SCHEMA, + optional: true, + }, }, }, access: { @@ -264,6 +269,7 @@ async fn pull( group_filter: Option<Vec<GroupFilter>>, limit: RateLimitConfig, transfer_last: Option<usize>, + resync_corrupt: Option<bool>, rpcenv: &mut dyn RpcEnvironment, ) -> Result<String, Error> { let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?; @@ -301,6 +307,7 @@ async fn pull( group_filter, limit, transfer_last, + resync_corrupt, )?; // fixme: set to_stdout to false? 
diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs index 420e96665662..38a1cf0f5881 100644 --- a/src/bin/proxmox-backup-manager.rs +++ b/src/bin/proxmox-backup-manager.rs @@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component; use pbs_api_types::{ BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA, GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA, - REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA, - VERIFICATION_OUTDATED_AFTER_SCHEMA, + REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA, + UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA, }; use pbs_client::{display_task_log, view_task_result}; use pbs_config::sync; diff --git a/src/server/pull.rs b/src/server/pull.rs index 3117f7d2c960..67881f83b5e3 100644 --- a/src/server/pull.rs +++ b/src/server/pull.rs @@ -12,7 +12,8 @@ use tracing::info; use pbs_api_types::{ print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation, - RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP, + RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, + PRIV_DATASTORE_BACKUP, }; use pbs_client::BackupRepository; use pbs_config::CachedUserInfo; @@ -55,6 +56,8 @@ pub(crate) struct PullParameters { group_filter: Vec<GroupFilter>, /// How many snapshots should be transferred at most (taking the newest N snapshots) transfer_last: Option<usize>, + /// Whether to re-sync corrupted snapshots + resync_corrupt: bool, } impl PullParameters { @@ -72,12 +75,14 @@ impl PullParameters { group_filter: Option<Vec<GroupFilter>>, limit: RateLimitConfig, transfer_last: Option<usize>, + resync_corrupt: Option<bool>, ) -> Result<Self, Error> { if let Some(max_depth) = max_depth { ns.check_max_depth(max_depth)?; remote_ns.check_max_depth(max_depth)?; }; let remove_vanished = remove_vanished.unwrap_or(false); + let resync_corrupt = resync_corrupt.unwrap_or(false); let source: Arc<dyn SyncSource> = if let Some(remote) = remote { let (remote_config, _digest) = pbs_config::remote::config()?; @@ -116,6 +121,7 @@ impl PullParameters { max_depth, group_filter, transfer_last, + resync_corrupt, }) } } @@ -175,9 +181,10 @@ async fn pull_index_chunks<I: IndexFile>( target.cond_touch_chunk(&info.digest, false) })?; if chunk_exists { - //info!("chunk {} exists {}", pos, hex::encode(digest)); + //info!("chunk exists {}", hex::encode(info.digest)); return Ok::<_, Error>(()); } + //info!("sync {} chunk {}", pos, hex::encode(digest)); let chunk = chunk_reader.read_raw_chunk(&info.digest).await?; let raw_size = chunk.raw_size() as usize; @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>( reader: Arc<dyn SyncSourceReader + 'a>, snapshot: &'a pbs_datastore::BackupDir, downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, + resync_corrupt: bool, ) -> Result<SyncStats, Error> { let mut sync_stats = SyncStats::default(); let mut manifest_name = snapshot.full_path(); @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>( return Ok(sync_stats); } + let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed); + if manifest_name.exists() { let manifest_blob = proxmox_lang::try_block!({ let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| { @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>( format_err!("unable to read local manifest {manifest_name:?} - {err}") })?; - if manifest_blob.raw_data() == 
tmp_manifest_blob.raw_data() { + if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data()) + && (!resync_corrupt && !snapshot_is_corrupt) + { if !client_log_name.exists() { reader.try_download_client_log(&client_log_name).await?; }; @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>( let mut path = snapshot.full_path(); path.push(&item.filename); - if path.exists() { + if !(resync_corrupt && snapshot_is_corrupt) && path.exists() { match ArchiveType::from_path(&item.filename)? { ArchiveType::DynamicIndex => { let index = DynamicIndexReader::open(&path)?; @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>( } } + if resync_corrupt && snapshot_is_corrupt { + info!("pulling snapshot {} again", snapshot.dir()); + } + let stats = pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?; sync_stats.add(stats); @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>( reader: Arc<dyn SyncSourceReader + 'a>, snapshot: &'a pbs_datastore::BackupDir, downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, + resync_corrupt: bool, ) -> Result<SyncStats, Error> { let (_path, is_new, _snap_lock) = snapshot .datastore() @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>( let sync_stats = if is_new { info!("sync snapshot {}", snapshot.dir()); - match pull_snapshot(reader, snapshot, downloaded_chunks).await { + match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await { Err(err) => { if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir( snapshot.backup_ns(), @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>( } } else { info!("re-sync snapshot {}", snapshot.dir()); - pull_snapshot(reader, snapshot, downloaded_chunks).await? + pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await? }; Ok(sync_stats) @@ -528,21 +545,24 @@ async fn pull_group( .enumerate() .filter(|&(pos, ref dir)| { source_snapshots.insert(dir.time); - if last_sync_time > dir.time { - already_synced_skip_info.update(dir.time); - return false; - } else if already_synced_skip_info.count > 0 { - info!("{already_synced_skip_info}"); - already_synced_skip_info.reset(); - return true; - } + // If resync_corrupt is set, we go through all the remote snapshots + if !params.resync_corrupt { + if last_sync_time > dir.time { + already_synced_skip_info.update(dir.time); + return false; + } else if already_synced_skip_info.count > 0 { + info!("{already_synced_skip_info}"); + already_synced_skip_info.reset(); + return true; + } - if pos < cutoff && last_sync_time != dir.time { - transfer_last_skip_info.update(dir.time); - return false; - } else if transfer_last_skip_info.count > 0 { - info!("{transfer_last_skip_info}"); - transfer_last_skip_info.reset(); + if pos < cutoff && last_sync_time != dir.time { + transfer_last_skip_info.update(dir.time); + return false; + } else if transfer_last_skip_info.count > 0 { + info!("{transfer_last_skip_info}"); + transfer_last_skip_info.reset(); + } } true }) @@ -566,7 +586,13 @@ async fn pull_group( .source .reader(source_namespace, &from_snapshot) .await?; - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await; + let result = pull_snapshot_from( + reader, + &to_snapshot, + downloaded_chunks.clone(), + params.resync_corrupt, + ) + .await; progress.done_snapshots = pos as u64 + 1; info!("percentage done: {progress}"); -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ 
messages in thread
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller @ 2024-10-17 6:15 ` Thomas Lamprecht 2024-10-17 13:20 ` Gabriel Goller 0 siblings, 1 reply; 9+ messages in thread From: Thomas Lamprecht @ 2024-10-17 6:15 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Gabriel Goller Am 15/10/2024 um 15:18 schrieb Gabriel Goller: > This option allows us to "fix" corrupt snapshots (and/or their chunks) > by pulling them from another remote. When traversing the remote > snapshots, we check if it exists locally, and if it is, we check if the > last verification of it failed. If the local snapshot is broken and the > `resync-corrupt` option is turned on, we pull in the remote snapshot, > overwriting the local one. > > This is very useful and has been requested a lot, as there is currently > no way to "fix" corrupt chunks/snapshots even if the user has a healthy > version of it on their offsite instance. > > Signed-off-by: Gabriel Goller <g.goller@proxmox.com> > Originally-by: Shannon Sterz <s.sterz@proxmox.com> those two trailers probably should be switched to uphold causal ordering. In general, it would be _really_ nice to get some regression testing for sync jobs covering the selection with different matching and options. The lack of that doesn't have to block your/shannon's patch, but IMO it should be added rather sooner than later. Sync is becoming more and more complex, and the selection of snapshots is a very important and fundamental part of it. > --- > pbs-api-types/src/jobs.rs | 11 +++++ > pbs-datastore/src/backup_info.rs | 13 +++++- > src/api2/config/sync.rs | 4 ++ > src/api2/pull.rs | 9 +++- > src/bin/proxmox-backup-manager.rs | 4 +- > src/server/pull.rs | 68 +++++++++++++++++++++---------- > 6 files changed, 84 insertions(+), 25 deletions(-) > > diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs > index 868702bc059e..a3cd68bf5afd 100644 > --- a/pbs-api-types/src/jobs.rs > +++ b/pbs-api-types/src/jobs.rs > @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema = > .minimum(1) > .schema(); > > +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new( > + "If the verification failed for a local snapshot, try to pull its chunks again.", not only chunks, also its indexes (which might be also a source of a bad verification result I think) > +) > +.schema(); > + > #[api( > properties: { > id: { > @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema = > schema: TRANSFER_LAST_SCHEMA, > optional: true, > }, > + "resync-corrupt": { > + schema: RESYNC_CORRUPT_SCHEMA, > + optional: true, > + } > } > )] > #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)] > @@ -585,6 +594,8 @@ pub struct SyncJobConfig { > pub limit: RateLimitConfig, > #[serde(skip_serializing_if = "Option::is_none")] > pub transfer_last: Option<usize>, > + #[serde(skip_serializing_if = "Option::is_none")] > + pub resync_corrupt: Option<bool>, > } > > impl SyncJobConfig { > diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs > index 414ec878d01a..2d283dbe5cd9 100644 > --- a/pbs-datastore/src/backup_info.rs > +++ b/pbs-datastore/src/backup_info.rs > @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error}; > use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions}; > > use pbs_api_types::{ > - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, 
BACKUP_FILE_REGEX, > + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState, > + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX, > }; > use pbs_config::{open_backup_lockfile, BackupLockGuard}; > > @@ -583,6 +584,16 @@ impl BackupDir { > > Ok(()) > } > + > + /// Load the verify state from the manifest. > + pub fn verify_state(&self) -> Option<VerifyState> { > + self.load_manifest().ok().and_then(|(m, _)| { hmm, just ignoring errors here seems a bit odd, that might be a case where one might want to re-sync. So I'd prefer wrapping this a result, even if it's a bit tedious. > + let verify = m.unprotected["verify_state"].clone(); > + serde_json::from_value::<SnapshotVerifyState>(verify) > + .ok() > + .map(|svs| svs.state) > + }) > + } > } > > impl AsRef<pbs_api_types::BackupNamespace> for BackupDir { > diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs > index 6fdc69a9e645..fa9db92f3d11 100644 > --- a/src/api2/config/sync.rs > +++ b/src/api2/config/sync.rs > @@ -368,6 +368,9 @@ pub fn update_sync_job( > if let Some(transfer_last) = update.transfer_last { > data.transfer_last = Some(transfer_last); > } > + if let Some(resync_corrupt) = update.resync_corrupt { > + data.resync_corrupt = Some(resync_corrupt); > + } > > if update.limit.rate_in.is_some() { > data.limit.rate_in = update.limit.rate_in; > @@ -527,6 +530,7 @@ acl:1:/remote/remote1/remotestore1:write@pbs:RemoteSyncOperator > ns: None, > owner: Some(write_auth_id.clone()), > comment: None, > + resync_corrupt: None, > remove_vanished: None, > max_depth: None, > group_filter: None, > diff --git a/src/api2/pull.rs b/src/api2/pull.rs > index e733c9839e3a..0d4be0e2d228 100644 > --- a/src/api2/pull.rs > +++ b/src/api2/pull.rs > @@ -10,7 +10,7 @@ use pbs_api_types::{ > Authid, BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA, > GROUP_FILTER_LIST_SCHEMA, NS_MAX_DEPTH_REDUCED_SCHEMA, PRIV_DATASTORE_BACKUP, > PRIV_DATASTORE_PRUNE, PRIV_REMOTE_READ, REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, > - TRANSFER_LAST_SCHEMA, > + RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA, > }; > use pbs_config::CachedUserInfo; > use proxmox_human_byte::HumanByte; > @@ -89,6 +89,7 @@ impl TryFrom<&SyncJobConfig> for PullParameters { > sync_job.group_filter.clone(), > sync_job.limit.clone(), > sync_job.transfer_last, > + sync_job.resync_corrupt, > ) > } > } > @@ -240,6 +241,10 @@ pub fn do_sync_job( > schema: TRANSFER_LAST_SCHEMA, > optional: true, > }, > + "resync-corrupt": { > + schema: RESYNC_CORRUPT_SCHEMA, > + optional: true, > + }, > }, > }, > access: { > @@ -264,6 +269,7 @@ async fn pull( > group_filter: Option<Vec<GroupFilter>>, > limit: RateLimitConfig, > transfer_last: Option<usize>, > + resync_corrupt: Option<bool>, > rpcenv: &mut dyn RpcEnvironment, > ) -> Result<String, Error> { > let auth_id: Authid = rpcenv.get_auth_id().unwrap().parse()?; > @@ -301,6 +307,7 @@ async fn pull( > group_filter, > limit, > transfer_last, > + resync_corrupt, > )?; > > // fixme: set to_stdout to false? 
> diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs > index 420e96665662..38a1cf0f5881 100644 > --- a/src/bin/proxmox-backup-manager.rs > +++ b/src/bin/proxmox-backup-manager.rs > @@ -14,8 +14,8 @@ use pbs_api_types::percent_encoding::percent_encode_component; > use pbs_api_types::{ > BackupNamespace, GroupFilter, RateLimitConfig, SyncJobConfig, DATASTORE_SCHEMA, > GROUP_FILTER_LIST_SCHEMA, IGNORE_VERIFIED_BACKUPS_SCHEMA, NS_MAX_DEPTH_SCHEMA, > - REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, TRANSFER_LAST_SCHEMA, UPID_SCHEMA, > - VERIFICATION_OUTDATED_AFTER_SCHEMA, > + REMOTE_ID_SCHEMA, REMOVE_VANISHED_BACKUPS_SCHEMA, RESYNC_CORRUPT_SCHEMA, TRANSFER_LAST_SCHEMA, > + UPID_SCHEMA, VERIFICATION_OUTDATED_AFTER_SCHEMA, > }; > use pbs_client::{display_task_log, view_task_result}; > use pbs_config::sync; > diff --git a/src/server/pull.rs b/src/server/pull.rs > index 3117f7d2c960..67881f83b5e3 100644 > --- a/src/server/pull.rs > +++ b/src/server/pull.rs > @@ -12,7 +12,8 @@ use tracing::info; > > use pbs_api_types::{ > print_store_and_ns, Authid, BackupDir, BackupGroup, BackupNamespace, GroupFilter, Operation, > - RateLimitConfig, Remote, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, PRIV_DATASTORE_BACKUP, > + RateLimitConfig, Remote, VerifyState, MAX_NAMESPACE_DEPTH, PRIV_DATASTORE_AUDIT, > + PRIV_DATASTORE_BACKUP, > }; > use pbs_client::BackupRepository; > use pbs_config::CachedUserInfo; > @@ -55,6 +56,8 @@ pub(crate) struct PullParameters { > group_filter: Vec<GroupFilter>, > /// How many snapshots should be transferred at most (taking the newest N snapshots) > transfer_last: Option<usize>, > + /// Whether to re-sync corrupted snapshots > + resync_corrupt: bool, > } > > impl PullParameters { > @@ -72,12 +75,14 @@ impl PullParameters { > group_filter: Option<Vec<GroupFilter>>, > limit: RateLimitConfig, > transfer_last: Option<usize>, > + resync_corrupt: Option<bool>, > ) -> Result<Self, Error> { > if let Some(max_depth) = max_depth { > ns.check_max_depth(max_depth)?; > remote_ns.check_max_depth(max_depth)?; > }; > let remove_vanished = remove_vanished.unwrap_or(false); > + let resync_corrupt = resync_corrupt.unwrap_or(false); > > let source: Arc<dyn SyncSource> = if let Some(remote) = remote { > let (remote_config, _digest) = pbs_config::remote::config()?; > @@ -116,6 +121,7 @@ impl PullParameters { > max_depth, > group_filter, > transfer_last, > + resync_corrupt, > }) > } > } > @@ -175,9 +181,10 @@ async fn pull_index_chunks<I: IndexFile>( > target.cond_touch_chunk(&info.digest, false) > })?; > if chunk_exists { > - //info!("chunk {} exists {}", pos, hex::encode(digest)); > + //info!("chunk exists {}", hex::encode(info.digest)); > return Ok::<_, Error>(()); > } > + > //info!("sync {} chunk {}", pos, hex::encode(digest)); > let chunk = chunk_reader.read_raw_chunk(&info.digest).await?; > let raw_size = chunk.raw_size() as usize; > @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>( > reader: Arc<dyn SyncSourceReader + 'a>, > snapshot: &'a pbs_datastore::BackupDir, > downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, > + resync_corrupt: bool, would be good to update the doc comment of this method to mention the new param in the description of steps > ) -> Result<SyncStats, Error> { > let mut sync_stats = SyncStats::default(); > let mut manifest_name = snapshot.full_path(); > @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>( > return Ok(sync_stats); > } > > + let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed); Hmm, but currently you'd only need 
to get the state if resync_corrupt is true? Or at least if the local manifest already exists, if we want to allow re-syncing on some other conditions (but that can be adapted to that once required). e.g.: let must_resync_exisiting = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed); That would make the checks below a bit easier to read. And would it make sense to check the downloaded manifest's verify state, no point in re-pulling if the other side is also corrupt, or is that possibility already excluded somewhere else in this code path? > + > if manifest_name.exists() { > let manifest_blob = proxmox_lang::try_block!({ > let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| { > @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>( > format_err!("unable to read local manifest {manifest_name:?} - {err}") > })?; > > - if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() { > + if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data()) unnecessary addition of parenthesis a == b && (!c && !d) would be fine too and less noise to read, so to say. > + && (!resync_corrupt && !snapshot_is_corrupt) This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it slightly easier to read for me, and I'm slightly confused as why it's not `.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above: if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_exisiting { // ... } Or why would the expression need to be different here? > + { > if !client_log_name.exists() { > reader.try_download_client_log(&client_log_name).await?; > }; > @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>( > let mut path = snapshot.full_path(); > path.push(&item.filename); > > - if path.exists() { > + if !(resync_corrupt && snapshot_is_corrupt) && path.exists() { > match ArchiveType::from_path(&item.filename)? { > ArchiveType::DynamicIndex => { > let index = DynamicIndexReader::open(&path)?; > @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>( > } > } > > + if resync_corrupt && snapshot_is_corrupt { > + info!("pulling snapshot {} again", snapshot.dir()); maybe include the reason for re-pulling this here, e.g. something like: "re-syncing snapshot {} due to bad verification result" And wouldn't that get logged multiple times for backups with multiple archives, like e.g. guests with multiple disks? Makes probably more sense to log that before the for loop > + } > + > let stats = > pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?; > sync_stats.add(stats); > @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>( > reader: Arc<dyn SyncSourceReader + 'a>, > snapshot: &'a pbs_datastore::BackupDir, > downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, > + resync_corrupt: bool, > ) -> Result<SyncStats, Error> { > let (_path, is_new, _snap_lock) = snapshot > .datastore() > @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>( > let sync_stats = if is_new { > info!("sync snapshot {}", snapshot.dir()); > > - match pull_snapshot(reader, snapshot, downloaded_chunks).await { > + match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await { > Err(err) => { > if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir( > snapshot.backup_ns(), > @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>( > } > } else { > info!("re-sync snapshot {}", snapshot.dir()); > - pull_snapshot(reader, snapshot, downloaded_chunks).await? 
> + pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await? > }; > > Ok(sync_stats) > @@ -528,21 +545,24 @@ async fn pull_group( > .enumerate() > .filter(|&(pos, ref dir)| { > source_snapshots.insert(dir.time); > - if last_sync_time > dir.time { > - already_synced_skip_info.update(dir.time); > - return false; > - } else if already_synced_skip_info.count > 0 { > - info!("{already_synced_skip_info}"); > - already_synced_skip_info.reset(); > - return true; > - } > + // If resync_corrupt is set, we go through all the remote snapshots > + if !params.resync_corrupt { maybe reverse that and use an explicit return? i.e.: if params.resync_corrupt { return true; // always check all remote snapshots when resyncing } // rest of the filter would avoid an indentation level for the non-bypass path and be slightly easier to read to me > + if last_sync_time > dir.time { > + already_synced_skip_info.update(dir.time); > + return false; > + } else if already_synced_skip_info.count > 0 { > + info!("{already_synced_skip_info}"); > + already_synced_skip_info.reset(); > + return true; > + } > > - if pos < cutoff && last_sync_time != dir.time { > - transfer_last_skip_info.update(dir.time); > - return false; > - } else if transfer_last_skip_info.count > 0 { > - info!("{transfer_last_skip_info}"); > - transfer_last_skip_info.reset(); > + if pos < cutoff && last_sync_time != dir.time { > + transfer_last_skip_info.update(dir.time); > + return false; > + } else if transfer_last_skip_info.count > 0 { > + info!("{transfer_last_skip_info}"); > + transfer_last_skip_info.reset(); > + } > } > true > }) > @@ -566,7 +586,13 @@ async fn pull_group( > .source > .reader(source_namespace, &from_snapshot) > .await?; > - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await; > + let result = pull_snapshot_from( > + reader, > + &to_snapshot, > + downloaded_chunks.clone(), > + params.resync_corrupt, > + ) > + .await; > > progress.done_snapshots = pos as u64 + 1; > info!("percentage done: {progress}"); _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
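A minimal sketch of how the snapshot filter in `pull_group()` might read with the early-return suggestion from this review applied; it is assembled from the quoted diff for illustration only and is not necessarily the code that ends up in v2:

    .filter(|&(pos, ref dir)| {
        source_snapshots.insert(dir.time);

        if params.resync_corrupt {
            // always consider every remote snapshot when re-syncing corrupt ones
            return true;
        }

        if last_sync_time > dir.time {
            already_synced_skip_info.update(dir.time);
            return false;
        } else if already_synced_skip_info.count > 0 {
            info!("{already_synced_skip_info}");
            already_synced_skip_info.reset();
            return true;
        }

        if pos < cutoff && last_sync_time != dir.time {
            transfer_last_skip_info.update(dir.time);
            return false;
        } else if transfer_last_skip_info.count > 0 {
            info!("{transfer_last_skip_info}");
            transfer_last_skip_info.reset();
        }

        true
    })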
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-17 6:15 ` Thomas Lamprecht @ 2024-10-17 13:20 ` Gabriel Goller 2024-10-17 14:09 ` Thomas Lamprecht 2024-10-21 8:16 ` Lukas Wagner 0 siblings, 2 replies; 9+ messages in thread From: Gabriel Goller @ 2024-10-17 13:20 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox Backup Server development discussion On 17.10.2024 08:15, Thomas Lamprecht wrote: >Am 15/10/2024 um 15:18 schrieb Gabriel Goller: >> This option allows us to "fix" corrupt snapshots (and/or their chunks) >> by pulling them from another remote. When traversing the remote >> snapshots, we check if it exists locally, and if it is, we check if the >> last verification of it failed. If the local snapshot is broken and the >> `resync-corrupt` option is turned on, we pull in the remote snapshot, >> overwriting the local one. >> >> This is very useful and has been requested a lot, as there is currently >> no way to "fix" corrupt chunks/snapshots even if the user has a healthy >> version of it on their offsite instance. >> >> Signed-off-by: Gabriel Goller <g.goller@proxmox.com> >> Originally-by: Shannon Sterz <s.sterz@proxmox.com> > >those two trailers probably should be switched to uphold causal ordering. Makes sense. >In general, it would be _really_ nice to get some regression testing for >sync jobs covering the selection with different matching and options. >The lack of that doesn't have to block your/shannon's patch, but IMO it >should be added rather sooner than later. Sync is becoming more and more >complex, and the selection of snapshots is a very important and fundamental >part of it. Do you think we could mock ourselves out of this and unit-test it somehow? For full regression tests we will need Lukas's CI work. >> --- >> pbs-api-types/src/jobs.rs | 11 +++++ >> pbs-datastore/src/backup_info.rs | 13 +++++- >> src/api2/config/sync.rs | 4 ++ >> src/api2/pull.rs | 9 +++- >> src/bin/proxmox-backup-manager.rs | 4 +- >> src/server/pull.rs | 68 +++++++++++++++++++++---------- >> 6 files changed, 84 insertions(+), 25 deletions(-) >> >> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs >> index 868702bc059e..a3cd68bf5afd 100644 >> --- a/pbs-api-types/src/jobs.rs >> +++ b/pbs-api-types/src/jobs.rs >> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema = >> .minimum(1) >> .schema(); >> >> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new( >> + "If the verification failed for a local snapshot, try to pull its chunks again.", > >not only chunks, also its indexes (which might be also a source of a bad >verification result I think) Yes, correct, I adjusted the description above. 
>> +) >> +.schema(); >> + >> #[api( >> properties: { >> id: { >> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema = >> schema: TRANSFER_LAST_SCHEMA, >> optional: true, >> }, >> + "resync-corrupt": { >> + schema: RESYNC_CORRUPT_SCHEMA, >> + optional: true, >> + } >> } >> )] >> #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)] >> @@ -585,6 +594,8 @@ pub struct SyncJobConfig { >> pub limit: RateLimitConfig, >> #[serde(skip_serializing_if = "Option::is_none")] >> pub transfer_last: Option<usize>, >> + #[serde(skip_serializing_if = "Option::is_none")] >> + pub resync_corrupt: Option<bool>, >> } >> >> impl SyncJobConfig { >> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs >> index 414ec878d01a..2d283dbe5cd9 100644 >> --- a/pbs-datastore/src/backup_info.rs >> +++ b/pbs-datastore/src/backup_info.rs >> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error}; >> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions}; >> >> use pbs_api_types::{ >> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX, >> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState, >> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX, >> }; >> use pbs_config::{open_backup_lockfile, BackupLockGuard}; >> >> @@ -583,6 +584,16 @@ impl BackupDir { >> >> Ok(()) >> } >> + >> + /// Load the verify state from the manifest. >> + pub fn verify_state(&self) -> Option<VerifyState> { >> + self.load_manifest().ok().and_then(|(m, _)| { > >hmm, just ignoring errors here seems a bit odd, that might be a case where one >might want to re-sync. So I'd prefer wrapping this a result, even if it's a bit >tedious. Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now, but I'd be wary about treating a failed `load_manifest` the same as a `VerifyState::Failed`. Especially because currently an unverified snapshot simply returns `Err(null...)`. So we would first need to extend `VerifyState` to have a `Unknown` variant. >> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>( >> reader: Arc<dyn SyncSourceReader + 'a>, >> snapshot: &'a pbs_datastore::BackupDir, >> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, >> + resync_corrupt: bool, > >would be good to update the doc comment of this method to mention the new param >in the description of steps done. >> ) -> Result<SyncStats, Error> { >> let mut sync_stats = SyncStats::default(); >> let mut manifest_name = snapshot.full_path(); >> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>( >> return Ok(sync_stats); >> } >> >> + let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed); > >Hmm, but currently you'd only need to get the state if resync_corrupt is true? >Or at least if the local manifest already exists, if we want to allow re-syncing >on some other conditions (but that can be adapted to that once required). > >e.g.: > >let must_resync_exisiting = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed); > >That would make the checks below a bit easier to read. Yep, done as well, this makes all the checks easier to read. >And would it make sense to check the downloaded manifest's verify state, no point >in re-pulling if the other side is also corrupt, or is that possibility already >excluded somewhere else in this code path? 
This should be covered, because: - if the index is broken, we can't even open it - if an unencrypted chunk is corrupted, the crc is checked when reading it (pbs-client/src/remote_chunk_reader.rs:55) - if an encrypted chunk is corrupted, the crc is checked as well (pbs-datastore/src/data_blob.rs:81) >> + >> if manifest_name.exists() { >> let manifest_blob = proxmox_lang::try_block!({ >> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| { >> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>( >> format_err!("unable to read local manifest {manifest_name:?} - {err}") >> })?; >> >> - if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() { >> + if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data()) > >unnecessary addition of parenthesis a == b && (!c && !d) would be fine too and less >noise to read, so to say. Changed. >> + && (!resync_corrupt && !snapshot_is_corrupt) > >This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it >slightly easier to read for me, and I'm slightly confused as why it's not >`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above: > >if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_exisiting { > // ... >} > >Or why would the expression need to be different here? Changed this as well after combining the flags. >> + { >> if !client_log_name.exists() { >> reader.try_download_client_log(&client_log_name).await?; >> }; >> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>( >> let mut path = snapshot.full_path(); >> path.push(&item.filename); >> >> - if path.exists() { >> + if !(resync_corrupt && snapshot_is_corrupt) && path.exists() { >> match ArchiveType::from_path(&item.filename)? { >> ArchiveType::DynamicIndex => { >> let index = DynamicIndexReader::open(&path)?; >> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>( >> } >> } >> >> + if resync_corrupt && snapshot_is_corrupt { >> + info!("pulling snapshot {} again", snapshot.dir()); > >maybe include the reason for re-pulling this here, e.g. something like: > >"re-syncing snapshot {} due to bad verification result" > >And wouldn't that get logged multiple times for backups with multiple archives, like >e.g. guests with multiple disks? Makes probably more sense to log that before the for >loop Yep, you're right, moved it to the beginning of the loop! >> + } >> + >> let stats = >> pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?; >> sync_stats.add(stats); >> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>( >> reader: Arc<dyn SyncSourceReader + 'a>, >> snapshot: &'a pbs_datastore::BackupDir, >> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>, >> + resync_corrupt: bool, >> ) -> Result<SyncStats, Error> { >> let (_path, is_new, _snap_lock) = snapshot >> .datastore() >> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>( >> let sync_stats = if is_new { >> info!("sync snapshot {}", snapshot.dir()); >> >> - match pull_snapshot(reader, snapshot, downloaded_chunks).await { >> + match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await { >> Err(err) => { >> if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir( >> snapshot.backup_ns(), >> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>( >> } >> } else { >> info!("re-sync snapshot {}", snapshot.dir()); >> - pull_snapshot(reader, snapshot, downloaded_chunks).await? >> + pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await? 
>> }; >> >> Ok(sync_stats) >> @@ -528,21 +545,24 @@ async fn pull_group( >> .enumerate() >> .filter(|&(pos, ref dir)| { >> source_snapshots.insert(dir.time); >> - if last_sync_time > dir.time { >> - already_synced_skip_info.update(dir.time); >> - return false; >> - } else if already_synced_skip_info.count > 0 { >> - info!("{already_synced_skip_info}"); >> - already_synced_skip_info.reset(); >> - return true; >> - } >> + // If resync_corrupt is set, we go through all the remote snapshots >> + if !params.resync_corrupt { > >maybe reverse that and use an explicit return? i.e.: > >if params.resync_corrupt { > return true; // always check all remote snapshots when resyncing >} > >// rest of the filter > >would avoid an indentation level for the non-bypass path and be slightly easier >to read to me True, fixed this as well. Thanks for the review! _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
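A rough sketch of the revised helper described above, returning a `Result` instead of silently discarding manifest errors; it reuses `load_manifest()` and the `verify_state` entry from the quoted diff, but the actual v2 implementation may differ:

    /// Load the verify state from the manifest.
    ///
    /// Note: an unverified snapshot has no `verify_state` entry in its manifest,
    /// so it currently surfaces as an `Err` here (until a dedicated `Unknown`
    /// variant exists, as discussed above).
    pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
        let (manifest, _) = self.load_manifest()?;
        let verify_state = manifest.unprotected["verify_state"].clone();
        let state: SnapshotVerifyState = serde_json::from_value(verify_state)?;
        Ok(state.state)
    }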
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-17 13:20 ` Gabriel Goller @ 2024-10-17 14:09 ` Thomas Lamprecht 2024-10-18 9:08 ` Gabriel Goller 0 siblings, 1 reply; 9+ messages in thread From: Thomas Lamprecht @ 2024-10-17 14:09 UTC (permalink / raw) To: Gabriel Goller; +Cc: Proxmox Backup Server development discussion Am 17/10/2024 um 15:20 schrieb Gabriel Goller: > On 17.10.2024 08:15, Thomas Lamprecht wrote: >> In general, it would be _really_ nice to get some regression testing for >> sync jobs covering the selection with different matching and options. >> The lack of that doesn't have to block your/shannon's patch, but IMO it >> should be added rather sooner than later. Sync is becoming more and more >> complex, and the selection of snapshots is a very important and fundamental >> part of it. > > Do you think we could mock ourselves out of this and unit-test it > somehow? For full regression tests we will need Lukas's CI work. I hope so, as that is mostly pure application logic. Albeit I currently can't give you tips for what the best/simplest way for mocking this would be. Obvious candidates would probably be to pull out the parts for getting local and remote backup snapshot lists and info into a trait or alternatively refactoring more logic out that gets the info passed as parameter and call that in the tests with the test data. I have no strong preference here and there might be even better options for this specific case, but it'd be great if the existing "real" code does not need to bend backwards to support the mocking and that defining or expanding the tests isn't too tedious. FWIW, one option might even be to have the list of snapshots defined and generating two directory trees during build that mimic the actual datastores more closely and might require less mocking and thus have bigger code coverage. >>>> + /// Load the verify state from the manifest. >>>> + pub fn verify_state(&self) -> Option<VerifyState> { >>>> + self.load_manifest().ok().and_then(|(m, _)| { >>> >>> hmm, just ignoring errors here seems a bit odd, that might be a case where one >>> might want to re-sync. So I'd prefer wrapping this a result, even if it's a bit >>> tedious. >> >> Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now, >> but I'd be wary about treating a failed `load_manifest` the same as a >> `VerifyState::Failed`. Especially because currently an unverified >> snapshot simply returns `Err(null...)`. So we would first need to extend >> `VerifyState` to have a `Unknown` variant. > >Yeah, I agree with you not treating a failure to load/parse the manifest >the same as a failed verification state, but the caller can log the error >and thus at least make the user aware of something unexpected happening. >> And would it make sense to check the downloaded manifest's verify state, no point >> in re-pulling if the other side is also corrupt, or is that possibility already >> excluded somewhere else in this code path? >> >> This should be covered, because: >> - if the index is broken, we can't even open it >> - if an unencrypted chunk is corrupted, the crc is checked when reading >> it (pbs-client/src/remote_chunk_reader.rs:55) >> - if an encrypted chunk is corrupted, the crc is checked as well >> (pbs-datastore/src/data_blob.rs:81) ok, thanks for confirming this. 
_______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-17 14:09 ` Thomas Lamprecht @ 2024-10-18 9:08 ` Gabriel Goller 0 siblings, 0 replies; 9+ messages in thread From: Gabriel Goller @ 2024-10-18 9:08 UTC (permalink / raw) To: Thomas Lamprecht; +Cc: Proxmox Backup Server development discussion On 17.10.2024 16:09, Thomas Lamprecht wrote: >Am 17/10/2024 um 15:20 schrieb Gabriel Goller: >> On 17.10.2024 08:15, Thomas Lamprecht wrote: >>> In general, it would be _really_ nice to get some regression testing for >>> sync jobs covering the selection with different matching and options. >>> The lack of that doesn't have to block your/shannon's patch, but IMO it >>> should be added rather sooner than later. Sync is becoming more and more >>> complex, and the selection of snapshots is a very important and fundamental >>> part of it. >> >> Do you think we could mock ourselves out of this and unit-test it >> somehow? For full regression tests we will need Lukas's CI work. > >I hope so, as that is mostly pure application logic. Albeit I'm I currently >can't give you tips for what the best/simplest way for mocking this would be. >Obvious candidates would probably be to pull out the parts for getting local and >remote backup snapshot lists and info into a trait or alternatively >refactoring more logic out that gets the info passed as parameter and call >that in the tests with the test data. > >I have no strong preference here and there might be even better options for >this specific cases, but it'd be great if the existing "real" code does not >need to bend backwards to support the mocking and that defining or expanding >the tests isn't too tedious. >FWIW, one option might even be to have the list of snapshots defined and >generating two directory trees during build that mimic the actual datastores >more closely and might require less mocking and thus have bigger code coverage. Interesting, I'll have a look at it when I'm done with my other stuff :) >>>> + /// Load the verify state from the manifest. >>>> + pub fn verify_state(&self) -> Option<VerifyState> { >>>> + self.load_manifest().ok().and_then(|(m, _)| { >>> >>> hmm, just ignoring errors here seems a bit odd, that might be a case where one >>> might want to re-sync. So I'd prefer wrapping this a result, even if it's a bit >>> tedious. >> >> Converted to (and returned) a `Result<VerifyState, anyhow::Error>` for now, >> but I'd be wary about treating a failed `load_manifest` the same as a >> `VerifyState::Failed`. Especially because currently an unverified >> snapshot simply returns `Err(null...)`. So we would first need to extend >> `VerifyState` to have a `Unknown` variant. > >Yeah, I agree with you not treating a failure to load/parse the manifest >the same as a failed verification state, but the caller can log the error >and thus at least make the user aware of something unexpected happening. Will do! >>> And would it make sense to check the downloaded manifest's verify state, no point >>> in re-pulling if the other side is also corrupt, or is that possibility already >>> excluded somewhere else in this code path? >> >> This should be covered, because: >> - if the index is broken, we can't even open it >> - if an unencrypted chunk is corrupted, the crc is checked when reading >> it (pbs-client/src/remote_chunk_reader.rs:55) >> - if an encrypted chunk is corrupted, the crc is checked as well >> (pbs-datastore/src/data_blob.rs:81) > >ok, thanks for confirming this. Thanks for the reply! 
Posted a v2. _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs 2024-10-17 13:20 ` Gabriel Goller 2024-10-17 14:09 ` Thomas Lamprecht @ 2024-10-21 8:16 ` Lukas Wagner 1 sibling, 0 replies; 9+ messages in thread From: Lukas Wagner @ 2024-10-21 8:16 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Gabriel Goller, Thomas Lamprecht On 2024-10-17 15:20, Gabriel Goller wrote: > On 17.10.2024 08:15, Thomas Lamprecht wrote: >> Am 15/10/2024 um 15:18 schrieb Gabriel Goller: >>> This option allows us to "fix" corrupt snapshots (and/or their chunks) >>> by pulling them from another remote. When traversing the remote >>> snapshots, we check if it exists locally, and if it is, we check if the >>> last verification of it failed. If the local snapshot is broken and the >>> `resync-corrupt` option is turned on, we pull in the remote snapshot, >>> overwriting the local one. >>> >>> This is very useful and has been requested a lot, as there is currently >>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy >>> version of it on their offsite instance. >>> >>> Signed-off-by: Gabriel Goller <g.goller@proxmox.com> >>> Originally-by: Shannon Sterz <s.sterz@proxmox.com> >> >> those two trailers probably should be switched to uphold causal ordering. > > Makes sense. > >> In general, it would be _really_ nice to get some regression testing for >> sync jobs covering the selection with different matching and options. >> The lack of that doesn't have to block your/shannon's patch, but IMO it >> should be added rather sooner than later. Sync is becoming more and more >> complex, and the selection of snapshots is a very important and fundamental >> part of it. > > Do you think we could mock ourselves out of this and unit-test it > somehow? For full regression tests we will need Lukas's CI work. > To chime in here, while it will be certainly possible to test such stuff once the end-to-end testing tooling is finished (it's getting there), in most cases it is preferable to test as much as possible in unit tests as they are - much faster (<<1s, compared to end-to-end tests which could take seconds or even minutes, depending on what they do) - easier to write and maintain (you are much closer to the code under test) - also serve as documentation for the code (e.g. what contracts a unit of code must uphold) - far shorter feedback loop / better DX (cargo test vs. having to set up/run test instance VMs or trigger some CI system) Especially for something like the snapshot selection in sync jobs, a good test suite might cover many different permutations and corner cases, which would be *a lot of* of work to write and also take *a long time* to execute in a full bells and whistles end-to-end test where you perform an actual sync job between two real PBS installations. That being said, it will be of course a challenge to factor out the sync logic to make it testable. Without me being familiar with the code, it could involve abstracting the interactions with the rest of the system with some trait, moving the sync logic into a separate structs / functions and use dependency injection to give the sync module/fn a concrete implementation to use (in tests you'd then use a fake implementation). 
For existing code these refactoring steps can be quite intimidating, which is the reason why I am a big fan of writing unit tests *while* writing the actual implementation (similar to, but not quite TDD; there you'd write the tests *first*), as this ensures that the code I'm writing is actually testable. -- - Lukas _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
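To make the trait and dependency-injection idea from the last two mails a bit more concrete, here is a small, self-contained sketch. All names in it (`SnapshotLister`, `FakeLister`, `select_snapshots`) are invented for the example and do not exist in the code base; it only shows the kind of seam that would let the snapshot selection be unit-tested against a fake snapshot list:

    /// Example trait: the bare minimum the selection logic needs from a sync source.
    trait SnapshotLister {
        /// Backup times of the snapshots offered by the source, oldest first.
        fn snapshot_times(&self) -> Vec<i64>;
    }

    /// Fake implementation fed with hard-coded data in tests.
    struct FakeLister(Vec<i64>);

    impl SnapshotLister for FakeLister {
        fn snapshot_times(&self) -> Vec<i64> {
            self.0.clone()
        }
    }

    /// Stand-in for the real selection rule: only take snapshots newer than the
    /// last successful sync (the real logic also handles transfer-last, group
    /// filters, resync-corrupt, ...).
    fn select_snapshots(source: &dyn SnapshotLister, last_sync_time: i64) -> Vec<i64> {
        source
            .snapshot_times()
            .into_iter()
            .filter(|&time| time > last_sync_time)
            .collect()
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn skips_already_synced_snapshots() {
            let source = FakeLister(vec![100, 200, 300]);
            assert_eq!(select_snapshots(&source, 200), vec![300]);
        }
    }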
* [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs 2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller @ 2024-10-15 13:18 ` Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller 2 siblings, 0 replies; 9+ messages in thread From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw) To: pbs-devel Add the `resync-corrupt` option to the ui and the `proxmox-backup-manager` cli. It is listed in the `Advanced` section, because it slows the sync-job down and is useless if no verification job was run beforehand. Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Originally-by: Shannon Sterz <s.sterz@proxmox.com> --- src/bin/proxmox-backup-manager.rs | 9 +++++++++ www/window/SyncJobEdit.js | 11 +++++++++++ 2 files changed, 20 insertions(+) diff --git a/src/bin/proxmox-backup-manager.rs b/src/bin/proxmox-backup-manager.rs index 38a1cf0f5881..08728e9d7250 100644 --- a/src/bin/proxmox-backup-manager.rs +++ b/src/bin/proxmox-backup-manager.rs @@ -339,6 +339,10 @@ fn task_mgmt_cli() -> CommandLineInterface { schema: TRANSFER_LAST_SCHEMA, optional: true, }, + "resync-corrupt": { + schema: RESYNC_CORRUPT_SCHEMA, + optional: true, + }, } } )] @@ -355,6 +359,7 @@ async fn pull_datastore( group_filter: Option<Vec<GroupFilter>>, limit: RateLimitConfig, transfer_last: Option<usize>, + resync_corrupt: Option<bool>, param: Value, ) -> Result<Value, Error> { let output_format = get_output_format(¶m); @@ -391,6 +396,10 @@ async fn pull_datastore( args["transfer-last"] = json!(transfer_last) } + if let Some(resync_corrupt) = resync_corrupt { + args["resync-corrupt"] = Value::from(resync_corrupt); + } + let mut limit_json = json!(limit); let limit_map = limit_json .as_object_mut() diff --git a/www/window/SyncJobEdit.js b/www/window/SyncJobEdit.js index 6543995e8800..a3c497fc2185 100644 --- a/www/window/SyncJobEdit.js +++ b/www/window/SyncJobEdit.js @@ -321,6 +321,17 @@ Ext.define('PBS.window.SyncJobEdit', { deleteEmpty: '{!isCreate}', }, }, + { + fieldLabel: gettext('Resync corrupt snapshots'), + xtype: 'proxmoxcheckbox', + name: 'resync-corrupt', + autoEl: { + tag: 'div', + 'data-qtip': gettext('Re-sync snapshots, whose verfification failed.'), + }, + uncheckedValue: false, + value: false, + }, ], }, { -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job 2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller 2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller @ 2024-10-15 13:18 ` Gabriel Goller 2 siblings, 0 replies; 9+ messages in thread From: Gabriel Goller @ 2024-10-15 13:18 UTC (permalink / raw) To: pbs-devel Add short section explaining the `resync-corrupt` option on the sync-job. Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Originally-by: Shannon Sterz <s.sterz@proxmox.com> --- docs/managing-remotes.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/managing-remotes.rst b/docs/managing-remotes.rst index 38ccd5af25d5..b441daa9b0bd 100644 --- a/docs/managing-remotes.rst +++ b/docs/managing-remotes.rst @@ -132,6 +132,12 @@ For mixing include and exclude filter, following rules apply: .. note:: The ``protected`` flag of remote backup snapshots will not be synced. +Enabling the advanced option 'resync-corrupt' will re-sync all snapshots that have +failed to verify during the last :ref:`maintenance_verification`. Hence, a verification +job needs to be run before a sync job with 'resync-corrupt' can be carried out. Be aware +that a 'resync-corrupt'-job needs to check the manifests of all snapshots in a datastore +and might take much longer than regular sync jobs. + Namespace Support ^^^^^^^^^^^^^^^^^ -- 2.39.5 _______________________________________________ pbs-devel mailing list pbs-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel ^ permalink raw reply [flat|nested] 9+ messages in thread