From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 17 Oct 2024 15:20:10 +0200
From: Gabriel Goller
To: Thomas Lamprecht
Cc: Proxmox Backup Server development discussion
Message-ID: <20241017132010.5q7i7llfkn6jcuqn@luna.proxmox.com>
References: <20241015131823.338766-1-g.goller@proxmox.com>
 <20241015131823.338766-2-g.goller@proxmox.com>
 <6376441e-f2ac-4cb6-ae63-39f21bb2c4c3@proxmox.com>
In-Reply-To: <6376441e-f2ac-4cb6-ae63-39f21bb2c4c3@proxmox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
User-Agent: NeoMutt/20220429
Subject: Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add
 resync-corrupt option to sync jobs
List-Id: Proxmox Backup Server development discussion
Reply-To: Proxmox Backup Server development discussion

On 17.10.2024 08:15, Thomas Lamprecht wrote:
>On 15/10/2024 at 15:18, Gabriel Goller wrote:
>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>> by pulling them from another remote. When traversing the remote
>> snapshots, we check whether each one exists locally and, if it does,
>> whether its last verification failed. If the local snapshot is broken
>> and the `resync-corrupt` option is turned on, we pull in the remote
>> snapshot, overwriting the local one.
>>
>> This is very useful and has been requested a lot, as there is currently
>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>> version of it on their offsite instance.
>>
>> Signed-off-by: Gabriel Goller
>> Originally-by: Shannon Sterz
>
>those two trailers probably should be switched to uphold causal ordering.

Makes sense.

>In general, it would be _really_ nice to get some regression testing for
>sync jobs covering the selection with different matching and options.
>The lack of that doesn't have to block your/Shannon's patch, but IMO it
>should be added rather sooner than later. Sync is becoming more and more
>complex, and the selection of snapshots is a very important and fundamental
>part of it.

Do you think we could mock ourselves out of this and unit-test it somehow?
For full regression tests we will need Lukas's CI work.
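
Rough idea of what I mean by mocking/unit testing: if the per-snapshot
decision were factored out into a free function over plain values, it could
be tested without touching a datastore. All names below are made up for
illustration, this is not code from the patch:

// Hypothetical helper, not part of this patch: decide whether a remote
// snapshot should be (re-)pulled, based only on plain values.
pub fn should_pull_snapshot(
    remote_time: i64,
    last_sync_time: i64,
    local_verify_failed: bool,
    resync_corrupt: bool,
) -> bool {
    if resync_corrupt && local_verify_failed {
        // a locally corrupt snapshot is always re-pulled when the option is set
        return true;
    }
    // otherwise only snapshots newer than the last successful sync are pulled
    remote_time > last_sync_time
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn resync_corrupt_overrides_last_sync_time() {
        // older than the last sync, locally corrupt, option enabled -> re-pull
        assert!(should_pull_snapshot(10, 100, true, true));
        // older than the last sync and healthy -> skip
        assert!(!should_pull_snapshot(10, 100, false, true));
        // option disabled -> corruption alone does not trigger a re-pull
        assert!(!should_pull_snapshot(10, 100, true, false));
        // newer than the last sync -> pulled regardless
        assert!(should_pull_snapshot(200, 100, false, false));
    }
}
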
>> ---
>>  pbs-api-types/src/jobs.rs         | 11 +++++
>>  pbs-datastore/src/backup_info.rs  | 13 +++++-
>>  src/api2/config/sync.rs           |  4 ++
>>  src/api2/pull.rs                  |  9 +++-
>>  src/bin/proxmox-backup-manager.rs |  4 +-
>>  src/server/pull.rs                | 68 +++++++++++++++++++++----------
>>  6 files changed, 84 insertions(+), 25 deletions(-)
>>
>> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
>> index 868702bc059e..a3cd68bf5afd 100644
>> --- a/pbs-api-types/src/jobs.rs
>> +++ b/pbs-api-types/src/jobs.rs
>> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>      .minimum(1)
>>      .schema();
>>
>> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
>> +    "If the verification failed for a local snapshot, try to pull its chunks again.",
>
>not only chunks, also its indexes (which might also be a source of a bad
>verification result, I think)

Yes, correct, I adjusted the description above.

>> +)
>> +.schema();
>> +
>>  #[api(
>>      properties: {
>>          id: {
>> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>              schema: TRANSFER_LAST_SCHEMA,
>>              optional: true,
>>          },
>> +        "resync-corrupt": {
>> +            schema: RESYNC_CORRUPT_SCHEMA,
>> +            optional: true,
>> +        }
>>      }
>> )]
>> #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
>> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>>      pub limit: RateLimitConfig,
>>      #[serde(skip_serializing_if = "Option::is_none")]
>>      pub transfer_last: Option,
>> +    #[serde(skip_serializing_if = "Option::is_none")]
>> +    pub resync_corrupt: Option,
>>  }
>>
>>  impl SyncJobConfig {
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..2d283dbe5cd9 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>>  use pbs_api_types::{
>> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>>  };
>>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>>          Ok(())
>>      }
>> +
>> +    /// Load the verify state from the manifest.
>> +    pub fn verify_state(&self) -> Option {
>> +        self.load_manifest().ok().and_then(|(m, _)| {
>
>hmm, just ignoring errors here seems a bit odd, that might be a case where one
>might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a
>bit tedious.

Converted to (and returned) a `Result` for now, but I'd be wary about
treating a failed `load_manifest` the same as a `VerifyState::Failed`.
Especially because currently an unverified snapshot simply returns
`Err(null...)`. So we would first need to extend `VerifyState` to have an
`Unknown` variant.

>> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>>      reader: Arc,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc>>,
>> +    resync_corrupt: bool,
>
>would be good to update the doc comment of this method to mention the new
>param in the description of steps done.

>>  ) -> Result {
>>      let mut sync_stats = SyncStats::default();
>>      let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>>          return Ok(sync_stats);
>>      }
>>
>> +    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
>
>Hmm, but currently you'd only need to get the state if resync_corrupt is true?
>Or at least if the local manifest already exists, if we want to allow
>re-syncing on some other conditions (but that can be adapted once required).
>
>e.g.:
>
>let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);
>
>That would make the checks below a bit easier to read.

Yep, done as well, this makes all the checks easier to read.

>And would it make sense to check the downloaded manifest's verify state? No
>point in re-pulling if the other side is also corrupt - or is that possibility
>already excluded somewhere else in this code path?

This should be covered, because:
- if the index is broken, we can't even open it
- if an unencrypted chunk is corrupted, the crc is checked when reading it
  (pbs-client/src/remote_chunk_reader.rs:55)
- if an encrypted chunk is corrupted, the crc is checked as well
  (pbs-datastore/src/data_blob.rs:81)

>> +
>>      if manifest_name.exists() {
>>          let manifest_blob = proxmox_lang::try_block!({
>>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>>              format_err!("unable to read local manifest {manifest_name:?} - {err}")
>>          })?;
>>
>> -        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>> +        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
>
>unnecessary addition of parentheses, a == b && (!c && !d) would be fine too
>and less noise to read, so to say.

Changed.

>> +            && (!resync_corrupt && !snapshot_is_corrupt)
>
>This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes
>it slightly easier to read for me, and I'm slightly confused as to why it's not
>`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable
>above:
>
>if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
>    // ...
>}
>
>Or why would the expression need to be different here?

Changed this as well after combining the flags.

>> +        {
>>              if !client_log_name.exists() {
>>                  reader.try_download_client_log(&client_log_name).await?;
>>              };
>> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>>          let mut path = snapshot.full_path();
>>          path.push(&item.filename);
>>
>> -        if path.exists() {
>> +        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>>              match ArchiveType::from_path(&item.filename)? {
>>                  ArchiveType::DynamicIndex => {
>>                      let index = DynamicIndexReader::open(&path)?;
>> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>>              }
>>          }
>>
>> +        if resync_corrupt && snapshot_is_corrupt {
>> +            info!("pulling snapshot {} again", snapshot.dir());
>
>maybe include the reason for re-pulling here, e.g. something like:
>
>"re-syncing snapshot {} due to bad verification result"
>
>And wouldn't that get logged multiple times for backups with multiple
>archives, like e.g. guests with multiple disks? It probably makes more sense
>to log that before the for loop.

Yep, you're right, moved it to the beginning of the loop!
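
For reference, this is roughly the shape the checks take with the combined
flag (simplified sketch of the next revision, not the exact code):

// combine the option and the local verify state once, up front
let must_resync_existing = resync_corrupt
    && matches!(snapshot.verify_state(), Ok(VerifyState::Failed));

if must_resync_existing {
    info!("re-syncing snapshot {} due to bad verification result", snapshot.dir());
}

// the "manifests are identical, nothing to do" shortcut only applies to healthy snapshots
if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
    // ...
}

// per archive: only reuse an existing local file if we are not re-syncing
if !must_resync_existing && path.exists() {
    // ...
}
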
>> +        }
>> +
>>          let stats =
>>              pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>>          sync_stats.add(stats);
>> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>>      reader: Arc,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc>>,
>> +    resync_corrupt: bool,
>>  ) -> Result {
>>      let (_path, is_new, _snap_lock) = snapshot
>>          .datastore()
>> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>>      let sync_stats = if is_new {
>>          info!("sync snapshot {}", snapshot.dir());
>>
>> -        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
>> +        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>>              Err(err) => {
>>                  if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>>                      snapshot.backup_ns(),
>> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>>          }
>>      } else {
>>          info!("re-sync snapshot {}", snapshot.dir());
>> -        pull_snapshot(reader, snapshot, downloaded_chunks).await?
>> +        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>>      };
>>
>>      Ok(sync_stats)
>> @@ -528,21 +545,24 @@ async fn pull_group(
>>          .enumerate()
>>          .filter(|&(pos, ref dir)| {
>>              source_snapshots.insert(dir.time);
>> -            if last_sync_time > dir.time {
>> -                already_synced_skip_info.update(dir.time);
>> -                return false;
>> -            } else if already_synced_skip_info.count > 0 {
>> -                info!("{already_synced_skip_info}");
>> -                already_synced_skip_info.reset();
>> -                return true;
>> -            }
>> +            // If resync_corrupt is set, we go through all the remote snapshots
>> +            if !params.resync_corrupt {
>
>maybe reverse that and use an explicit return? i.e.:
>
>if params.resync_corrupt {
>    return true; // always check all remote snapshots when resyncing
>}
>
>// rest of the filter
>
>would avoid an indentation level for the non-bypass path and be slightly
>easier to read to me.

True, fixed this as well.

Thanks for the review!

_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel