From: Gabriel Goller <g.goller@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Date: Thu, 17 Oct 2024 15:20:10 +0200	[thread overview]
Message-ID: <20241017132010.5q7i7llfkn6jcuqn@luna.proxmox.com> (raw)
In-Reply-To: <6376441e-f2ac-4cb6-ae63-39f21bb2c4c3@proxmox.com>

On 17.10.2024 08:15, Thomas Lamprecht wrote:
>Am 15/10/2024 um 15:18 schrieb Gabriel Goller:
>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>> by pulling them from another remote. When traversing the remote
>> snapshots, we check if each one exists locally, and if it does, whether its
>> last verification failed. If the local snapshot is broken and the
>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>> overwriting the local one.
>>
>> This is very useful and has been requested a lot, as there is currently
>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>> version of it on their offsite instance.
>>
>> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
>> Originally-by: Shannon Sterz <s.sterz@proxmox.com>
>
>those two trailers probably should be switched to uphold causal ordering.

Makes sense.

>In general, it would be _really_ nice to get some regression testing for
>sync jobs covering the selection with different matching and options.
>The lack of that doesn't have to block your/shannon's patch, but IMO it
>should be added rather sooner than later. Sync is becoming more and more
>complex, and the selection of snapshots is a very important and fundamental
>part of it.

Do you think we could mock ourselves out of this and unit-test it
somehow? For full regression tests we will need Lukas's CI work.
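
One thing we could maybe do short-term is pull the snapshot-selection
logic into a small pure helper and unit-test that directly, without any
datastore or remote in the picture. A completely made-up sketch of what
I mean (the helper and its signature don't exist, they're only for
illustration):

    #[cfg(test)]
    mod tests {
        // Hypothetical pure helper: decide which remote snapshot times to
        // pull, given the last locally synced time and the resync flag.
        fn select_snapshots(remote: &[i64], last_sync_time: i64, resync_corrupt: bool) -> Vec<i64> {
            remote
                .iter()
                .copied()
                .filter(|&time| resync_corrupt || time >= last_sync_time)
                .collect()
        }

        #[test]
        fn skips_already_synced_snapshots() {
            assert_eq!(select_snapshots(&[1, 2, 3], 2, false), vec![2, 3]);
        }

        #[test]
        fn resync_corrupt_considers_all_snapshots() {
            assert_eq!(select_snapshots(&[1, 2, 3], 2, true), vec![1, 2, 3]);
        }
    }
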

>> ---
>>  pbs-api-types/src/jobs.rs         | 11 +++++
>>  pbs-datastore/src/backup_info.rs  | 13 +++++-
>>  src/api2/config/sync.rs           |  4 ++
>>  src/api2/pull.rs                  |  9 +++-
>>  src/bin/proxmox-backup-manager.rs |  4 +-
>>  src/server/pull.rs                | 68 +++++++++++++++++++++----------
>>  6 files changed, 84 insertions(+), 25 deletions(-)
>>
>> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
>> index 868702bc059e..a3cd68bf5afd 100644
>> --- a/pbs-api-types/src/jobs.rs
>> +++ b/pbs-api-types/src/jobs.rs
>> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>          .minimum(1)
>>          .schema();
>>
>> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
>> +    "If the verification failed for a local snapshot, try to pull its chunks again.",
>
>not only chunks, also its indexes (which might be also a source of a bad
>verification result I think)

Yes, correct, I adjusted the description above.
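
For reference, it now reads roughly like this (the exact wording in v2
may still change):

    pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
        "If the verification failed for a local snapshot, try to pull its indexes and chunks again.",
    )
    .schema();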

>> +)
>> +.schema();
>> +
>>  #[api(
>>      properties: {
>>          id: {
>> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>              schema: TRANSFER_LAST_SCHEMA,
>>              optional: true,
>>          },
>> +        "resync-corrupt": {
>> +            schema: RESYNC_CORRUPT_SCHEMA,
>> +            optional: true,
>> +        }
>>      }
>>  )]
>>  #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
>> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>>      pub limit: RateLimitConfig,
>>      #[serde(skip_serializing_if = "Option::is_none")]
>>      pub transfer_last: Option<usize>,
>> +    #[serde(skip_serializing_if = "Option::is_none")]
>> +    pub resync_corrupt: Option<bool>,
>>  }
>>
>>  impl SyncJobConfig {
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..2d283dbe5cd9 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>>  use pbs_api_types::{
>> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>>  };
>>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>>          Ok(())
>>      }
>> +
>> +    /// Load the verify state from the manifest.
>> +    pub fn verify_state(&self) -> Option<VerifyState> {
>> +        self.load_manifest().ok().and_then(|(m, _)| {
>
>hmm, just ignoring errors here seems a bit odd, that might be a case where one
>might want to re-sync. So I'd prefer wrapping this a result, even if it's a bit
>tedious.

Converted it to return a `Result<VerifyState, anyhow::Error>` for now,
but I'd be wary about treating a failed `load_manifest` the same as
`VerifyState::Failed`, especially because an unverified snapshot
currently just returns `Err(null...)`. So we would first need to extend
`VerifyState` with an `Unknown` variant.
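
Roughly what I have locally now (just a sketch inside `impl BackupDir`,
assuming the verify state is stored under the "verify_state" key of the
manifest's unprotected section; the actual patch may differ):

    /// Load the verify state from the manifest.
    pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
        let (manifest, _) = self.load_manifest()?;
        // deserialize the SnapshotVerifyState written by the verification task;
        // an unverified snapshot has no such entry and therefore errors out here
        let verify: SnapshotVerifyState =
            serde_json::from_value(manifest.unprotected["verify_state"].clone())?;
        Ok(verify.state)
    }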

>> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>
>would be good to update the doc comment of this method to mention the new param
>in the description of steps

done.

>>  ) -> Result<SyncStats, Error> {
>>      let mut sync_stats = SyncStats::default();
>>      let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>>          return Ok(sync_stats);
>>      }
>>
>> +    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
>
>Hmm, but currently you'd only need to get the state if resync_corrupt is true?
>Or at least if the local manifest already exists, if we want to allow re-syncing
>on some other conditions (but that can be adapted to that once required).
>
>e.g.:
>
>let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);
>
>That would make the checks below a bit easier to read.

Yep, done as well, this makes all the checks easier to read.
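
i.e. roughly (a sketch on top of the Result-returning `verify_state`
from above, final naming may still differ):

    // combined flag, so the conditions below stay readable
    let must_resync_existing = resync_corrupt
        && matches!(snapshot.verify_state(), Ok(VerifyState::Failed));

    // manifest comparison: only short-circuit if we don't need to re-sync
    if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
        // ... unchanged: only fetch the client log if it is missing ...
    }

    // per-archive check: skip the "already present and intact" path when re-syncing
    if !must_resync_existing && path.exists() {
        // ... compare the local archive against the manifest entry ...
    }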

>And would it make sense to check the downloaded manifest's verify state, no point
>in re-pulling if the other side is also corrupt, or is that possibility already
>excluded somewhere else in this code path?

This should be covered, because:
  - if the index is broken, we can't even open it
  - if an unencrypted chunk is corrupted, the crc is checked when reading
    it (pbs-client/src/remote_chunk_reader.rs:55)
  - if an encrypted chunk is corrupted, the crc is checked as well 
    (pbs-datastore/src/data_blob.rs:81)
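
To illustrate the last two points, the readers end up doing something
equivalent to this (very simplified, not the actual pbs code):

    // simplified crc check on a downloaded chunk payload
    fn verify_chunk_crc(payload: &[u8], expected_crc: u32) -> Result<(), anyhow::Error> {
        let mut hasher = crc32fast::Hasher::new();
        hasher.update(payload);
        let computed = hasher.finalize();
        if computed != expected_crc {
            anyhow::bail!("chunk crc mismatch: expected {expected_crc}, got {computed}");
        }
        Ok(())
    }

so a corrupted chunk on the source side fails the pull instead of being
silently written to the local datastore.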

>> +
>>      if manifest_name.exists() {
>>          let manifest_blob = proxmox_lang::try_block!({
>>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>>              format_err!("unable to read local manifest {manifest_name:?} - {err}")
>>          })?;
>>
>> -        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>> +        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
>
>unnecessary addition of parenthesis a == b && (!c && !d) would be fine too and less
>noise to read, so to say.

Changed.

>> +            && (!resync_corrupt && !snapshot_is_corrupt)
>
>This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it
>slightly easier to read for me, and I'm slightly confused as why it's not
>`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above:
>
>if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
>    // ...
>}
>
>Or why would the expression need to be different here?

Changed this as well after combining the flags.

>> +        {
>>              if !client_log_name.exists() {
>>                  reader.try_download_client_log(&client_log_name).await?;
>>              };
>> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>>          let mut path = snapshot.full_path();
>>          path.push(&item.filename);
>>
>> -        if path.exists() {
>> +        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>>              match ArchiveType::from_path(&item.filename)? {
>>                  ArchiveType::DynamicIndex => {
>>                      let index = DynamicIndexReader::open(&path)?;
>> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>>              }
>>          }
>>
>> +        if resync_corrupt && snapshot_is_corrupt {
>> +            info!("pulling snapshot {} again", snapshot.dir());
>
>maybe include the reason for re-pulling this here, e.g. something like:
>
>"re-syncing snapshot {} due to bad verification result"
>
>And wouldn't that get logged multiple times for backups with multiple archives, like
>e.g. guests with multiple disks? Makes probably more sense to log that before the for
>loop

Yep, you're right, moved it to the beginning of the loop!
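
The relevant part now reads roughly like this (sketch):

    if must_resync_existing {
        info!(
            "re-syncing snapshot {} due to bad verification result",
            snapshot.dir()
        );
    }

    for item in manifest.files() {
        // ... per-archive download logic as before ...
    }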

>> +        }
>> +
>>          let stats =
>>              pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>>          sync_stats.add(stats);
>> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>>  ) -> Result<SyncStats, Error> {
>>      let (_path, is_new, _snap_lock) = snapshot
>>          .datastore()
>> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>>      let sync_stats = if is_new {
>>          info!("sync snapshot {}", snapshot.dir());
>>
>> -        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
>> +        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>>              Err(err) => {
>>                  if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>>                      snapshot.backup_ns(),
>> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>>          }
>>      } else {
>>          info!("re-sync snapshot {}", snapshot.dir());
>> -        pull_snapshot(reader, snapshot, downloaded_chunks).await?
>> +        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>>      };
>>
>>      Ok(sync_stats)
>> @@ -528,21 +545,24 @@ async fn pull_group(
>>          .enumerate()
>>          .filter(|&(pos, ref dir)| {
>>              source_snapshots.insert(dir.time);
>> -            if last_sync_time > dir.time {
>> -                already_synced_skip_info.update(dir.time);
>> -                return false;
>> -            } else if already_synced_skip_info.count > 0 {
>> -                info!("{already_synced_skip_info}");
>> -                already_synced_skip_info.reset();
>> -                return true;
>> -            }
>> +            // If resync_corrupt is set, we go through all the remote snapshots
>> +            if !params.resync_corrupt {
>
>maybe reverse that and use an explicit return? i.e.:
>
>if params.resync_corrupt {
>    return true; // always check all remote snapshots when resyncing
>}
>
>// rest of the filter
>
>would avoid an indentation level for the non-bypass path and be slightly easier
>to read to me

True, fixed this as well.
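
The filter now starts roughly like this (sketch, rest of the filter
unchanged):

    .filter(|&(pos, ref dir)| {
        source_snapshots.insert(dir.time);
        if params.resync_corrupt {
            // always check all remote snapshots when resyncing
            return true;
        }
        if last_sync_time > dir.time {
            already_synced_skip_info.update(dir.time);
            return false;
        } else if already_synced_skip_info.count > 0 {
            info!("{already_synced_skip_info}");
            already_synced_skip_info.reset();
            return true;
        }
        // ... transfer_last handling using `pos` as before ...
    })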


Thanks for the review!

