From: Gabriel Goller <g.goller@proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Cc: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Date: Mon, 4 Nov 2024 17:02:21 +0100
Message-ID: <hktvuzb35ggshduj4iutuswutxpnostvkd352sedwhlf72cuiq@xomssmwbk2v6>
In-Reply-To: <1730717237.hc5rhqwdje.astroid@yuna.none>

On 04.11.2024 12:51, Fabian Grünbichler wrote:
>this doesn't really do what it says on the tin, see below.
>
>On October 18, 2024 11:09 am, Gabriel Goller wrote:
>> [snip]
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..c86fbb7568ab 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>>  use pbs_api_types::{
>> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>>  };
>>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>>          Ok(())
>>      }
>> +
>> +    /// Load the verify state from the manifest.
>> +    pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
>> +        self.load_manifest().and_then(|(m, _)| {
>> +            let verify = m.unprotected["verify_state"].clone();
>> +            serde_json::from_value::<SnapshotVerifyState>(verify)
>> +                .map(|svs| svs.state)
>> +                .map_err(Into::into)
>> +        })
>> +    }
>
>wouldn't it make more sense to have this as a getter for an optional
>SnapshotVerifyState on the BackupManifest?
>
>then it could go into its own commit, other call sites that load the
>verify state from a manifest could be adapted to it, and then this
>commit can also start using it?

You're right, the BackupManifest is actually loaded twice here. So I
could turn this function into a getter on BackupManifest and move its
construction (src/server/pull.rs:396):

     let manifest = BackupManifest::try_from(tmp_manifest_blob)?;

up to src/server/pull.rs:365. Then, instead of using the try_block to
read from the manifest, just call BackupManifest::try_from.

I'll see if I can make this work...
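
Something like this, as a rough sketch (assuming the verify state keeps
living under the manifest's unprotected["verify_state"] key; the exact
signature is still open):

     impl BackupManifest {
         /// Try to read the (optional) verify state from the manifest.
         pub fn verify_state(&self) -> Result<Option<SnapshotVerifyState>, anyhow::Error> {
             let verify = self.unprotected["verify_state"].clone();
             if verify.is_null() {
                 return Ok(None);
             }
             Ok(Some(serde_json::from_value::<SnapshotVerifyState>(verify)?))
         }
     }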

>also see the comment further below about how the current implementation
>is very noisy if snapshots are newly synced as opposed to resynced..
>
>>  }
>>
>>  impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
>> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
>> index 6fdc69a9e645..fa9db92f3d11 100644
>> --- a/src/api2/config/sync.rs
>> +++ b/src/api2/config/sync.rs
>> [snip]
>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>> index 3117f7d2c960..b2dd15d9d6db 100644
>> --- a/src/server/pull.rs
>> +++ b/src/server/pull.rs
>> @@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
>>  use std::time::SystemTime;
>>
>>  use anyhow::{bail, format_err, Error};
>> +use nom::combinator::verify;
>
>I think this snuck in ;)

Oops, my fault :)

>> [snip]
>> @@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
>>                      target.cond_touch_chunk(&info.digest, false)
>>                  })?;
>>                  if chunk_exists {
>> -                    //info!("chunk {} exists {}", pos, hex::encode(digest));
>> +                    //info!("chunk exists {}", hex::encode(info.digest));
>
>this
>
>>                      return Ok::<_, Error>(());
>>                  }
>> +
>
>and this as well?

Yep, removed both, thanks for the heads-up!

>> [snip]
>> @@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
>>  /// - (Re)download the manifest
>>  /// -- if it matches, only download log and treat snapshot as already synced
>>  /// - Iterate over referenced files
>> -/// -- if file already exists, verify contents
>> +/// -- if file already exists, verify contents or pull again if last
>> +///     verification failed and `resync_corrupt` is true
>>  /// -- if not, pull it from the remote
>>  /// - Download log if not already existing
>>  async fn pull_snapshot<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>>  ) -> Result<SyncStats, Error> {
>>      let mut sync_stats = SyncStats::default();
>>      let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
>>          return Ok(sync_stats);
>>      }
>>
>
>I think this part here is somewhat wrong ordering wise, or at least,
>unnecessarily expensive..
>
>if resync_corrupt is enabled, we want to (in this order!)
>- check the local snapshot for corruption, if it exists
>- if it is corrupt, we proceed with resyncing
>- if not, we only proceed with resyncing if it is the last snapshot in
>  this group, and return early otherwise
>
>that way, we avoid redownloading all the manifests.. but see further
>below for another issue with the current implementation..
>
>> +    let must_resync_existing = resync_corrupt
>> +        && snapshot
>> +            .verify_state()
>> +            .inspect_err(|err| {
>> +                tracing::error!("Failed to check verification state of snapshot: {err:?}")
>
>2024-11-04T12:34:57+01:00: Failed to check verification state of snapshot: unable to load blob '"/tank/pbs/ns/foobar/ns/test/ns/another_test/vm/900/2023-04-06T14:36:00Z/index.json.blob"' - No such file or directory (os error 2)
>
>this seems to be very noisy for newly synced snapshots, because the
>helper is implemented on BackupInfo instead of on BackupManifest..

True, but I think this is mostly because new backups don't have a
verify_state yet (and a missing verify_state doesn't necessarily mean
the snapshot is bad). I removed the log line, as it's silly anyway :)
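
With the getter on BackupManifest, this check could become something
like (rough sketch; assumes the local manifest is already loaded at
this point, per the reordering suggested above):

     let must_resync_existing = resync_corrupt
         && manifest
             .verify_state()
             .ok()
             .flatten()
             .is_some_and(|svs| svs.state == VerifyState::Failed);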

>> +            })
>> +            .is_ok_and(|state| state == VerifyState::Failed);
>> +
>>      if manifest_name.exists() {
>>          let manifest_blob = proxmox_lang::try_block!({
>>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> [snip]
>> @@ -528,6 +554,10 @@ async fn pull_group(
>>          .enumerate()
>>          .filter(|&(pos, ref dir)| {
>>              source_snapshots.insert(dir.time);
>> +            // If resync_corrupt is set, we go through all the remote snapshots
>> +            if params.resync_corrupt {
>> +                return true;
>> +            }
>
>alternatively, we could check the local manifest here, and only include
>existing snapshots with a failed verification state, the last one and
>new ones? that way, we'd get more meaningful progress stats as well..

That's true, that would be cleaner. The downside is that we would have
to open/parse the BackupManifest twice.

I could write something like:

     if params.resync_corrupt {
         // Check the verify state of the local snapshot, if one exists.
         let local_dir = params
             .target
             .store
             .backup_dir(target_ns.clone(), dir.clone());
         if let Ok(local_dir) = local_dir {
             // verify_state() returns a Result, so a missing or unreadable
             // manifest simply doesn't count as Failed here.
             if local_dir
                 .verify_state()
                 .is_ok_and(|state| state == VerifyState::Failed)
             {
                 return true;
             }
         }
     }
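
(If the verify state can't be read, or the snapshot doesn't exist
locally yet, it simply falls through to the regular last_sync_time
filtering below.)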

>because right now, this will not only resync existing corrupt snapshots,
>but also ones that have been pruned locally, but not on the source
>(i.e., the other proposed "fixing" sync mode that syncs "missing"
>old snapshots, not just corrupt ones).

I can't find the mail where this was mentioned/discussed, but I'm quite
sure we said to just pull both, and then maybe separate them in a
future iteration/feature. Now that I think about it, though, the flag
is named `resync_corrupt`, so I'd expect it to only pull in the corrupt
snapshots.

I actually agree with this change; it's probably also more performant
(reading the BackupManifest twice locally is probably faster than
pulling lots of unneeded manifests from the remote).

>>              if last_sync_time > dir.time {
>>                  already_synced_skip_info.update(dir.time);
>>                  return false;
>> @@ -566,7 +596,13 @@ async fn pull_group(
>>              .source
>>              .reader(source_namespace, &from_snapshot)
>>              .await?;
>> -        let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
>> +        let result = pull_snapshot_from(
>> +            reader,
>> +            &to_snapshot,
>> +            downloaded_chunks.clone(),
>> +            params.resync_corrupt,
>> +        )
>> +        .await;
>>
>>          progress.done_snapshots = pos as u64 + 1;
>>          info!("percentage done: {progress}");



Thread overview: 8+ messages
2024-10-18  9:09 [pbs-devel] [PATCH proxmox-backup v2 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-10-18  9:09 ` [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
2024-11-04 11:51   ` Fabian Grünbichler
2024-11-04 16:02     ` Gabriel Goller [this message]
2024-11-05  7:20       ` Fabian Grünbichler
2024-11-05 10:39         ` Gabriel Goller
2024-10-18  9:09 ` [pbs-devel] [PATCH proxmox-backup v2 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
2024-10-18  9:09 ` [pbs-devel] [PATCH proxmox-backup v2 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller