From: Gabriel Goller <g.goller@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Date: Thu, 17 Oct 2024 15:20:10 +0200
Message-ID: <20241017132010.5q7i7llfkn6jcuqn@luna.proxmox.com>
In-Reply-To: <6376441e-f2ac-4cb6-ae63-39f21bb2c4c3@proxmox.com>

On 17.10.2024 08:15, Thomas Lamprecht wrote:
>On 15/10/2024 at 15:18, Gabriel Goller wrote:
>> This option allows us to "fix" corrupt snapshots (and/or their chunks)
>> by pulling them from another remote. When traversing the remote
>> snapshots, we check if each one exists locally, and if it does, whether
>> its last verification failed. If the local snapshot is broken and the
>> `resync-corrupt` option is turned on, we pull in the remote snapshot,
>> overwriting the local one.
>>
>> This is very useful and has been requested a lot, as there is currently
>> no way to "fix" corrupt chunks/snapshots even if the user has a healthy
>> version of them on their offsite instance.
>>
>> Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
>> Originally-by: Shannon Sterz <s.sterz@proxmox.com>
>
>those two trailers probably should be switched to uphold causal ordering.

Makes sense.

>In general, it would be _really_ nice to get some regression testing for
>sync jobs covering the selection with different matching and options.
>The lack of that doesn't have to block your/Shannon's patch, but IMO it
>should be added sooner rather than later. Sync is becoming more and more
>complex, and the selection of snapshots is a very important and fundamental
>part of it.

Do you think we could mock our way out of this and unit-test it
somehow? For full regression tests we will need Lukas's CI work.
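
Something like this could maybe work if we pull the selection predicate
out of `pull_group()` into a free function; rough sketch only, all names
below are made up:

    // Hypothetical: the snapshot selection logic, extracted from
    // pull_group() so it can be tested without a datastore or remote.
    fn select_snapshot(snapshot_time: i64, last_sync_time: i64, resync_corrupt: bool) -> bool {
        if resync_corrupt {
            // always consider all remote snapshots when resyncing
            return true;
        }
        snapshot_time >= last_sync_time
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn resync_corrupt_bypasses_last_sync_filter() {
            // older than last_sync_time: normally skipped...
            assert!(!select_snapshot(100, 200, false));
            // ...but still selected with resync-corrupt enabled
            assert!(select_snapshot(100, 200, true));
        }
    }

That would at least cover the filter logic; exercising the actual pull
would still need the CI work.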

>> ---
>>  pbs-api-types/src/jobs.rs         | 11 +++++
>>  pbs-datastore/src/backup_info.rs  | 13 +++++-
>>  src/api2/config/sync.rs           |  4 ++
>>  src/api2/pull.rs                  |  9 +++-
>>  src/bin/proxmox-backup-manager.rs |  4 +-
>>  src/server/pull.rs                | 68 +++++++++++++++++++++----------
>>  6 files changed, 84 insertions(+), 25 deletions(-)
>>
>> diff --git a/pbs-api-types/src/jobs.rs b/pbs-api-types/src/jobs.rs
>> index 868702bc059e..a3cd68bf5afd 100644
>> --- a/pbs-api-types/src/jobs.rs
>> +++ b/pbs-api-types/src/jobs.rs
>> @@ -498,6 +498,11 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>          .minimum(1)
>>          .schema();
>>
>> +pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
>> +    "If the verification failed for a local snapshot, try to pull its chunks again.",
>
>not only chunks, but also its indexes (which might also be a source of a
>bad verification result, I think)

Yes, correct, I adjusted the description above.
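For reference, it now reads roughly like this (final wording may still
change):

    pub const RESYNC_CORRUPT_SCHEMA: Schema = BooleanSchema::new(
        "If the verification failed for a local snapshot, try to pull its chunks and indexes again.",
    )
    .schema();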

>> +)
>> +.schema();
>> +
>>  #[api(
>>      properties: {
>>          id: {
>> @@ -552,6 +557,10 @@ pub const TRANSFER_LAST_SCHEMA: Schema =
>>              schema: TRANSFER_LAST_SCHEMA,
>>              optional: true,
>>          },
>> +        "resync-corrupt": {
>> +            schema: RESYNC_CORRUPT_SCHEMA,
>> +            optional: true,
>> +        }
>>      }
>>  )]
>>  #[derive(Serialize, Deserialize, Clone, Updater, PartialEq)]
>> @@ -585,6 +594,8 @@ pub struct SyncJobConfig {
>>      pub limit: RateLimitConfig,
>>      #[serde(skip_serializing_if = "Option::is_none")]
>>      pub transfer_last: Option<usize>,
>> +    #[serde(skip_serializing_if = "Option::is_none")]
>> +    pub resync_corrupt: Option<bool>,
>>  }
>>
>>  impl SyncJobConfig {
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..2d283dbe5cd9 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>>  use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>>  use pbs_api_types::{
>> -    Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> +    Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> +    BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>>  };
>>  use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>>          Ok(())
>>      }
>> +
>> +    /// Load the verify state from the manifest.
>> +    pub fn verify_state(&self) -> Option<VerifyState> {
>> +        self.load_manifest().ok().and_then(|(m, _)| {
>
>hmm, just ignoring errors here seems a bit odd, that might be a case where one
>might want to re-sync. So I'd prefer wrapping this in a Result, even if it's a
>bit tedious.

Converted it to (and returned) a `Result<VerifyState, anyhow::Error>`
for now, but I'd be wary about treating a failed `load_manifest` the
same as a `VerifyState::Failed`, especially because an unverified
snapshot currently simply returns `Err(null...)`. So we would first
need to extend `VerifyState` with an `Unknown` variant.
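
For reference, the converted helper looks roughly like this (sketch;
the exact manifest access may differ):

    /// Load the verify state from the manifest.
    pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
        let (manifest, _) = self.load_manifest()?;
        let verify = manifest.unprotected["verify_state"].clone();
        // this errors out for unverified snapshots, because the key is
        // simply absent -- hence the `Unknown` variant idea above
        Ok(serde_json::from_value::<SnapshotVerifyState>(verify)?.state)
    }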

>> @@ -332,6 +339,7 @@ async fn pull_snapshot<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>
>would be good to update the doc comment of this method to mention the new param
>in the description of steps

done.
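
The added step reads roughly like this (exact wording may differ):

    /// - if `resync_corrupt` is set and the local snapshot's last verification
    ///   failed, re-download its archives even if they already exist locally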

>>  ) -> Result<SyncStats, Error> {
>>      let mut sync_stats = SyncStats::default();
>>      let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +360,8 @@ async fn pull_snapshot<'a>(
>>          return Ok(sync_stats);
>>      }
>>
>> +    let snapshot_is_corrupt = snapshot.verify_state() == Some(VerifyState::Failed);
>
>Hmm, but currently you'd only need to get the state if resync_corrupt is true?
>Or at least if the local manifest already exists, if we want to allow re-syncing
>on some other conditions (but that can be adapted to that once required).
>
>e.g.:
>
>let must_resync_existing = resync_corrupt && snapshot.verify_state() == Some(VerifyState::Failed);
>
>That would make the checks below a bit easier to read.

Yep, done as well, this makes all the checks easier to read.
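
With the combined flag the checks collapse to roughly this (sketch,
using the `Result`-returning `verify_state()` from above):

    let must_resync_existing = resync_corrupt
        && snapshot
            .verify_state()
            .is_ok_and(|state| state == VerifyState::Failed);

    // manifest unchanged and nothing to repair: only fetch the client log
    if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
        // ...
    }

    // archive already present and intact: verify locally instead of re-pulling
    if !must_resync_existing && path.exists() {
        // ...
    }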

>And would it make sense to check the downloaded manifest's verify state, no point
>in re-pulling if the other side is also corrupt, or is that possibility already
>excluded somewhere else in this code path?

This should be covered, because:
  - if the index is broken, we can't even open it
  - if an unencrypted chunk is corrupted, the CRC is checked when reading
    it (pbs-client/src/remote_chunk_reader.rs:55)
  - if an encrypted chunk is corrupted, the CRC is checked as well
    (pbs-datastore/src/data_blob.rs:81)
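
Both chunk paths boil down to roughly this (from memory, exact
signatures may differ):

    // Sketch: the readers construct a DataBlob from the fetched bytes
    // and verify its CRC before handing the chunk out.
    let chunk = DataBlob::from_raw(raw_data)?;
    chunk.verify_crc()?; // errors out here if the chunk is corrupt

So a corrupt remote snapshot fails the pull instead of silently
replacing the local one.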

>> +
>>      if manifest_name.exists() {
>>          let manifest_blob = proxmox_lang::try_block!({
>>              let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> @@ -365,7 +375,9 @@ async fn pull_snapshot<'a>(
>>              format_err!("unable to read local manifest {manifest_name:?} - {err}")
>>          })?;
>>
>> -        if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() {
>> +        if (manifest_blob.raw_data() == tmp_manifest_blob.raw_data())
>
>unnecessary addition of parentheses; a == b && (!c && !d) would be fine too and
>less noise to read, so to say.

Changed.

>> +            && (!resync_corrupt && !snapshot_is_corrupt)
>
>This is equal to `.. && !(resync_corrupt || snapshot_is_corrupt)`, which makes it
>slightly easier to read for me, and I'm slightly confused as to why it's not
>`.. && !(resync_corrupt && snapshot_is_corrupt)`, i.e., with the new variable above:
>
>if manifest_blob.raw_data() == tmp_manifest_blob.raw_data() && !must_resync_existing {
>    // ...
>}
>
>Or why would the expression need to be different here?

Changed this as well after combining the flags.

>> +        {
>>              if !client_log_name.exists() {
>>                  reader.try_download_client_log(&client_log_name).await?;
>>              };
>> @@ -381,7 +393,7 @@ async fn pull_snapshot<'a>(
>>          let mut path = snapshot.full_path();
>>          path.push(&item.filename);
>>
>> -        if path.exists() {
>> +        if !(resync_corrupt && snapshot_is_corrupt) && path.exists() {
>>              match ArchiveType::from_path(&item.filename)? {
>>                  ArchiveType::DynamicIndex => {
>>                      let index = DynamicIndexReader::open(&path)?;
>> @@ -416,6 +428,10 @@ async fn pull_snapshot<'a>(
>>              }
>>          }
>>
>> +        if resync_corrupt && snapshot_is_corrupt {
>> +            info!("pulling snapshot {} again", snapshot.dir());
>
>maybe include the reason for re-pulling here, e.g. something like:
>
>"re-syncing snapshot {} due to bad verification result"
>
>And wouldn't that get logged multiple times for backups with multiple archives,
>e.g. guests with multiple disks? It probably makes more sense to log that before
>the for loop.

Yep, you're right, moved it in front of the loop!
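
It now sits before the archive loop, roughly:

    if must_resync_existing {
        info!(
            "re-syncing snapshot {} due to bad verification result",
            snapshot.dir()
        );
    }

    for item in manifest.files() {
        // ... pull the individual archives ...
    }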

>> +        }
>> +
>>          let stats =
>>              pull_single_archive(reader.clone(), snapshot, item, downloaded_chunks.clone()).await?;
>>          sync_stats.add(stats);
>> @@ -443,6 +459,7 @@ async fn pull_snapshot_from<'a>(
>>      reader: Arc<dyn SyncSourceReader + 'a>,
>>      snapshot: &'a pbs_datastore::BackupDir,
>>      downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> +    resync_corrupt: bool,
>>  ) -> Result<SyncStats, Error> {
>>      let (_path, is_new, _snap_lock) = snapshot
>>          .datastore()
>> @@ -451,7 +468,7 @@ async fn pull_snapshot_from<'a>(
>>      let sync_stats = if is_new {
>>          info!("sync snapshot {}", snapshot.dir());
>>
>> -        match pull_snapshot(reader, snapshot, downloaded_chunks).await {
>> +        match pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await {
>>              Err(err) => {
>>                  if let Err(cleanup_err) = snapshot.datastore().remove_backup_dir(
>>                      snapshot.backup_ns(),
>> @@ -469,7 +486,7 @@ async fn pull_snapshot_from<'a>(
>>          }
>>      } else {
>>          info!("re-sync snapshot {}", snapshot.dir());
>> -        pull_snapshot(reader, snapshot, downloaded_chunks).await?
>> +        pull_snapshot(reader, snapshot, downloaded_chunks, resync_corrupt).await?
>>      };
>>
>>      Ok(sync_stats)
>> @@ -528,21 +545,24 @@ async fn pull_group(
>>          .enumerate()
>>          .filter(|&(pos, ref dir)| {
>>              source_snapshots.insert(dir.time);
>> -            if last_sync_time > dir.time {
>> -                already_synced_skip_info.update(dir.time);
>> -                return false;
>> -            } else if already_synced_skip_info.count > 0 {
>> -                info!("{already_synced_skip_info}");
>> -                already_synced_skip_info.reset();
>> -                return true;
>> -            }
>> +            // If resync_corrupt is set, we go through all the remote snapshots
>> +            if !params.resync_corrupt {
>
>maybe reverse that and use an explicit return? i.e.:
>
>if params.resync_corrupt {
>    return true; // always check all remote snapshots when resyncing
>}
>
>// rest of the filter
>
>would avoid an indentation level for the non-bypass path and be slightly easier
>to read, at least for me

True, fixed this as well.
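
The filter now starts with the bypass, roughly:

    .filter(|&(pos, ref dir)| {
        source_snapshots.insert(dir.time);
        if params.resync_corrupt {
            // always check all remote snapshots when resyncing
            return true;
        }
        if last_sync_time > dir.time {
            already_synced_skip_info.update(dir.time);
            return false;
        } else if already_synced_skip_info.count > 0 {
            info!("{already_synced_skip_info}");
            already_synced_skip_info.reset();
            return true;
        }
        // ... transfer-last handling using `pos` ...
    })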


Thanks for the review!


Thread overview: 9+ messages
2024-10-15 13:18 [pbs-devel] [PATCH proxmox-backup 0/3] fix #3786: resync corrupt chunks in sync-job Gabriel Goller
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 1/3] fix #3786: api: add resync-corrupt option to sync jobs Gabriel Goller
2024-10-17  6:15   ` Thomas Lamprecht
2024-10-17 13:20     ` Gabriel Goller [this message]
2024-10-17 14:09       ` Thomas Lamprecht
2024-10-18  9:08         ` Gabriel Goller
2024-10-21  8:16       ` Lukas Wagner
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 2/3] fix #3786: ui/cli: add resync-corrupt option on sync-jobs Gabriel Goller
2024-10-15 13:18 ` [pbs-devel] [PATCH proxmox-backup 3/3] fix #3786: docs: add resync-corrupt option to sync-job Gabriel Goller
