From: Gabriel Goller <g.goller@proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Cc: Proxmox Backup Server development discussion
<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] fix #3786: api: add resync-corrupt option to sync jobs
Date: Mon, 4 Nov 2024 17:02:21 +0100
Message-ID: <hktvuzb35ggshduj4iutuswutxpnostvkd352sedwhlf72cuiq@xomssmwbk2v6>
In-Reply-To: <1730717237.hc5rhqwdje.astroid@yuna.none>
On 04.11.2024 12:51, Fabian Grünbichler wrote:
>this doesn't really do what it says on the tin, see below.
>
>On October 18, 2024 11:09 am, Gabriel Goller wrote:
>> [snip]
>> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
>> index 414ec878d01a..c86fbb7568ab 100644
>> --- a/pbs-datastore/src/backup_info.rs
>> +++ b/pbs-datastore/src/backup_info.rs
>> @@ -8,7 +8,8 @@ use anyhow::{bail, format_err, Error};
>> use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
>>
>> use pbs_api_types::{
>> - Authid, BackupNamespace, BackupType, GroupFilter, BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> + Authid, BackupNamespace, BackupType, GroupFilter, SnapshotVerifyState, VerifyState,
>> + BACKUP_DATE_REGEX, BACKUP_FILE_REGEX,
>> };
>> use pbs_config::{open_backup_lockfile, BackupLockGuard};
>>
>> @@ -583,6 +584,16 @@ impl BackupDir {
>>
>> Ok(())
>> }
>> +
>> + /// Load the verify state from the manifest.
>> + pub fn verify_state(&self) -> Result<VerifyState, anyhow::Error> {
>> + self.load_manifest().and_then(|(m, _)| {
>> + let verify = m.unprotected["verify_state"].clone();
>> + serde_json::from_value::<SnapshotVerifyState>(verify)
>> + .map(|svs| svs.state)
>> + .map_err(Into::into)
>> + })
>> + }
>
>wouldn't it make more sense to have this as a getter for an optional
>SnapshotVerifyState on the BackupManifest?
>
>then it could go into its own commit, other call sites that load the
>verify state from a manifest could be adapted to it, and then this
>commit can also start using it?
You're right, the BackupManifest is actually loaded twice here. I could
turn this function into a getter on BackupManifest and move its
construction (src/server/pull.rs:396):

    let manifest = BackupManifest::try_from(tmp_manifest_blob)?;

up to (src/server/pull.rs:365). Then, instead of the try_block that reads
from the manifest, we'd just call BackupManifest::try_from directly.
I'll see if I can make this work...
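For reference, something like this is what I have in mind for the getter
(just a sketch to illustrate the idea, the exact signature/return type
might still change):

    impl BackupManifest {
        /// Get the verify state stored in the unprotected section, if any.
        pub fn verify_state(&self) -> Result<Option<SnapshotVerifyState>, anyhow::Error> {
            let verify = self.unprotected["verify_state"].clone();
            if verify.is_null() {
                // no verification has been recorded for this snapshot yet
                return Ok(None);
            }
            Ok(Some(serde_json::from_value::<SnapshotVerifyState>(verify)?))
        }
    }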
>also see the comment further below about how the current implementation
>is very noisy if snapshots are newly synced as opposed to resynced..
>
>> }
>>
>> impl AsRef<pbs_api_types::BackupNamespace> for BackupDir {
>> diff --git a/src/api2/config/sync.rs b/src/api2/config/sync.rs
>> index 6fdc69a9e645..fa9db92f3d11 100644
>> --- a/src/api2/config/sync.rs
>> +++ b/src/api2/config/sync.rs
>> [snip]
>> diff --git a/src/server/pull.rs b/src/server/pull.rs
>> index 3117f7d2c960..b2dd15d9d6db 100644
>> --- a/src/server/pull.rs
>> +++ b/src/server/pull.rs
>> @@ -7,12 +7,14 @@ use std::sync::{Arc, Mutex};
>> use std::time::SystemTime;
>>
>> use anyhow::{bail, format_err, Error};
>> +use nom::combinator::verify;
>
>I think this snuck in ;)
Oops, my fault :)
>> [snip]
>> @@ -175,9 +182,10 @@ async fn pull_index_chunks<I: IndexFile>(
>> target.cond_touch_chunk(&info.digest, false)
>> })?;
>> if chunk_exists {
>> - //info!("chunk {} exists {}", pos, hex::encode(digest));
>> + //info!("chunk exists {}", hex::encode(info.digest));
>
>this
>
>> return Ok::<_, Error>(());
>> }
>> +
>
>and this as well?
Yep, removed both, thanks for the heads-up!
>> [snip]
>> @@ -325,13 +333,15 @@ async fn pull_single_archive<'a>(
>> /// - (Re)download the manifest
>> /// -- if it matches, only download log and treat snapshot as already synced
>> /// - Iterate over referenced files
>> -/// -- if file already exists, verify contents
>> +/// -- if file already exists, verify contents or pull again if last
>> +/// verification failed and `resync_corrupt` is true
>> /// -- if not, pull it from the remote
>> /// - Download log if not already existing
>> async fn pull_snapshot<'a>(
>> reader: Arc<dyn SyncSourceReader + 'a>,
>> snapshot: &'a pbs_datastore::BackupDir,
>> downloaded_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>> + resync_corrupt: bool,
>> ) -> Result<SyncStats, Error> {
>> let mut sync_stats = SyncStats::default();
>> let mut manifest_name = snapshot.full_path();
>> @@ -352,6 +362,14 @@ async fn pull_snapshot<'a>(
>> return Ok(sync_stats);
>> }
>>
>
>I think this part here is somewhat wrong ordering wise, or at least,
>unnecessarily expensive..
>
>if resync_corrupt is enabled, we want to (in this order!)
>- check the local snapshot for corruption, if it exists
>- if it is corrupt, we proceed with resyncing
>- if not, we only proceed with resyncing if it is the last snapshot in
> this group, and return early otherwise
>
>that way, we avoid redownloading all the manifests.. but see further
>below for another issue with the current implementation..
>
>> + let must_resync_existing = resync_corrupt
>> + && snapshot
>> + .verify_state()
>> + .inspect_err(|err| {
>> + tracing::error!("Failed to check verification state of snapshot: {err:?}")
>
>2024-11-04T12:34:57+01:00: Failed to check verification state of snapshot: unable to load blob '"/tank/pbs/ns/foobar/ns/test/ns/another_test/vm/900/2023-04-06T14:36:00Z/index.json.blob"' - No such file or directory (os error 2)
>
>this seems to be very noisy for newly synced snapshots, because the
>helper is implemented on BackupInfo instead of on BackupManifest..
True, but I think this is mostly because new backups don't have a
verify_state yet (and a missing verify_state doesn't necessarily mean the
snapshot is bad). Removed the log line as it's just noise anyway :)
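With the log line gone, the check above boils down to something like this
(same names as in the patch; a missing or unreadable verify_state is simply
treated as "not corrupt"):

    let must_resync_existing = resync_corrupt
        && snapshot
            .verify_state()
            .is_ok_and(|state| state == VerifyState::Failed);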
>> + })
>> + .is_ok_and(|state| state == VerifyState::Failed);
>> +
>> if manifest_name.exists() {
>> let manifest_blob = proxmox_lang::try_block!({
>> let mut manifest_file = std::fs::File::open(&manifest_name).map_err(|err| {
>> [snip]
>> @@ -528,6 +554,10 @@ async fn pull_group(
>> .enumerate()
>> .filter(|&(pos, ref dir)| {
>> source_snapshots.insert(dir.time);
>> + // If resync_corrupt is set, we go through all the remote snapshots
>> + if params.resync_corrupt {
>> + return true;
>> + }
>
>alternatively, we could check the local manifest here, and only include
>existing snapshots with a failed verification state, the last one and
>new ones? that way, we'd get more meaningful progress stats as well..
That's true, that would be cleaner. The downside is that we would have to
open/parse the BackupManifest twice.
I could write something like:

    if params.resync_corrupt {
        let local_dir = params.target.store.backup_dir(target_ns.clone(), dir.clone());
        if let Ok(local_dir) = local_dir {
            // verify_state() returns a Result, so a missing or unreadable
            // state simply counts as "not corrupt" here
            if local_dir
                .verify_state()
                .is_ok_and(|state| state == VerifyState::Failed)
            {
                return true;
            }
        }
    }
>because right now, this will not only resync existing corrupt snapshots,
>but also ones that have been pruned locally, but not on the source
>(i.e., the other proposed "fixing" sync mode that syncs "missing"
>old snapshots, not just corrupt ones).
I couldn't find the mail where this was mentioned/discussed, but I'm quite
sure we agreed to just pull both and maybe separate them in a future
iteration/feature. Now that I think about it though, the flag is named
`resync_corrupt`, so I'd expect it to only pull in the corrupt snapshots.
I actually agree with this change, and it's probably also more performant
(reading the local BackupManifest twice is probably faster than pulling
lots of unneeded manifests from the remote).
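To illustrate what I understood the suggestion to be, the filter could then
look roughly like this (purely a sketch; the existing checks further down in
the closure, e.g. the last-snapshot/transfer-last handling that uses `pos`,
are omitted here):

    .filter(|&(pos, ref dir)| {
        source_snapshots.insert(dir.time);
        // new snapshots are always kept
        if dir.time > last_sync_time {
            return true;
        }
        // with resync_corrupt, also keep already-synced snapshots whose
        // local copy failed verification
        if params.resync_corrupt {
            if let Ok(local_dir) = params
                .target
                .store
                .backup_dir(target_ns.clone(), dir.clone())
            {
                if local_dir
                    .verify_state()
                    .is_ok_and(|state| state == VerifyState::Failed)
                {
                    return true;
                }
            }
        }
        already_synced_skip_info.update(dir.time);
        false
    })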
>> if last_sync_time > dir.time {
>> already_synced_skip_info.update(dir.time);
>> return false;
>> @@ -566,7 +596,13 @@ async fn pull_group(
>> .source
>> .reader(source_namespace, &from_snapshot)
>> .await?;
>> - let result = pull_snapshot_from(reader, &to_snapshot, downloaded_chunks.clone()).await;
>> + let result = pull_snapshot_from(
>> + reader,
>> + &to_snapshot,
>> + downloaded_chunks.clone(),
>> + params.resync_corrupt,
>> + )
>> + .await;
>>
>> progress.done_snapshots = pos as u64 + 1;
>> info!("percentage done: {progress}");