public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling
@ 2020-09-07 15:30 Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 1/5] verify: fix log units Stefan Reiter
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

Verify will now rename chunks it detects as corrupted, so future backups will be
forced to write them. The next GC will then clean these ".bad" files up, since
it has to scan each chunk directory anyway.

If the last backup uses some of these chunks but is not the one that failed
verification, the client may still omit them, which could lead to a broken
backup. Patch 4 detects these cases by checking all referenced chunks for
existence (which adds a bit of overhead, especially to otherwise minimal
dirty-bitmap backups).

Additionally, the last patch makes sure all chunks have their atime updated,
even if they won't be written (when they already exist), to eliminate a race
with GC where the chunk might be missing after a successful backup.

Also, friendly ping on:
https://lists.proxmox.com/pipermail/pbs-devel/2020-September/000479.html
This series makes the most sense with that patch already applied; the feedback
from Dominik is addressed here.

v2:
* address Thomas' feedback (as well as Dietmar's comment)
* add patch 5


proxmox-backup: Stefan Reiter (5):
  verify: fix log units
  verify: rename corrupted chunks with .bad extension
  gc: remove .bad files on garbage collect
  backup: check all referenced chunks actually exist
  backup: touch all chunks, even if they exist

 src/api2/backup/environment.rs | 21 +++++++++-
 src/api2/types/mod.rs          |  3 ++
 src/backup/chunk_store.rs      | 73 ++++++++++++++++++++++++++++------
 src/backup/datastore.rs        |  5 ++-
 src/backup/verify.rs           | 34 +++++++++++++++-
 5 files changed, 120 insertions(+), 16 deletions(-)

-- 
2.20.1





* [pbs-devel] [PATCH v2 proxmox-backup 1/5] verify: fix log units
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
@ 2020-09-07 15:30 ` Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension Stefan Reiter
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 src/backup/verify.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index b65be9c9..1c848d57 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -163,7 +163,7 @@ fn verify_index_chunks(
 
     let error_count = errors.load(Ordering::SeqCst);
 
-    worker.log(format!("  verified {:.2}/{:.2} Mib in {:.2} seconds, speed {:.2}/{:.2} Mib/s ({} errors)",
+    worker.log(format!("  verified {:.2}/{:.2} MiB in {:.2} seconds, speed {:.2}/{:.2} MiB/s ({} errors)",
                        read_bytes_mib, decoded_bytes_mib, elapsed, read_speed, decode_speed, error_count));
 
     if errors.load(Ordering::SeqCst) > 0 {
-- 
2.20.1






* [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 1/5] verify: fix log units Stefan Reiter
@ 2020-09-07 15:30 ` Stefan Reiter
  2020-09-08 10:27   ` Dietmar Maurer
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 3/5] gc: remove .bad files on garbage collect Stefan Reiter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

This ensures that following backups will always upload the chunk,
thereby replacing it with a correct version again.

The rename format is <digest>.<counter>.bad, where <counter> is incremented
if a chunk is found to be bad again before a GC cleans it up.

Care has been taken to rename a chunk only in conditions where the
error is guaranteed to be in the chunk itself. Otherwise a broken index
file could lead to an unwanted mass rename of chunks.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 src/backup/verify.rs | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 1c848d57..697ed236 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -39,6 +39,34 @@ fn verify_blob(datastore: Arc<DataStore>, backup_dir: &BackupDir, info: &FileInf
     }
 }
 
+fn rename_corrupted_chunk(
+    datastore: Arc<DataStore>,
+    digest: &[u8;32],
+    worker: Arc<WorkerTask>,
+) {
+    let (path, digest_str) = datastore.chunk_path(digest);
+
+    let mut counter = 0;
+    let mut new_path = path.clone();
+    new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+    while new_path.exists() && counter < 9 {
+        counter += 1;
+        new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+    }
+
+    match std::fs::rename(&path, &new_path) {
+        Ok(_) => {
+            worker.log(format!("corrupted chunk renamed to {:?}", &new_path));
+        },
+        Err(err) => {
+            match err.kind() {
+                std::io::ErrorKind::NotFound => { /* ignored */ },
+                _ => worker.log(format!("could not rename corrupted chunk {:?} - {}", &path, err))
+            }
+        }
+    };
+}
+
 // We use a separate thread to read/load chunks, so that we can do
 // load and verify in parallel to increase performance.
 fn chunk_reader_thread(
@@ -73,6 +101,7 @@ fn chunk_reader_thread(
                     corrupt_chunks.lock().unwrap().insert(info.digest);
                     worker.log(format!("can't verify chunk, load failed - {}", err));
                     errors.fetch_add(1, Ordering::SeqCst);
+                    rename_corrupted_chunk(datastore.clone(), &info.digest, worker.clone());
                     continue;
                 }
                 Ok(chunk) => {
@@ -101,7 +130,7 @@ fn verify_index_chunks(
     let start_time = Instant::now();
 
     let chunk_channel = chunk_reader_thread(
-        datastore,
+        datastore.clone(),
         index,
         verified_chunks.clone(),
         corrupt_chunks.clone(),
@@ -148,6 +177,7 @@ fn verify_index_chunks(
             corrupt_chunks.lock().unwrap().insert(digest);
             worker.log(format!("{}", err));
             errors.fetch_add(1, Ordering::SeqCst);
+            rename_corrupted_chunk(datastore.clone(), &digest, worker.clone());
         } else {
             verified_chunks.lock().unwrap().insert(digest);
         }
-- 
2.20.1






* [pbs-devel] [PATCH v2 proxmox-backup 3/5] gc: remove .bad files on garbage collect
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 1/5] verify: fix log units Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension Stefan Reiter
@ 2020-09-07 15:30 ` Stefan Reiter
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist Stefan Reiter
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

The iterator returned by get_chunk_iterator now yields a third tuple
element indicating whether the current file is a regular chunk (false)
or a .bad file (true).

Their sizes count towards the total of removed bytes, since deleting
them also frees disk space.

.bad files are only deleted if the corresponding chunk exists, i.e. has
been rewritten. Otherwise we might delete data that was only marked bad
because of a transient error.

While at it, also clean up and use nix::unistd::unlinkat instead of
unsafe libc calls.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---

v2:
* only remove .bad files if corresponding chunk exists
* only consider .bad files that start with a valid digest
* improve log messages
* use safe unlinkat
* only count actually removed chunks

 src/api2/types/mod.rs     |  3 ++
 src/backup/chunk_store.rs | 72 ++++++++++++++++++++++++++++++++-------
 src/backup/datastore.rs   |  5 ++-
 3 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/src/api2/types/mod.rs b/src/api2/types/mod.rs
index 6854fdf0..e29a7e37 100644
--- a/src/api2/types/mod.rs
+++ b/src/api2/types/mod.rs
@@ -559,6 +559,8 @@ pub struct GarbageCollectionStatus {
     pub pending_bytes: u64,
     /// Number of pending chunks (pending removal - kept for safety).
     pub pending_chunks: usize,
+    /// Number of chunks marked as .bad by verify that have been removed by GC.
+    pub removed_bad: usize,
 }
 
 impl Default for GarbageCollectionStatus {
@@ -573,6 +575,7 @@ impl Default for GarbageCollectionStatus {
             removed_chunks: 0,
             pending_bytes: 0,
             pending_chunks: 0,
+            removed_bad: 0,
         }
     }
 }
diff --git a/src/backup/chunk_store.rs b/src/backup/chunk_store.rs
index e1da5a8a..abe1d67f 100644
--- a/src/backup/chunk_store.rs
+++ b/src/backup/chunk_store.rs
@@ -187,7 +187,7 @@ impl ChunkStore {
     pub fn get_chunk_iterator(
         &self,
     ) -> Result<
-        impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize)> + std::iter::FusedIterator,
+        impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize, bool)> + std::iter::FusedIterator,
         Error
     > {
         use nix::dir::Dir;
@@ -219,19 +219,21 @@ impl ChunkStore {
                         Some(Ok(entry)) => {
                             // skip files if they're not a hash
                             let bytes = entry.file_name().to_bytes();
-                            if bytes.len() != 64 {
+                            if bytes.len() != 64 && bytes.len() != 64 + ".0.bad".len() {
                                 continue;
                             }
-                            if !bytes.iter().all(u8::is_ascii_hexdigit) {
+                            if !bytes.iter().take(64).all(u8::is_ascii_hexdigit) {
                                 continue;
                             }
-                            return Some((Ok(entry), percentage));
+
+                            let bad = bytes.ends_with(".bad".as_bytes());
+                            return Some((Ok(entry), percentage, bad));
                         }
                         Some(Err(err)) => {
                             // stop after first error
                             done = true;
                             // and pass the error through:
-                            return Some((Err(err), percentage));
+                            return Some((Err(err), percentage, false));
                         }
                         None => (), // open next directory
                     }
@@ -261,7 +263,7 @@ impl ChunkStore {
                         // other errors are fatal, so end our iteration
                         done = true;
                         // and pass the error through:
-                        return Some((Err(format_err!("unable to read subdir '{}' - {}", subdir, err)), percentage));
+                        return Some((Err(format_err!("unable to read subdir '{}' - {}", subdir, err)), percentage, false));
                     }
                 }
             }
@@ -280,6 +282,7 @@ impl ChunkStore {
         worker: &WorkerTask,
     ) -> Result<(), Error> {
         use nix::sys::stat::fstatat;
+        use nix::unistd::{unlinkat, UnlinkatFlags};
 
         let mut min_atime = phase1_start_time - 3600*24; // at least 24h (see mount option relatime)
 
@@ -292,7 +295,7 @@ impl ChunkStore {
         let mut last_percentage = 0;
         let mut chunk_count = 0;
 
-        for (entry, percentage) in self.get_chunk_iterator()? {
+        for (entry, percentage, bad) in self.get_chunk_iterator()? {
             if last_percentage != percentage {
                 last_percentage = percentage;
                 worker.log(format!("percentage done: phase2 {}% (processed {} chunks)", percentage, chunk_count));
@@ -321,14 +324,59 @@ impl ChunkStore {
             let lock = self.mutex.lock();
 
             if let Ok(stat) = fstatat(dirfd, filename, nix::fcntl::AtFlags::AT_SYMLINK_NOFOLLOW) {
-                if stat.st_atime < min_atime {
+                if bad {
+                    match std::ffi::CString::new(&filename.to_bytes()[..64]) {
+                        Ok(orig_filename) => {
+                            match fstatat(
+                                dirfd,
+                                orig_filename.as_c_str(),
+                                nix::fcntl::AtFlags::AT_SYMLINK_NOFOLLOW)
+                            {
+                                Ok(_) => { /* do nothing */ },
+                                Err(nix::Error::Sys(nix::errno::Errno::ENOENT)) => {
+                                    // chunk hasn't been rewritten yet, keep
+                                    // .bad file around for manual recovery
+                                    continue;
+                                },
+                                Err(err) => {
+                                    // some other error, warn user and keep
+                                    // .bad file around too
+                                    worker.warn(format!(
+                                        "error during stat on '{:?}' - {}",
+                                        orig_filename,
+                                        err,
+                                    ));
+                                    continue;
+                                }
+                            }
+                        },
+                        Err(err) => {
+                            worker.warn(format!(
+                                "could not get original filename from .bad file '{:?}' - {}",
+                                filename,
+                                err,
+                            ));
+                            continue;
+                        }
+                    }
+
+                    if let Err(err) = unlinkat(Some(dirfd), filename, UnlinkatFlags::NoRemoveDir) {
+                        worker.warn(format!(
+                            "unlinking corrupt chunk {:?} failed on store '{}' - {}",
+                            filename,
+                            self.name,
+                            err,
+                        ));
+                    } else {
+                        status.removed_bad += 1;
+                        status.removed_bytes += stat.st_size as u64;
+                    }
+                } else if stat.st_atime < min_atime {
                     //let age = now - stat.st_atime;
                     //println!("UNLINK {}  {:?}", age/(3600*24), filename);
-                    let res = unsafe { libc::unlinkat(dirfd, filename.as_ptr(), 0) };
-                    if res != 0 {
-                        let err = nix::Error::last();
+                    if let Err(err) = unlinkat(Some(dirfd), filename, UnlinkatFlags::NoRemoveDir) {
                         bail!(
-                            "unlink chunk {:?} failed on store '{}' - {}",
+                            "unlinking chunk {:?} failed on store '{}' - {}",
                             filename,
                             self.name,
                             err,
diff --git a/src/backup/datastore.rs b/src/backup/datastore.rs
index 42866e38..ebe47487 100644
--- a/src/backup/datastore.rs
+++ b/src/backup/datastore.rs
@@ -85,7 +85,7 @@ impl DataStore {
     pub fn get_chunk_iterator(
         &self,
     ) -> Result<
-        impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize)>,
+        impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize, bool)>,
         Error
     > {
         self.chunk_store.get_chunk_iterator()
@@ -495,6 +495,9 @@ impl DataStore {
             if gc_status.pending_bytes > 0 {
                 worker.log(&format!("Pending removals: {} (in {} chunks)", HumanByte::from(gc_status.pending_bytes), gc_status.pending_chunks));
             }
+            if gc_status.removed_bad > 0 {
+                worker.log(&format!("Removed bad files: {}", gc_status.removed_bad));
+            }
 
             worker.log(&format!("Original data usage: {}", HumanByte::from(gc_status.index_data_bytes)));
 
-- 
2.20.1






* [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
                   ` (2 preceding siblings ...)
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 3/5] gc: remove .bad files on garbage collect Stefan Reiter
@ 2020-09-07 15:30 ` Stefan Reiter
  2020-09-08 10:49   ` Dietmar Maurer
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 5/5] backup: touch all chunks, even if they exist Stefan Reiter
  2020-09-08 10:51 ` [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Dietmar Maurer
  5 siblings, 1 reply; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

A client can omit uploading chunks in the "known_chunks" list; those
then also won't be written on the server side. Check all chunks that are
mentioned in the index but were not uploaded for existence, and report
an error if they don't exist instead of marking a potentially broken
backup as "successful".

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 src/api2/backup/environment.rs | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index 973563d3..df22b1d6 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -1,6 +1,6 @@
 use anyhow::{bail, format_err, Error};
 use std::sync::{Arc, Mutex};
-use std::collections::HashMap;
+use std::collections::{HashMap, HashSet};
 
 use ::serde::{Serialize};
 use serde_json::{json, Value};
@@ -73,6 +73,7 @@ struct SharedBackupState {
     dynamic_writers: HashMap<usize, DynamicWriterState>,
     fixed_writers: HashMap<usize, FixedWriterState>,
     known_chunks: HashMap<[u8;32], u32>,
+    touched_chunks: HashSet<[u8;32]>,
     backup_size: u64, // sums up size of all files
     backup_stat: UploadStatistic,
 }
@@ -126,6 +127,7 @@ impl BackupEnvironment {
             dynamic_writers: HashMap::new(),
             fixed_writers: HashMap::new(),
             known_chunks: HashMap::new(),
+            touched_chunks: HashSet::new(),
             backup_size: 0,
             backup_stat: UploadStatistic::new(),
         };
@@ -196,6 +198,7 @@ impl BackupEnvironment {
 
         // register chunk
         state.known_chunks.insert(digest, size);
+        state.touched_chunks.insert(digest);
 
         Ok(())
     }
@@ -229,6 +232,7 @@ impl BackupEnvironment {
 
         // register chunk
         state.known_chunks.insert(digest, size);
+        state.touched_chunks.insert(digest);
 
         Ok(())
     }
@@ -490,6 +494,21 @@ impl BackupEnvironment {
             }
         }
 
+        // make sure all chunks that were referenced actually exist
+        for (digest, _) in state.known_chunks.iter() {
+            // if they were uploaded just now they have already been touched
+            if state.touched_chunks.contains(digest) {
+                continue;
+            }
+
+            if !self.datastore.chunk_path(digest).0.exists() {
+                bail!(
+                    "chunk '{}' was attempted to be reused but doesn't exist",
+                    digest_to_hex(digest)
+                );
+            }
+        }
+
         // marks the backup as successful
         state.finished = true;
 
-- 
2.20.1






* [pbs-devel] [PATCH v2 proxmox-backup 5/5] backup: touch all chunks, even if they exist
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
                   ` (3 preceding siblings ...)
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist Stefan Reiter
@ 2020-09-07 15:30 ` Stefan Reiter
  2020-09-08 10:51 ` [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Dietmar Maurer
  5 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 15:30 UTC (permalink / raw)
  To: pbs-devel

We need to update the atime of chunk files if they already exist;
otherwise a concurrently running GC could sweep them away.

This is protected by ChunkStore.mutex, so the fstat/unlink in GC does
not race with the touch.
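
A rough idea of what such a touch could look like, purely as a sketch (the
actual ChunkStore::touch_chunk implementation is not shown in this series,
and this assumes a nix version that exposes utimensat):

use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

use anyhow::Error;
use nix::sys::stat::{utimensat, UtimensatFlags};
use nix::sys::time::{TimeSpec, TimeValLike};

// Sketch only: set both atime and mtime of a chunk file to "now", so GC
// phase2 (which compares atime against its cutoff) keeps the chunk.
fn touch_file(path: &Path) -> Result<(), Error> {
    let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs() as i64;
    let times = TimeSpec::seconds(now);
    utimensat(None, path, &times, &times, UtimensatFlags::FollowSymlink)?;
    Ok(())
}

An explicit update is needed because with relatime a plain read does not
reliably refresh atime (compare the 24h grace period in GC phase2).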

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---

New in v2.

 src/backup/chunk_store.rs | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backup/chunk_store.rs b/src/backup/chunk_store.rs
index abe1d67f..1d9de70a 100644
--- a/src/backup/chunk_store.rs
+++ b/src/backup/chunk_store.rs
@@ -414,6 +414,7 @@ impl ChunkStore {
 
         if let Ok(metadata) = std::fs::metadata(&chunk_path) {
             if metadata.is_file() {
+                self.touch_chunk(digest)?;
                 return Ok((true, metadata.len()));
             } else {
                 bail!("Got unexpected file type on store '{}' for chunk {}", self.name, digest_str);
-- 
2.20.1






* Re: [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension Stefan Reiter
@ 2020-09-08 10:27   ` Dietmar Maurer
  0 siblings, 0 replies; 9+ messages in thread
From: Dietmar Maurer @ 2020-09-08 10:27 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Stefan Reiter

possible improvement to avoid duplicate format!():

diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 697ed23..a96c572 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -48,10 +48,9 @@ fn rename_corrupted_chunk(
 
     let mut counter = 0;
     let mut new_path = path.clone();
-    new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
-    while new_path.exists() && counter < 9 {
-        counter += 1;
+    loop {
         new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+        if new_path.exists() && counter < 9 { counter += 1; } else { break; }
     }
 
     match std::fs::rename(&path, &new_path) {

> On 09/07/2020 5:30 PM Stefan Reiter <s.reiter@proxmox.com> wrote:
> 
>  
> This ensures that following backups will always upload the chunk,
> thereby replacing it with a correct version again.
> 
> The rename format is <digest>.<counter>.bad, where <counter> is incremented
> if a chunk is found to be bad again before a GC cleans it up.
> 
> Care has been taken to rename a chunk only in conditions where the
> error is guaranteed to be in the chunk itself. Otherwise a broken index
> file could lead to an unwanted mass rename of chunks.
> 
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>  src/backup/verify.rs | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 1c848d57..697ed236 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -39,6 +39,34 @@ fn verify_blob(datastore: Arc<DataStore>, backup_dir: &BackupDir, info: &FileInf
>      }
>  }
>  
> +fn rename_corrupted_chunk(
> +    datastore: Arc<DataStore>,
> +    digest: &[u8;32],
> +    worker: Arc<WorkerTask>,
> +) {
> +    let (path, digest_str) = datastore.chunk_path(digest);
> +
> +    let mut counter = 0;
> +    let mut new_path = path.clone();
> +    new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
> +    while new_path.exists() && counter < 9 {
> +        counter += 1;
> +        new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
> +    }
> +
> +    match std::fs::rename(&path, &new_path) {
> +        Ok(_) => {
> +            worker.log(format!("corrupted chunk renamed to {:?}", &new_path));
> +        },
> +        Err(err) => {
> +            match err.kind() {
> +                std::io::ErrorKind::NotFound => { /* ignored */ },
> +                _ => worker.log(format!("could not rename corrupted chunk {:?} - {}", &path, err))
> +            }
> +        }
> +    };
> +}
> +
>  // We use a separate thread to read/load chunks, so that we can do
>  // load and verify in parallel to increase performance.
>  fn chunk_reader_thread(
> @@ -73,6 +101,7 @@ fn chunk_reader_thread(
>                      corrupt_chunks.lock().unwrap().insert(info.digest);
>                      worker.log(format!("can't verify chunk, load failed - {}", err));
>                      errors.fetch_add(1, Ordering::SeqCst);
> +                    rename_corrupted_chunk(datastore.clone(), &info.digest, worker.clone());
>                      continue;
>                  }
>                  Ok(chunk) => {
> @@ -101,7 +130,7 @@ fn verify_index_chunks(
>      let start_time = Instant::now();
>  
>      let chunk_channel = chunk_reader_thread(
> -        datastore,
> +        datastore.clone(),
>          index,
>          verified_chunks.clone(),
>          corrupt_chunks.clone(),
> @@ -148,6 +177,7 @@ fn verify_index_chunks(
>              corrupt_chunks.lock().unwrap().insert(digest);
>              worker.log(format!("{}", err));
>              errors.fetch_add(1, Ordering::SeqCst);
> +            rename_corrupted_chunk(datastore.clone(), &digest, worker.clone());
>          } else {
>              verified_chunks.lock().unwrap().insert(digest);
>          }
> -- 
> 2.20.1
> 
> 
> 





* Re: [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist Stefan Reiter
@ 2020-09-08 10:49   ` Dietmar Maurer
  0 siblings, 0 replies; 9+ messages in thread
From: Dietmar Maurer @ 2020-09-08 10:49 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Stefan Reiter


>  
>  use ::serde::{Serialize};
>  use serde_json::{json, Value};
> @@ -73,6 +73,7 @@ struct SharedBackupState {
>      dynamic_writers: HashMap<usize, DynamicWriterState>,
>      fixed_writers: HashMap<usize, FixedWriterState>,
>      known_chunks: HashMap<[u8;32], u32>,
> +    touched_chunks: HashSet<[u8;32]>,
>      backup_size: u64, // sums up size of all files
>      backup_stat: UploadStatistic,

Instead of touched_chunks, can we store that info inside known_chunks?

-      known_chunks: HashMap<[u8;32], u32>,
+      known_chunks: HashMap<[u8;32], (u32, bool)>,
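
A minimal sketch, for illustration only, of how such a merged map could be
consumed in the final check (hypothetical names; chunk_exists stands in for
the datastore lookup and is not part of this series):

use std::collections::HashMap;

// Sketch: known_chunks carries (size, touched-in-this-session) per digest,
// replacing the separate touched_chunks HashSet.
fn check_referenced_chunks(
    known_chunks: &HashMap<[u8; 32], (u32, bool)>,
    chunk_exists: impl Fn(&[u8; 32]) -> bool,
) -> Result<(), String> {
    for (digest, (_size, touched)) in known_chunks {
        // chunks uploaded during this backup were just written, nothing to check
        if *touched {
            continue;
        }
        // reused chunks must still exist on disk
        if !chunk_exists(digest) {
            return Err(format!("chunk {:02x?} was reused but does not exist", digest));
        }
    }
    Ok(())
}

The trade-off is a single map instead of two, at the cost of inserting
(size, true/false) tuples at the chunk registration sites.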





* Re: [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling
  2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
                   ` (4 preceding siblings ...)
  2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 5/5] backup: touch all chunks, even if they exist Stefan Reiter
@ 2020-09-08 10:51 ` Dietmar Maurer
  5 siblings, 0 replies; 9+ messages in thread
From: Dietmar Maurer @ 2020-09-08 10:51 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Stefan Reiter

applied 0, 1, 2, 3 and 5

patch 4/5 needs some improvements.

> On 09/07/2020 5:30 PM Stefan Reiter <s.reiter@proxmox.com> wrote:
> 
>  
> Verify will now rename chunks it detects as corrupted, so future backups will be
> forced to write them. The next GC will then clean these ".bad" files up, since
> it has to scan each chunk directory anyway.
> 
> If the last backup uses some of these chunks but is not the one that failed
> verification, the client may still omit them, which could lead to a broken
> backup. Patch 4 detects these cases by checking all referenced chunks for
> existence (which adds a bit of overhead, especially to otherwise minimal
> dirty-bitmap backups).
> 
> Additionally, the last patch makes sure all chunks have their atime updated,
> even if they won't be written (when they already exist), to eliminate a race
> with GC where the chunk might be missing after a successful backup.
> 
> Also, friendly ping on:
> https://lists.proxmox.com/pipermail/pbs-devel/2020-September/000479.html
> This series makes the most sense with that patch already applied; the feedback
> from Dominik is addressed here.
> 
> v2:
> * address Thomas' feedback (as well as Dietmar's comment)
> * add patch 5
> 
> 
> proxmox-backup: Stefan Reiter (5):
>   verify: fix log units
>   verify: rename corrupted chunks with .bad extension
>   gc: remove .bad files on garbage collect
>   backup: check all referenced chunks actually exist
>   backup: touch all chunks, even if they exist
> 
>  src/api2/backup/environment.rs | 21 +++++++++-
>  src/api2/types/mod.rs          |  3 ++
>  src/backup/chunk_store.rs      | 73 ++++++++++++++++++++++++++++------
>  src/backup/datastore.rs        |  5 ++-
>  src/backup/verify.rs           | 34 +++++++++++++++-
>  5 files changed, 120 insertions(+), 16 deletions(-)
> 
> -- 
> 2.20.1
> 
> 





Thread overview: 9+ messages
2020-09-07 15:30 [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Stefan Reiter
2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 1/5] verify: fix log units Stefan Reiter
2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 2/5] verify: rename corrupted chunks with .bad extension Stefan Reiter
2020-09-08 10:27   ` Dietmar Maurer
2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 3/5] gc: remove .bad files on garbage collect Stefan Reiter
2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 4/5] backup: check all referenced chunks actually exist Stefan Reiter
2020-09-08 10:49   ` Dietmar Maurer
2020-09-07 15:30 ` [pbs-devel] [PATCH v2 proxmox-backup 5/5] backup: touch all chunks, even if they exist Stefan Reiter
2020-09-08 10:51 ` [pbs-devel] [PATCH v2 0/5] Improve corrupt chunk handling Dietmar Maurer
