* [pbs-devel] [PATCH proxmox-backup 1/4] verify: fix log units
2020-09-03 14:17 [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Stefan Reiter
@ 2020-09-03 14:17 ` Stefan Reiter
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 2/4] verify: rename corrupted chunks with .bad extension Stefan Reiter
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-03 14:17 UTC (permalink / raw)
To: pbs-devel
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
src/backup/verify.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index b65be9c9..1c848d57 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -163,7 +163,7 @@ fn verify_index_chunks(
let error_count = errors.load(Ordering::SeqCst);
- worker.log(format!(" verified {:.2}/{:.2} Mib in {:.2} seconds, speed {:.2}/{:.2} Mib/s ({} errors)",
+ worker.log(format!(" verified {:.2}/{:.2} MiB in {:.2} seconds, speed {:.2}/{:.2} MiB/s ({} errors)",
read_bytes_mib, decoded_bytes_mib, elapsed, read_speed, decode_speed, error_count));
if errors.load(Ordering::SeqCst) > 0 {
--
2.20.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [pbs-devel] [PATCH proxmox-backup 2/4] verify: rename corrupted chunks with .bad extension
2020-09-03 14:17 [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Stefan Reiter
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 1/4] verify: fix log units Stefan Reiter
@ 2020-09-03 14:17 ` Stefan Reiter
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 3/4] gc: remove .bad files on garbage collect Stefan Reiter
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-03 14:17 UTC (permalink / raw)
To: pbs-devel
This ensures that following backups will always upload the chunk,
thereby replacing it with a correct version again.
Format for renaming is <digest>.<counter>.bad where <counter> is used if
a chunk is found to be bad again before a GC cleans it up.
Care has been taken to deliberately only rename a chunk in conditions
where it is guaranteed to be an error in the chunk itself. Otherwise a
broken index file could lead to an unwanted mass-rename of chunks.
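The counter scheme described above can be sketched as a standalone function (a hypothetical model for illustration; the actual patch below implements the same loop inline on a `PathBuf`):

```rust
use std::collections::HashSet;

/// Pick the first free "<digest>.<counter>.bad" name, giving up at counter 9
/// (so a tenth rename would overwrite "<digest>.9.bad" rather than grow forever).
fn bad_chunk_name(digest_str: &str, existing: &HashSet<String>) -> String {
    let mut counter = 0;
    let mut name = format!("{}.{}.bad", digest_str, counter);
    while existing.contains(&name) && counter < 9 {
        counter += 1;
        name = format!("{}.{}.bad", digest_str, counter);
    }
    name
}

fn main() {
    let mut existing = HashSet::new();
    // first rename of this digest
    assert_eq!(bad_chunk_name("abc", &existing), "abc.0.bad");
    // found bad again before GC ran: counter increments
    existing.insert("abc.0.bad".to_string());
    assert_eq!(bad_chunk_name("abc", &existing), "abc.1.bad");
}
```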
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
src/backup/verify.rs | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 1c848d57..697ed236 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -39,6 +39,34 @@ fn verify_blob(datastore: Arc<DataStore>, backup_dir: &BackupDir, info: &FileInf
}
}
+fn rename_corrupted_chunk(
+ datastore: Arc<DataStore>,
+ digest: &[u8;32],
+ worker: Arc<WorkerTask>,
+) {
+ let (path, digest_str) = datastore.chunk_path(digest);
+
+ let mut counter = 0;
+ let mut new_path = path.clone();
+ new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+ while new_path.exists() && counter < 9 {
+ counter += 1;
+ new_path.set_file_name(format!("{}.{}.bad", digest_str, counter));
+ }
+
+ match std::fs::rename(&path, &new_path) {
+ Ok(_) => {
+ worker.log(format!("corrupted chunk renamed to {:?}", &new_path));
+ },
+ Err(err) => {
+ match err.kind() {
+ std::io::ErrorKind::NotFound => { /* ignored */ },
+ _ => worker.log(format!("could not rename corrupted chunk {:?} - {}", &path, err))
+ }
+ }
+ };
+}
+
// We use a separate thread to read/load chunks, so that we can do
// load and verify in parallel to increase performance.
fn chunk_reader_thread(
@@ -73,6 +101,7 @@ fn chunk_reader_thread(
corrupt_chunks.lock().unwrap().insert(info.digest);
worker.log(format!("can't verify chunk, load failed - {}", err));
errors.fetch_add(1, Ordering::SeqCst);
+ rename_corrupted_chunk(datastore.clone(), &info.digest, worker.clone());
continue;
}
Ok(chunk) => {
@@ -101,7 +130,7 @@ fn verify_index_chunks(
let start_time = Instant::now();
let chunk_channel = chunk_reader_thread(
- datastore,
+ datastore.clone(),
index,
verified_chunks.clone(),
corrupt_chunks.clone(),
@@ -148,6 +177,7 @@ fn verify_index_chunks(
corrupt_chunks.lock().unwrap().insert(digest);
worker.log(format!("{}", err));
errors.fetch_add(1, Ordering::SeqCst);
+ rename_corrupted_chunk(datastore.clone(), &digest, worker.clone());
} else {
verified_chunks.lock().unwrap().insert(digest);
}
--
2.20.1
* [pbs-devel] [PATCH proxmox-backup 3/4] gc: remove .bad files on garbage collect
2020-09-03 14:17 [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Stefan Reiter
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 1/4] verify: fix log units Stefan Reiter
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 2/4] verify: rename corrupted chunks with .bad extension Stefan Reiter
@ 2020-09-03 14:17 ` Stefan Reiter
2020-09-04 12:20 ` Thomas Lamprecht
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 4/4] backup: check all referenced chunks actually exist Stefan Reiter
2020-09-03 15:40 ` [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Dietmar Maurer
4 siblings, 1 reply; 9+ messages in thread
From: Stefan Reiter @ 2020-09-03 14:17 UTC (permalink / raw)
To: pbs-devel
The tuple yielded by the get_chunk_iterator iterator is extended with a
third element indicating whether the current file is a regular chunk
(false) or a .bad file (true).
Count the sizes of removed .bad files toward the total of removed bytes,
since deleting them also frees disk space.
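The classification rule can be sketched as a pure function over the file name (a hypothetical standalone version; the patch below applies the same logic to a `ReadDirEntry`):

```rust
/// Classify a file name found in a chunk directory:
/// Some(false) = regular chunk (exactly 64 hex characters),
/// Some(true)  = renamed corrupt chunk (".bad" suffix),
/// None        = anything else, which the iterator skips.
fn classify_entry(name: &str) -> Option<bool> {
    let bytes = name.as_bytes();
    if bytes.len() == 64 && bytes.iter().all(u8::is_ascii_hexdigit) {
        Some(false)
    } else if name.ends_with(".bad") {
        Some(true)
    } else {
        None
    }
}

fn main() {
    assert_eq!(classify_entry(&"a".repeat(64)), Some(false));
    assert_eq!(classify_entry("1234.0.bad"), Some(true));
    assert_eq!(classify_entry("lost+found"), None);
}
```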
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
src/api2/types/mod.rs | 3 +++
src/backup/chunk_store.rs | 43 ++++++++++++++++++++++++++++-----------
src/backup/datastore.rs | 5 ++++-
3 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/src/api2/types/mod.rs b/src/api2/types/mod.rs
index 6854fdf0..e29a7e37 100644
--- a/src/api2/types/mod.rs
+++ b/src/api2/types/mod.rs
@@ -559,6 +559,8 @@ pub struct GarbageCollectionStatus {
pub pending_bytes: u64,
/// Number of pending chunks (pending removal - kept for safety).
pub pending_chunks: usize,
+ /// Number of chunks marked as .bad by verify that have been removed by GC.
+ pub removed_bad: usize,
}
impl Default for GarbageCollectionStatus {
@@ -573,6 +575,7 @@ impl Default for GarbageCollectionStatus {
removed_chunks: 0,
pending_bytes: 0,
pending_chunks: 0,
+ removed_bad: 0,
}
}
}
diff --git a/src/backup/chunk_store.rs b/src/backup/chunk_store.rs
index e1da5a8a..5c2fb29d 100644
--- a/src/backup/chunk_store.rs
+++ b/src/backup/chunk_store.rs
@@ -187,7 +187,7 @@ impl ChunkStore {
pub fn get_chunk_iterator(
&self,
) -> Result<
- impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize)> + std::iter::FusedIterator,
+ impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize, bool)> + std::iter::FusedIterator,
Error
> {
use nix::dir::Dir;
@@ -218,20 +218,26 @@ impl ChunkStore {
match inner.next() {
Some(Ok(entry)) => {
// skip files if they're not a hash
- let bytes = entry.file_name().to_bytes();
- if bytes.len() != 64 {
- continue;
+ let hash = {
+ let bytes = entry.file_name().to_bytes();
+ bytes.len() == 64 && bytes.iter().all(u8::is_ascii_hexdigit)
+ };
+
+ if hash {
+ return Some((Ok(entry), percentage, false));
+ } else if let Ok(name) = entry.file_name().to_str() {
+ if name.ends_with(".bad") {
+ return Some((Ok(entry), percentage, true));
+ }
}
- if !bytes.iter().all(u8::is_ascii_hexdigit) {
- continue;
- }
- return Some((Ok(entry), percentage));
+
+ continue;
}
Some(Err(err)) => {
// stop after first error
done = true;
// and pass the error through:
- return Some((Err(err), percentage));
+ return Some((Err(err), percentage, false));
}
None => (), // open next directory
}
@@ -261,7 +267,7 @@ impl ChunkStore {
// other errors are fatal, so end our iteration
done = true;
// and pass the error through:
- return Some((Err(format_err!("unable to read subdir '{}' - {}", subdir, err)), percentage));
+ return Some((Err(format_err!("unable to read subdir '{}' - {}", subdir, err)), percentage, false));
}
}
}
@@ -292,7 +298,7 @@ impl ChunkStore {
let mut last_percentage = 0;
let mut chunk_count = 0;
- for (entry, percentage) in self.get_chunk_iterator()? {
+ for (entry, percentage, bad) in self.get_chunk_iterator()? {
if last_percentage != percentage {
last_percentage = percentage;
worker.log(format!("percentage done: phase2 {}% (processed {} chunks)", percentage, chunk_count));
@@ -321,7 +327,20 @@ impl ChunkStore {
let lock = self.mutex.lock();
if let Ok(stat) = fstatat(dirfd, filename, nix::fcntl::AtFlags::AT_SYMLINK_NOFOLLOW) {
- if stat.st_atime < min_atime {
+ if bad {
+ let res = unsafe { libc::unlinkat(dirfd, filename.as_ptr(), 0) };
+ if res != 0 {
+ let err = nix::Error::last();
+ worker.warn(format!(
+ "unlink .bad file {:?} failed on store '{}' - {}",
+ filename,
+ self.name,
+ err,
+ ));
+ }
+ status.removed_bad += 1;
+ status.removed_bytes += stat.st_size as u64;
+ } else if stat.st_atime < min_atime {
//let age = now - stat.st_atime;
//println!("UNLINK {} {:?}", age/(3600*24), filename);
let res = unsafe { libc::unlinkat(dirfd, filename.as_ptr(), 0) };
diff --git a/src/backup/datastore.rs b/src/backup/datastore.rs
index 42866e38..ebe47487 100644
--- a/src/backup/datastore.rs
+++ b/src/backup/datastore.rs
@@ -85,7 +85,7 @@ impl DataStore {
pub fn get_chunk_iterator(
&self,
) -> Result<
- impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize)>,
+ impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize, bool)>,
Error
> {
self.chunk_store.get_chunk_iterator()
@@ -495,6 +495,9 @@ impl DataStore {
if gc_status.pending_bytes > 0 {
worker.log(&format!("Pending removals: {} (in {} chunks)", HumanByte::from(gc_status.pending_bytes), gc_status.pending_chunks));
}
+ if gc_status.removed_bad > 0 {
+ worker.log(&format!("Removed bad files: {}", gc_status.removed_bad));
+ }
worker.log(&format!("Original data usage: {}", HumanByte::from(gc_status.index_data_bytes)));
--
2.20.1
* Re: [pbs-devel] [PATCH proxmox-backup 3/4] gc: remove .bad files on garbage collect
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 3/4] gc: remove .bad files on garbage collect Stefan Reiter
@ 2020-09-04 12:20 ` Thomas Lamprecht
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Lamprecht @ 2020-09-04 12:20 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Stefan Reiter
On 03.09.20 16:17, Stefan Reiter wrote:
> diff --git a/src/backup/chunk_store.rs b/src/backup/chunk_store.rs
> index e1da5a8a..5c2fb29d 100644
> --- a/src/backup/chunk_store.rs
> +++ b/src/backup/chunk_store.rs
> @@ -187,7 +187,7 @@ impl ChunkStore {
> pub fn get_chunk_iterator(
> &self,
> ) -> Result<
> - impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize)> + std::iter::FusedIterator,
> + impl Iterator<Item = (Result<tools::fs::ReadDirEntry, Error>, usize, bool)> + std::iter::FusedIterator,
> Error
> > {
> use nix::dir::Dir;
> @@ -218,20 +218,26 @@ impl ChunkStore {
> match inner.next() {
> Some(Ok(entry)) => {
> // skip files if they're not a hash
> - let bytes = entry.file_name().to_bytes();
> - if bytes.len() != 64 {
> - continue;
> + let hash = {
> + let bytes = entry.file_name().to_bytes();
> + bytes.len() == 64 && bytes.iter().all(u8::is_ascii_hexdigit)
why change the check? We do not care about deleting a non-hex-named file,
this is a private, controlled directory after all. I'd avoid doing extra
checks; their costs, even if small, are amplified a lot.
> + };
> +
> + if hash {
> + return Some((Ok(entry), percentage, false));
> + } else if let Ok(name) = entry.file_name().to_str() {
> + if name.ends_with(".bad") {
> + return Some((Ok(entry), percentage, true));
> + }
we only want to remove bad chunks if a good one exists again, else we may
lose information - e.g., if the detected corruption was caused by bad
memory (RAM), not the chunk itself.
> [...]
> @@ -321,7 +327,20 @@ impl ChunkStore {
> let lock = self.mutex.lock();
>
> if let Ok(stat) = fstatat(dirfd, filename, nix::fcntl::AtFlags::AT_SYMLINK_NOFOLLOW) {
> - if stat.st_atime < min_atime {
> + if bad {
> + let res = unsafe { libc::unlinkat(dirfd, filename.as_ptr(), 0) };
don't we have something for this already? Else I'd move this out to a closure
or local function
> + if res != 0 {
> + let err = nix::Error::last();
> + worker.warn(format!(
> + "unlink .bad file {:?} failed on store '{}' - {}",
"unlinking corrupt chunk ...
> + filename,
> + self.name,
> + err,
> + ));
> + }
> + status.removed_bad += 1;
> + status.removed_bytes += stat.st_size as u64;
you count this up even if the unlinkat failed?
> + } else if stat.st_atime < min_atime {
> //let age = now - stat.st_atime;
> //println!("UNLINK {} {:?}", age/(3600*24), filename);
> let res = unsafe { libc::unlinkat(dirfd, filename.as_ptr(), 0) };
* [pbs-devel] [PATCH proxmox-backup 4/4] backup: check all referenced chunks actually exist
2020-09-03 14:17 [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Stefan Reiter
` (2 preceding siblings ...)
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 3/4] gc: remove .bad files on garbage collect Stefan Reiter
@ 2020-09-03 14:17 ` Stefan Reiter
2020-09-03 15:40 ` [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Dietmar Maurer
4 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-03 14:17 UTC (permalink / raw)
To: pbs-devel
A client can omit uploading chunks in the "known_chunks" list; those
then also won't be written on the server side. Check all chunks
mentioned in the index but not uploaded for existence, and report an
error if they don't exist instead of marking a potentially broken backup
as "successful".
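The finish-time check can be modeled as a pure function (hypothetical names: `known` stands for state.known_chunks, `touched` for the new touched_chunks set, and `on_disk` for the chunk_path().exists() probe in the patch below):

```rust
use std::collections::HashSet;

/// Return the reused chunks (known but not freshly uploaded) that are
/// missing on disk; a non-empty result means the backup must be failed.
fn missing_reused_chunks<'a>(
    known: &'a HashSet<String>,
    touched: &HashSet<String>,
    on_disk: &HashSet<String>,
) -> Vec<&'a String> {
    known
        .iter()
        // chunks uploaded just now were touched, so they are trusted
        .filter(|d| !touched.contains(*d) && !on_disk.contains(*d))
        .collect()
}

fn main() {
    let known: HashSet<String> = ["a", "b", "c"].iter().map(|s| s.to_string()).collect();
    let touched: HashSet<String> = ["a"].iter().map(|s| s.to_string()).collect();
    let on_disk: HashSet<String> = ["b"].iter().map(|s| s.to_string()).collect();
    // "c" was reused but exists neither in the upload nor on disk
    assert_eq!(missing_reused_chunks(&known, &touched, &on_disk), vec![&"c".to_string()]);
}
```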
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
src/api2/backup/environment.rs | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index 973563d3..df22b1d6 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -1,6 +1,6 @@
use anyhow::{bail, format_err, Error};
use std::sync::{Arc, Mutex};
-use std::collections::HashMap;
+use std::collections::{HashMap, HashSet};
use ::serde::{Serialize};
use serde_json::{json, Value};
@@ -73,6 +73,7 @@ struct SharedBackupState {
dynamic_writers: HashMap<usize, DynamicWriterState>,
fixed_writers: HashMap<usize, FixedWriterState>,
known_chunks: HashMap<[u8;32], u32>,
+ touched_chunks: HashSet<[u8;32]>,
backup_size: u64, // sums up size of all files
backup_stat: UploadStatistic,
}
@@ -126,6 +127,7 @@ impl BackupEnvironment {
dynamic_writers: HashMap::new(),
fixed_writers: HashMap::new(),
known_chunks: HashMap::new(),
+ touched_chunks: HashSet::new(),
backup_size: 0,
backup_stat: UploadStatistic::new(),
};
@@ -196,6 +198,7 @@ impl BackupEnvironment {
// register chunk
state.known_chunks.insert(digest, size);
+ state.touched_chunks.insert(digest);
Ok(())
}
@@ -229,6 +232,7 @@ impl BackupEnvironment {
// register chunk
state.known_chunks.insert(digest, size);
+ state.touched_chunks.insert(digest);
Ok(())
}
@@ -490,6 +494,21 @@ impl BackupEnvironment {
}
}
+ // make sure all chunks that were referenced actually exist
+ for (digest, _) in state.known_chunks.iter() {
+ // if they were uploaded just now they have already been touched
+ if state.touched_chunks.contains(digest) {
+ continue;
+ }
+
+ if !self.datastore.chunk_path(digest).0.exists() {
+ bail!(
+ "chunk '{}' was attempted to be reused but doesn't exist",
+ digest_to_hex(digest)
+ );
+ }
+ }
+
// marks the backup as successful
state.finished = true;
--
2.20.1
* Re: [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling
2020-09-03 14:17 [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Stefan Reiter
` (3 preceding siblings ...)
2020-09-03 14:17 ` [pbs-devel] [PATCH proxmox-backup 4/4] backup: check all referenced chunks actually exist Stefan Reiter
@ 2020-09-03 15:40 ` Dietmar Maurer
2020-09-03 15:51 ` Dietmar Maurer
4 siblings, 1 reply; 9+ messages in thread
From: Dietmar Maurer @ 2020-09-03 15:40 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Stefan Reiter
> Verify will now rename chunks it detects as corrupted, so future backups will be
> forced to write them. The next GC will then clean these ".bad" files up, since
> it has to scan each chunk directory anyway.
That sounds quite dangerous to me. What about transient storage errors?
* Re: [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling
2020-09-03 15:40 ` [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling Dietmar Maurer
@ 2020-09-03 15:51 ` Dietmar Maurer
2020-09-07 9:31 ` Stefan Reiter
0 siblings, 1 reply; 9+ messages in thread
From: Dietmar Maurer @ 2020-09-03 15:51 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Stefan Reiter
> On 09/03/2020 5:40 PM Dietmar Maurer <dietmar@proxmox.com> wrote:
>
>
> > Verify will now rename chunks it detects as corrupted, so future backups will be
> > forced to write them. The next GC will then clean these ".bad" files up, since
> > it has to scan each chunk directory anyway.
>
> That sounds quite dangerous to me. What about transient storage errors?
Maybe I misread your patch. When do you remove those .bad files exactly? Only when
the client has already uploaded a good chunk?
* Re: [pbs-devel] [PATCH 0/4] Improve corrupt chunk handling
2020-09-03 15:51 ` Dietmar Maurer
@ 2020-09-07 9:31 ` Stefan Reiter
0 siblings, 0 replies; 9+ messages in thread
From: Stefan Reiter @ 2020-09-07 9:31 UTC (permalink / raw)
To: Dietmar Maurer, Proxmox Backup Server development discussion
On 9/3/20 5:51 PM, Dietmar Maurer wrote:
>
>> On 09/03/2020 5:40 PM Dietmar Maurer <dietmar@proxmox.com> wrote:
>>
>>
>>> Verify will now rename chunks it detects as corrupted, so future backups will be
>>> forced to write them. The next GC will then clean these ".bad" files up, since
>>> it has to scan each chunk directory anyway.
>>
>> That sounds quite dangerous to me. What about transient storage errors?
>
> Maybe I miss-read your path. When do you remove those .bad files exactly? Only when
> the client uploaded a good chunk already?
>
Not currently, but I will make it so GC checks for a new good chunk
before deleting the .bad one in v2.
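The planned v2 rule sketched here is an assumption based on the discussion above, not code from this series: GC would only delete "<digest>.<counter>.bad" once a good copy of <digest> exists again.

```rust
/// Decide whether a ".bad" file may be deleted by GC under the proposed
/// v2 rule (hypothetical): only if a good chunk with the same digest
/// exists again. `good_chunk_exists` stands in for a filesystem probe.
fn may_delete_bad_file(bad_name: &str, good_chunk_exists: impl Fn(&str) -> bool) -> bool {
    // recover the digest by stripping ".<counter>.bad"
    match bad_name.find('.') {
        Some(pos) if bad_name.ends_with(".bad") => good_chunk_exists(&bad_name[..pos]),
        _ => false, // not a .bad file, leave it alone
    }
}

fn main() {
    // a good chunk was re-uploaded: safe to delete the .bad copy
    assert!(may_delete_bad_file("abc.0.bad", |digest| digest == "abc"));
    // no good copy yet: keep the .bad file to preserve information
    assert!(!may_delete_bad_file("abc.0.bad", |_| false));
}
```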