public inbox for pbs-devel@lists.proxmox.com
From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [PATCH v5 proxmox-backup 26/28] pxar: add heuristic to reduce reused chunk fragmentation
Date: Wed, 15 Nov 2023 16:48:11 +0100
Message-ID: <20231115154813.281564-27-c.ebner@proxmox.com>
In-Reply-To: <20231115154813.281564-1-c.ebner@proxmox.com>

Over multiple consecutive runs with metadata based file change detection,
referencing existing chunks can lead to fragmentation and an increased
size of the index file.

To reduce this, check what fraction of a file's chunks was already
referenced by the previous run, and re-chunk the file if that fraction
falls below a constant threshold.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
Changes since version 4:
- no changes

Changes since version 3:
- not present in version 3
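
Note for reviewers: the decision described in the commit message can be
sketched in isolation as below. This is only an illustration under stated
assumptions, not the code from this patch: `should_reuse`, `chunk_sizes`
and `THRESHOLD` are hypothetical names, the chunk sizes are passed in
directly instead of being derived from the index, and the map is assumed
to hold offset -> size pairs of everything the previous run referenced.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::Included;

/// Hypothetical stand-in for the FRAGMENTATION_THRESHOLD constant.
const THRESHOLD: f64 = 0.75;

/// Decide whether the chunks of an unchanged file should be referenced again.
///
/// `chunk_sizes`: sizes of the chunks the file's payload spans (assumed input).
/// `chunks_start`/`chunks_end`: byte range covered by those chunks,
/// with `chunks_start <= chunks_end`.
/// `ref_offsets`: offset -> size of the entries referenced by the previous run.
fn should_reuse(
    chunk_sizes: &[u64],
    chunks_start: u64,
    chunks_end: u64,
    ref_offsets: &BTreeMap<u64, u64>,
) -> bool {
    // Files spanning more than two chunks are always referenced.
    if chunk_sizes.len() >= 3 {
        return true;
    }

    let chunk_sum: u64 = chunk_sizes.iter().sum();
    if chunk_sum == 0 {
        // Nothing to compare against, so keep the reference.
        return true;
    }

    // Sum the sizes of all previously referenced entries whose offset
    // lies inside the byte range covered by the chunks.
    let refs_sum: u64 = ref_offsets
        .range((Included(chunks_start), Included(chunks_end)))
        .map(|(_, size)| *size)
        .sum();

    // Reuse only if a large enough share of the chunks was referenced.
    refs_sum as f64 / chunk_sum as f64 >= THRESHOLD
}
```

For example, with a single 4096 byte chunk and previously referenced
entries of 3000 and 500 bytes inside its range, the ratio is roughly 0.85
and the chunk is referenced again. With only the 500 byte entry it drops to
roughly 0.12 and the file would be re-chunked.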

 pbs-client/src/pxar/create.rs | 55 +++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 15 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 8d3db5e8..baeba859 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -1,7 +1,8 @@
-use std::collections::{HashMap, HashSet, VecDeque};
+use std::collections::{BTreeMap, HashMap, HashSet, VecDeque};
 use std::ffi::{CStr, CString, OsStr};
 use std::fmt;
 use std::io::{self, Read};
+use std::ops::Bound::Included;
 use std::os::unix::ffi::OsStrExt;
 use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
 use std::path::{Path, PathBuf};
@@ -36,6 +37,7 @@ use crate::pxar::tools::assert_single_path_component;
 use crate::pxar::Flags;
 
 const MAX_FILE_SIZE: u64 = 1024;
+const FRAGMENTATION_THRESHOLD: f64 = 0.75;
 
 /// Pxar options for creating a pxar archive/stream
 #[derive(Default)]
@@ -236,6 +238,7 @@ struct Archiver {
     forced_boundaries: Arc<Mutex<VecDeque<InjectChunks>>>,
     appendix: Appendix,
     prev_appendix: Option<AppendixStartOffset>,
+    ref_offsets: BTreeMap<u64, u64>,
 }
 
 type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
@@ -291,21 +294,23 @@ where
         )?);
     }
 
-    let (appendix_start, prev_cat_parent) = if let Some(ref mut prev_ref) = options.previous_ref {
-        let entry = prev_ref
-            .catalog
-            .lookup_recursive(prev_ref.archive_name.as_bytes())?;
-        let parent = match entry.attr {
-            DirEntryAttribute::Archive { .. } => Some(entry),
-            _ => None,
+    let (appendix_start, prev_cat_parent, ref_offsets) =
+        if let Some(ref mut prev_ref) = options.previous_ref {
+            let entry = prev_ref
+                .catalog
+                .lookup_recursive(prev_ref.archive_name.as_bytes())?;
+            let parent = match entry.attr {
+                DirEntryAttribute::Archive { .. } => Some(entry),
+                _ => None,
+            };
+            let appendix_start = prev_ref
+                .catalog
+                .appendix_offset(prev_ref.archive_name.as_bytes())?;
+            let ref_offsets = prev_ref.catalog.fetch_offsets()?;
+            (appendix_start, parent, ref_offsets)
+        } else {
+            (None, None, BTreeMap::new())
         };
-        let appendix_start = prev_ref
-            .catalog
-            .appendix_offset(prev_ref.archive_name.as_bytes())?;
-        (appendix_start, parent)
-    } else {
-        (None, None)
-    };
 
     let mut archiver = Archiver {
         feature_flags,
@@ -325,6 +330,7 @@ where
         forced_boundaries,
         appendix: Appendix::new(),
         prev_appendix: appendix_start,
+        ref_offsets,
     };
 
     if let Some(ref mut catalog) = archiver.catalog {
@@ -771,6 +777,25 @@ impl Archiver {
         let (indices, start_padding, end_padding) =
             prev_ref.index.indices(start_offset, end_offset)?;
 
+        // Heuristic to reduce chunk fragmentation: a file spanning more than 2 chunks is
+        // always referenced. Otherwise, its chunks are only referenced if at least 3/4 of
+        // their size was referenced in the last backup run; if not, the file is
+        // re-chunked.
+        if indices.len() < 3 {
+            let chunk_sum = indices.iter().fold(0f64, |sum, ind| sum + ind.end() as f64);
+            let chunks_start = start_offset - start_padding;
+            let chunks_end = end_offset + end_padding;
+            let refs_sum = self
+                .ref_offsets
+                .range((Included(chunks_start), Included(chunks_end)))
+                .fold(0f64, |sum, (_, size)| sum + *size as f64);
+
+            // Does not cover the attribute size information
+            if refs_sum / chunk_sum < FRAGMENTATION_THRESHOLD {
+                return Ok(false);
+            }
+        }
+
         let appendix_ref_offset = self.appendix.insert(indices, start_padding);
         let file_name: &Path = OsStr::from_bytes(c_file_name.to_bytes()).as_ref();
         self.add_appendix_ref(encoder, file_name, appendix_ref_offset, file_size)
-- 
2.39.2

Thread overview: 29+ messages
2023-11-15 15:47 [pbs-devel] [PATCH-SERIES v5 pxar proxmox-backup proxmox-widget-toolkit 00/28] fix #3174: improve file-level backup Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 1/28] fix #3174: decoder: factor out skip_bytes from skip_entry Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 2/28] fix #3174: decoder: impl skip_bytes for sync dec Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 3/28] fix #3174: encoder: calc filename + metadata byte size Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 4/28] fix #3174: enc/dec: impl PXAR_APPENDIX_REF entrytype Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 5/28] fix #3174: enc/dec: impl PXAR_APPENDIX entrytype Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 6/28] fix #3174: encoder: helper to add to encoder position Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 7/28] fix #3174: enc/dec: impl PXAR_APPENDIX_TAIL entrytype Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 pxar 8/28] fix #3174: enc/dec: introduce pxar format version 2 Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 09/28] fix #3174: index: add fn index list from start/end-offsets Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 10/28] fix #3174: index: add fn digest for DynamicEntry Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 11/28] fix #3174: api: double catalog upload size Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 12/28] fix #3174: catalog: introduce extended format v2 Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 13/28] fix #3174: archiver/extractor: impl appendix ref Christian Ebner
2023-11-15 15:47 ` [pbs-devel] [PATCH v5 proxmox-backup 14/28] fix #3174: catalog: add specialized Archive entry Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 15/28] fix #3174: extractor: impl seq restore from appendix Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 16/28] fix #3174: archiver: store ref to previous backup Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 17/28] fix #3174: upload stream: impl reused chunk injector Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 18/28] fix #3174: chunker: add forced boundaries Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 19/28] fix #3174: backup writer: inject queued chunk in upload steam Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 20/28] fix #3174: archiver: reuse files with unchanged metadata Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 21/28] fix #3174: specs: add backup detection mode specification Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 22/28] fix #3174: client: Add detection mode to backup creation Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 23/28] test-suite: add detection mode change benchmark Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 24/28] test-suite: Add bin to deb, add shell completions Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 25/28] catalog: fetch offset and size for files and refs Christian Ebner
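2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 26/28] pxar: add heuristic to reduce reused chunk fragmentation Christian Ebner [this message]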
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-backup 27/28] catalog: use format version 2 conditionally Christian Ebner
2023-11-15 15:48 ` [pbs-devel] [PATCH v5 proxmox-widget-toolkit 28/28] file-browser: support pxar archive and fileref types Christian Ebner
