public inbox for pbs-devel@lists.proxmox.com
From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [PATCH v5 proxmox-backup 57/62] datastore: chunker: implement chunker for payload stream
Date: Tue,  7 May 2024 17:52:39 +0200
Message-ID: <20240507155244.793819-58-c.ebner@proxmox.com>
In-Reply-To: <20240507155244.793819-1-c.ebner@proxmox.com>

Implement the Chunker trait for a dedicated payload stream chunker,
which extends the regular chunker by the option to suggest boundaries
to be used over the hash-based boundaries whenever possible.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 4:
- fix issue with scan already consuming full buffer, now only scan up
  until suggested boundary
- add more debug log output
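
The decision made for each suggested boundary can be sketched in
isolation as below. This is only a simplified model and not the code
from the patch: the `Decision` enum and the `decide` function are
hypothetical names introduced for illustration, and the meaning of
`base` (absolute offset of the current chunk start) and `total` (bytes
seen since `base`, including the current buffer) is an assumption
inferred from how `ctx.base` and `ctx.total` are used in `scan`.

```rust
/// Outcome of checking one suggested boundary against the current buffer.
/// Hypothetical type, used for illustration only.
#[derive(Debug, PartialEq)]
enum Decision {
    /// Drop the suggestion: it lies in the past or the chunk would be too small.
    Discard,
    /// Cannot use the suggestion (yet): fall back to the hash-based scan.
    HashScan,
    /// Cut after this many bytes of the current buffer.
    CutAt(usize),
}

/// Assumed semantics: `base` is the absolute offset of the chunk start,
/// `total` the bytes seen since `base` including the `data_len` bytes
/// of the buffer currently being scanned.
fn decide(
    boundary: u64,
    base: u64,
    total: u64,
    data_len: usize,
    min: usize,
    max: usize,
) -> Decision {
    // Offset of the current buffer relative to the chunk start.
    let pos = total - data_len as u64;

    if boundary < base + pos {
        // Boundary lies before the current buffer.
        return Decision::Discard;
    }
    if boundary > base + total {
        // Boundary lies beyond the current buffer.
        return Decision::HashScan;
    }

    let chunk_size = (boundary - base) as usize;
    if chunk_size < min {
        return Decision::Discard;
    }
    if chunk_size <= max {
        // Boundary expressed relative to the start of the buffer.
        return Decision::CutAt(chunk_size - pos as usize);
    }
    Decision::HashScan
}

fn main() {
    // Chunk starts at 1000, 800 bytes seen, last 300 of them in the buffer.
    let decision = decide(1600, 1000, 800, 300, 256, 4096);
    println!("{decision:?}");
}
```

In this model a boundary that falls inside the current buffer and
yields a chunk within the min/max range is the only case that overrides
the hash-based boundary.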

 pbs-datastore/src/chunker.rs | 81 ++++++++++++++++++++++++++++++++++++
 pbs-datastore/src/lib.rs     |  2 +-
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/pbs-datastore/src/chunker.rs b/pbs-datastore/src/chunker.rs
index 119b88a03..ceda2d7de 100644
--- a/pbs-datastore/src/chunker.rs
+++ b/pbs-datastore/src/chunker.rs
@@ -1,3 +1,5 @@
+use std::sync::mpsc::Receiver;
+
 /// Note: window size 32 or 64, is faster because we can
 /// speedup modulo operations, but always computes hash 0
 /// for constant data streams .. 0,0,0,0,0,0
@@ -45,6 +47,16 @@ pub struct ChunkerImpl {
     window: [u8; CA_CHUNKER_WINDOW_SIZE],
 }
 
+/// Sliding window chunker (Buzhash) with boundary suggestions
+///
+/// Suggest to chunk at a given boundary instead of the regular chunk boundary for better alignment
+/// with file payload boundaries.
+pub struct PayloadChunker {
+    chunker: ChunkerImpl,
+    current_suggested: Option<u64>,
+    suggested_boundaries: Receiver<u64>,
+}
+
 const BUZHASH_TABLE: [u32; 256] = [
     0x458be752, 0xc10748cc, 0xfbbcdbb8, 0x6ded5b68, 0xb10a82b5, 0x20d75648, 0xdfc5665f, 0xa8428801,
     0x7ebf5191, 0x841135c7, 0x65cc53b3, 0x280a597c, 0x16f60255, 0xc78cbc3e, 0x294415f5, 0xb938d494,
@@ -214,6 +226,75 @@ impl Chunker for ChunkerImpl {
     }
 }
 
+impl PayloadChunker {
+    /// Create a new PayloadChunker instance, which produces an average
+    /// chunk size of `chunk_size_avg` (needs to be a power of two), if no
+    /// suggested boundaries are provided.
+    /// Use suggested boundaries instead, whenever the chunk size is within
+    /// the min - max range.
+    pub fn new(chunk_size_avg: usize, suggested_boundaries: Receiver<u64>) -> Self {
+        Self {
+            chunker: ChunkerImpl::new(chunk_size_avg),
+            current_suggested: None,
+            suggested_boundaries,
+        }
+    }
+}
+
+impl Chunker for PayloadChunker {
+    fn scan(&mut self, data: &[u8], ctx: &Context) -> usize {
+        let pos = ctx.total - data.len() as u64;
+
+        loop {
+            if let Some(boundary) = self.current_suggested {
+                if boundary < ctx.base + pos {
+                    log::debug!("Boundary {boundary} in past");
+                    // ignore passed boundaries
+                    self.current_suggested = None;
+                    continue;
+                }
+
+                if boundary > ctx.base + ctx.total {
+                    log::debug!("Boundary {boundary} in future");
+                    // boundary in future, cannot decide yet
+                    return self.chunker.scan(data, ctx);
+                }
+
+                let chunk_size = (boundary - ctx.base) as usize;
+                if chunk_size < self.chunker.chunk_size_min {
+                    log::debug!("Chunk size {chunk_size} below minimum chunk size");
+                    // chunk too small, ignore boundary
+                    self.current_suggested = None;
+                    continue;
+                }
+
+                if chunk_size <= self.chunker.chunk_size_max {
+                    log::debug!("Chunk at suggested boundary: {boundary}, {chunk_size}");
+                    self.current_suggested = None;
+                    // calculate boundary relative to start of given data buffer
+                    let len = chunk_size - pos as usize;
+                    log::debug!("Chunk at suggested boundary: {boundary}, chunk size {chunk_size}, len {len}");
+                    // although we ignore the output, consume the data with the chunker
+                    let _ignore = self.chunker.scan(&data[..len], ctx);
+                    return len;
+                }
+
+                log::debug!("Chunk {chunk_size} too big, regular scan");
+                // chunk too big, cannot decide yet
+                // scan for hash based chunk boundary instead
+                return self.chunker.scan(data, ctx);
+            }
+
+            if let Ok(boundary) = self.suggested_boundaries.try_recv() {
+                self.current_suggested = Some(boundary);
+            } else {
+                log::debug!("No suggested boundary, regular scan");
+                return self.chunker.scan(data, ctx);
+            }
+        }
+    }
+}
+
 #[test]
 fn test_chunker1() {
     let mut buffer = Vec::new();
diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 24429626c..3e4aa34c2 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -196,7 +196,7 @@ pub use backup_info::{BackupDir, BackupGroup, BackupInfo};
 pub use checksum_reader::ChecksumReader;
 pub use checksum_writer::ChecksumWriter;
 pub use chunk_store::ChunkStore;
-pub use chunker::{Chunker, ChunkerImpl};
+pub use chunker::{Chunker, ChunkerImpl, PayloadChunker};
 pub use crypt_reader::CryptReader;
 pub use crypt_writer::CryptWriter;
 pub use data_blob::DataBlob;
-- 
2.39.2
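
For reference, the way `scan` polls the `suggested_boundaries` receiver
without blocking can be sketched with a plain std::sync::mpsc channel.
The helper `next_suggestion` is a hypothetical name introduced here for
illustration; the sending side is not part of this patch, so how the
sender is driven below is an assumption.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical helper: return the pending suggestion, fetching a new one
/// from the channel only if none is currently held. Never blocks.
fn next_suggestion(current: &mut Option<u64>, rx: &Receiver<u64>) -> Option<u64> {
    if current.is_none() {
        // try_recv() returns Err both when the channel is empty and when
        // the sender is gone; either way there is no suggestion to use.
        *current = rx.try_recv().ok();
    }
    *current
}

fn main() {
    let (tx, rx) = channel::<u64>();
    let mut current = None;

    // Nothing sent yet: the chunker would fall back to the regular scan.
    assert_eq!(next_suggestion(&mut current, &rx), None);

    tx.send(4096).unwrap();
    tx.send(8192).unwrap();

    // The first suggestion is held until it is consumed or discarded.
    assert_eq!(next_suggestion(&mut current, &rx), Some(4096));
    assert_eq!(next_suggestion(&mut current, &rx), Some(4096));

    // Clearing the held value makes the next call fetch the following one.
    current = None;
    assert_eq!(next_suggestion(&mut current, &rx), Some(8192));

    // A dropped sender behaves like an empty channel.
    drop(tx);
    current = None;
    assert_eq!(next_suggestion(&mut current, &rx), None);
}
```

Because try_recv() never blocks, a slow or finished producer only means
that the chunker falls back to hash-based boundaries.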



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel