public inbox for pbs-devel@lists.proxmox.com
* [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup
@ 2024-03-28 12:36 Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments Christian Ebner
                   ` (59 more replies)
  0 siblings, 60 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

A big thank you to Dietmar and Fabian for the review of the previous
version and Fabian for extensive testing and help during debugging.

This series of patches implements a metadata-based file change
detection mechanism to improve pxar file-level backup creation speed
for unchanged files.

The chosen approach is to split pxar archives on creation via the
proxmox-backup-client into two separate data and upload streams,
one exclusively for regular file payloads, the other for the rest
of the pxar archive, which is mostly metadata.

On consecutive runs, the metadata archive of the previous backup run,
which is limited in size and therefore quick to access, is used to
look up and compare the metadata for entries to encode.
This assumes that the connection speed to the Proxmox Backup Server is
sufficiently fast to allow downloading and caching the chunks for
that index.

Changes to regular files are detected by comparing the file's full
metadata object, including mtime, ACLs, etc. If no changes are detected,
the previous payload index is used to look up chunks to possibly re-use
in the payload stream of the new archive.
In order to reduce possible chunk fragmentation, the decision whether to
re-use or re-encode a file payload is deferred until enough information
is gathered by adding entries to a look-ahead cache. If the padding
introduced by reusing chunks falls below a threshold, the entries are
referenced and the chunks are re-used and injected into the pxar payload
upload stream; otherwise they are discarded and the files encoded
regularly. A minimal sketch of this decision is given below.
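
The following Rust sketch illustrates the deferred decision; the type
and threshold semantics here are illustrative assumptions, not the
actual implementation in this series:
```rust
/// Illustrative only: a cached range of chunks that could be re-used.
struct ReusableRange {
    chunk_bytes: u64,   // total size of the chunks that would be re-used
    payload_bytes: u64, // payload bytes actually referenced within them
}

/// Decide whether cached entries should reference previous chunks or be
/// re-encoded from scratch, based on the padding the re-use introduces.
fn reuse_cached_entries(ranges: &[ReusableRange], padding_threshold: f64) -> bool {
    let chunk_bytes: u64 = ranges.iter().map(|r| r.chunk_bytes).sum();
    let payload_bytes: u64 = ranges.iter().map(|r| r.payload_bytes).sum();
    let padding = chunk_bytes.saturating_sub(payload_bytes);
    // Re-use and inject the chunks only while the padding they introduce
    // stays below the threshold; otherwise re-encode the files regularly.
    (padding as f64) < (chunk_bytes as f64) * padding_threshold
}
```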

The following lists the most notable changes included in this series
since version 2:
- many bugfixes for incorrect archive encoding caused by wrong offset
  generation, adding additional sanity checks so that encoding fails
  rather than producing an incorrectly encoded archive
- different approach for deciding whether to re-use or re-encode the
  entries. Previously, the entries were encoded when a cached payload
  size threshold was reached. Now, the padding introduced by reusable
  chunks is tracked, and the entries are only re-used if the padding
  does not exceed the set threshold. This reduces the possible padding
  at the cost of re-encoding more entries. It also avoids re-using
  chunks which now contain large padding holes because of moved/removed
  files contained within.
- added headers for metadata archive and payload file
- added documentation

An invocation of a backup run with these patches now is:
```bash
proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
```
During the first run, no reference index is available, but the pxar
archive will already be split into the two parts.
Subsequent backups will then utilize the pxar archive accessor and
index files of the previous run to perform file change detection.

As benchmarks, the Linux source code as well as the COCO dataset for
computer vision and pattern recognition can be used.
The benchmarks can be performed by running:
```bash
proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
```

The above command invocations assume that the default repository and
credentials are set as environment variables; they might, however, be
passed as additional optional parameters instead.
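
For reference, the environment could be prepared along these lines
(repository and secret values are placeholders):
```bash
export PBS_REPOSITORY='user@pbs@backup-server:datastore'
export PBS_PASSWORD='<secret>'
```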

pxar:

Christian Ebner (14):
  encoder: fix two typos in comments
  format/examples: add PXAR_PAYLOAD_REF entry header
  decoder: add method to read payload references
  decoder: factor out skip part from skip_entry
  encoder: add optional output writer for file payloads
  encoder: move to stack based state tracking
  decoder/accessor: add optional payload input stream
  encoder: add payload reference capability
  encoder: add payload position capability
  encoder: add payload advance capability
  encoder/format: finish payload stream with marker
  format: add payload stream start marker
  format: add pxar format version entry
  format/encoder/decoder: add entry type cli params

 examples/apxar.rs            |   2 +-
 examples/mk-format-hashes.rs |  21 ++
 examples/pxarcmd.rs          |   7 +-
 src/accessor/aio.rs          |  10 +-
 src/accessor/mod.rs          |  52 +++-
 src/accessor/sync.rs         |   8 +-
 src/decoder/aio.rs           |  14 +-
 src/decoder/mod.rs           | 191 ++++++++++++--
 src/decoder/sync.rs          |  15 +-
 src/encoder/aio.rs           |  87 +++++--
 src/encoder/mod.rs           | 475 +++++++++++++++++++++++++----------
 src/encoder/sync.rs          |  67 ++++-
 src/format/mod.rs            |  63 +++++
 src/lib.rs                   |   9 +
 tests/simple/main.rs         |   3 +
 15 files changed, 827 insertions(+), 197 deletions(-)

proxmox-backup:

Christian Ebner (44):
  client: pxar: switch to stack based encoder state
  client: backup writer: only borrow http client
  client: backup: factor out extension from backup target
  client: backup: early check for fixed index type
  client: pxar: combine writer params into struct
  client: backup: split payload to dedicated stream
  client: helper: add helpers for creating reader instances
  client: helper: add method for split archive name mapping
  client: restore: read payload from dedicated index
  tools: cover meta extension for pxar archives
  restore: cover meta extension for pxar archives
  client: mount: make split pxar archives mountable
  api: datastore: refactor getting local chunk reader
  api: datastore: attach optional payload chunk reader
  catalog: shell: factor out pxar fuse reader instantiation
  catalog: shell: redirect payload reader for split streams
  www: cover meta extension for pxar archives
  pxar: add optional payload input for archive restore
  pxar: add more context to extraction error
  client: pxar: include payload offset in output
  pxar: show padding in debug output on archive list
  datastore: dynamic index: add method to get digest
  client: pxar: helper for lookup of reusable dynamic entries
  upload stream: impl reused chunk injector
  client: chunk stream: add struct to hold injection state
  client: chunk stream: add dynamic entries injection queues
  specs: add backup detection mode specification
  client: implement prepare reference method
  client: pxar: implement store to insert chunks on caching
  client: pxar: add previous reference to archiver
  client: pxar: add method for metadata comparison
  pxar: caching: add look-ahead cache types
  client: pxar: add look-ahead caching
  fix #3174: client: pxar: enable caching and meta comparison
  client: backup: increase average chunk size for metadata
  client: backup writer: add injected chunk count to stats
  pxar: create: show chunk injection stats debug output
  client: pxar: add entry kind format version
  client: pxar: opt encode cli exclude patterns as CliParams
  client: pxar: add flow chart for metadata change detection
  docs: describe file format for split payload files
  docs: add section describing change detection mode
  test-suite: add detection mode change benchmark
  test-suite: add bin to deb, add shell completions

 Cargo.toml                                    |   1 +
 Makefile                                      |  13 +-
 debian/proxmox-backup-client.bash-completion  |   1 +
 debian/proxmox-backup-client.install          |   2 +
 debian/proxmox-backup-test-suite.bc           |   8 +
 docs/backup-client.rst                        |  33 +
 docs/file-formats.rst                         |  32 +
 docs/meta-format-overview.dot                 |  50 ++
 examples/test_chunk_speed2.rs                 |   2 +-
 examples/upload-speed.rs                      |   2 +-
 pbs-client/src/backup_specification.rs        |  40 +
 pbs-client/src/backup_writer.rs               | 103 ++-
 pbs-client/src/chunk_stream.rs                |  60 +-
 pbs-client/src/inject_reused_chunks.rs        | 152 ++++
 pbs-client/src/lib.rs                         |   3 +-
 pbs-client/src/pxar/create.rs                 | 779 +++++++++++++++++-
 pbs-client/src/pxar/extract.rs                |   2 +
 ...t-metadata-based-file-change-detection.svg |   1 +
 ...t-metadata-based-file-change-detection.txt |  12 +
 pbs-client/src/pxar/look_ahead_cache.rs       |  38 +
 pbs-client/src/pxar/mod.rs                    |   3 +-
 pbs-client/src/pxar/tools.rs                  | 123 ++-
 pbs-client/src/pxar_backup_stream.rs          |  57 +-
 pbs-client/src/tools/mod.rs                   |   5 +-
 pbs-datastore/src/dynamic_index.rs            |   5 +
 pbs-pxar-fuse/src/lib.rs                      |   2 +-
 proxmox-backup-client/src/benchmark.rs        |   2 +-
 proxmox-backup-client/src/catalog.rs          |  42 +-
 proxmox-backup-client/src/helper.rs           |  64 ++
 proxmox-backup-client/src/main.rs             | 281 ++++++-
 proxmox-backup-client/src/mount.rs            |  54 +-
 proxmox-backup-test-suite/Cargo.toml          |  18 +
 .../src/detection_mode_bench.rs               | 294 +++++++
 proxmox-backup-test-suite/src/main.rs         |  17 +
 proxmox-file-restore/src/main.rs              |  20 +-
 .../src/proxmox_restore_daemon/api.rs         |  16 +-
 pxar-bin/src/main.rs                          |  53 +-
 src/api2/admin/datastore.rs                   |  47 +-
 src/api2/tape/restore.rs                      |   4 +-
 src/bin/proxmox_backup_debug/diff.rs          |   2 +-
 src/tape/file_formats/snapshot_archive.rs     |   9 +-
 tests/catar.rs                                |   4 +-
 www/datastore/Content.js                      |   6 +-
 zsh-completions/_proxmox-backup-test-suite    |  13 +
 44 files changed, 2219 insertions(+), 256 deletions(-)
 create mode 100644 debian/proxmox-backup-test-suite.bc
 create mode 100644 docs/meta-format-overview.dot
 create mode 100644 pbs-client/src/inject_reused_chunks.rs
 create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
 create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
 create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
 create mode 100644 proxmox-backup-client/src/helper.rs
 create mode 100644 proxmox-backup-test-suite/Cargo.toml
 create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
 create mode 100644 proxmox-backup-test-suite/src/main.rs
 create mode 100644 zsh-completions/_proxmox-backup-test-suite

-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03  9:12   ` [pbs-devel] applied: " Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 02/58] format/examples: add PXAR_PAYLOAD_REF entry header Christian Ebner
                   ` (58 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 src/encoder/mod.rs | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 0d342ec..c93f13b 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -166,7 +166,7 @@ where
     seq_write_all(output, &[0u8], position).await
 }
 
-/// Write a pxar entry consiting of an endian-swappable struct.
+/// Write a pxar entry consisting of an endian-swappable struct.
 async fn seq_write_pxar_struct_entry<E, T>(
     output: &mut T,
     htype: u64,
@@ -188,7 +188,7 @@ where
 pub enum EncodeError {
     /// The user dropped a `File` without without finishing writing all of its contents.
     ///
-    /// This is required because the payload lengths is written out at the begining and decoding
+    /// This is required because the payload lengths is written out at the beginning and decoding
     /// requires there to follow the right amount of data.
     IncompleteFile,
 
-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 02/58] format/examples: add PXAR_PAYLOAD_REF entry header
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 03/58] decoder: add method to read payload references Christian Ebner
                   ` (57 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Introduces a new PXAR_PAYLOAD_REF entry header to mark regular file
payloads which are not encoded within the regular pxar archive stream
but rather redirected to a different output stream.

The corresponding header marks the entry containing all the necessary
data for restoring the actual payload from the dedicated payload stream.

Further, add a dedicated type to store offset and size as well as the
methods to serialize its values for encoding.
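
For illustration, the resulting on-disk layout of such an entry in the
metadata stream is roughly as follows (a sketch derived from the struct
below; the header's size field includes the 16 header bytes, and all
fields are little-endian):
```
Header { htype: PXAR_PAYLOAD_REF, size: 32 }  // 16 bytes
offset: u64  // 8 bytes, offset of the payload in the payload stream
size:   u64  // 8 bytes, payload size
```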

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- Added struct and serialization method

 examples/mk-format-hashes.rs |  5 +++++
 src/format/mod.rs            | 20 ++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
index 6e00654..83adb38 100644
--- a/examples/mk-format-hashes.rs
+++ b/examples/mk-format-hashes.rs
@@ -41,6 +41,11 @@ const CONSTANTS: &[(&str, &str, &str)] = &[
         "PXAR_PAYLOAD",
         "__PROXMOX_FORMAT_PXAR_PAYLOAD__",
     ),
+    (
+        "Marks the beginning of a payload reference for regular files",
+        "PXAR_PAYLOAD_REF",
+        "__PROXMOX_FORMAT_PXAR_PAYLOAD_REF__",
+    ),
     (
         "Marks item as entry of goodbye table",
         "PXAR_GOODBYE",
diff --git a/src/format/mod.rs b/src/format/mod.rs
index bfea9f6..1fda535 100644
--- a/src/format/mod.rs
+++ b/src/format/mod.rs
@@ -22,6 +22,7 @@
 //!   * `FCAPS`             -- file capability in Linux disk format
 //!   * `QUOTA_PROJECT_ID`  -- the ext4/xfs quota project ID
 //!   * `PAYLOAD`           -- file contents, if it is one
+//!   * `PAYLOAD_REF`       -- reference to file offset in split payload archive (introduced in v2)
 //!   * `SYMLINK`           -- symlink target, if it is one
 //!   * `DEVICE`            -- device major/minor, if it is a block/char device
 //!
@@ -99,6 +100,8 @@ pub const PXAR_QUOTA_PROJID: u64 = 0xe07540e82f7d1cbb;
 pub const PXAR_HARDLINK: u64 = 0x51269c8422bd7275;
 /// Marks the beginning of the payload (actual content) of regular files
 pub const PXAR_PAYLOAD: u64 = 0x28147a1b0b7c1a25;
+/// Marks the beginning of a payload reference for regular files
+pub const PXAR_PAYLOAD_REF: u64 = 0x419d3d6bc4ba977e;
 /// Marks item as entry of goodbye table
 pub const PXAR_GOODBYE: u64 = 0x2fec4fa642d5731d;
 /// The end marker used in the GOODBYE object
@@ -152,6 +155,7 @@ impl Header {
             PXAR_QUOTA_PROJID => size_of::<QuotaProjectId>() as u64,
             PXAR_ENTRY => size_of::<Stat>() as u64,
             PXAR_PAYLOAD | PXAR_GOODBYE => u64::MAX - (size_of::<Self>() as u64),
+            PXAR_PAYLOAD_REF => size_of::<PayloadRef>() as u64,
             _ => u64::MAX - (size_of::<Self>() as u64),
         }
     }
@@ -192,6 +196,7 @@ impl Display for Header {
             PXAR_QUOTA_PROJID => "QUOTA_PROJID",
             PXAR_ENTRY => "ENTRY",
             PXAR_PAYLOAD => "PAYLOAD",
+            PXAR_PAYLOAD_REF => "PAYLOAD_REF",
             PXAR_GOODBYE => "GOODBYE",
             _ => "UNKNOWN",
         };
@@ -723,6 +728,21 @@ impl GoodbyeItem {
     }
 }
 
+/// References a regular file payload found in a separated payload archive
+#[derive(Clone, Debug, Endian)]
+pub struct PayloadRef {
+    pub offset: u64,
+    pub size: u64,
+}
+
+impl PayloadRef {
+    pub(crate) fn data(&self) -> Vec<u8> {
+        let mut data = self.offset.to_le_bytes().to_vec();
+        data.append(&mut self.size.to_le_bytes().to_vec());
+        data
+    }
+}
+
 /// Hash a file name for use in the goodbye table.
 pub fn hash_filename(name: &[u8]) -> u64 {
     use std::hash::Hasher;
-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 03/58] decoder: add method to read payload references
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 02/58] format/examples: add PXAR_PAYLOAD_REF entry header Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry Christian Ebner
                   ` (56 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

This is in preparation for reading payloads from a dedicated payload
input stream.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- decode based on format struct, instead of reading u64 and constructing
  object

 src/decoder/mod.rs | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index d1fb911..cc50e4f 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -661,6 +661,17 @@ impl<I: SeqRead> DecoderImpl<I> {
     async fn read_quota_project_id(&mut self) -> io::Result<format::QuotaProjectId> {
         self.read_simple_entry("quota project id").await
     }
+
+    async fn read_payload_ref(&mut self) -> io::Result<format::PayloadRef> {
+        let content_size =
+            usize::try_from(self.current_header.content_size()).map_err(io_err_other)?;
+
+        if content_size != 2 * size_of::<u64>() {
+            io_bail!("bad payload reference entry");
+        }
+
+        seq_read_entry(&mut self.input).await
+    }
 }
 
 /// Reader for file contents inside a pxar archive.
-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (2 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 03/58] decoder: add method to read payload references Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03  9:18   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 05/58] encoder: add optional output writer for file payloads Christian Ebner
                   ` (55 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Make the skip part reusable for a different input.

This is in preparation for skipping payload padding in a separate
input.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 src/decoder/mod.rs | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index cc50e4f..f439327 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -563,15 +563,19 @@ impl<I: SeqRead> DecoderImpl<I> {
     //
 
     async fn skip_entry(&mut self, offset: u64) -> io::Result<()> {
-        let mut len = self.current_header.content_size() - offset;
+        let len = (self.current_header.content_size() - offset) as usize;
+        Self::skip(&mut self.input, len).await
+    }
+
+    async fn skip(input: &mut I, len: usize) -> io::Result<()> {
+        let mut len = len;
         let scratch = scratch_buffer();
-        while len >= (scratch.len() as u64) {
-            seq_read_exact(&mut self.input, scratch).await?;
-            len -= scratch.len() as u64;
+        while len >= (scratch.len()) {
+            seq_read_exact(input, scratch).await?;
+            len -= scratch.len();
         }
-        let len = len as usize;
         if len > 0 {
-            seq_read_exact(&mut self.input, &mut scratch[..len]).await?;
+            seq_read_exact(input, &mut scratch[..len]).await?;
         }
         Ok(())
     }
-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 05/58] encoder: add optional output writer for file payloads
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (3 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking Christian Ebner
                   ` (54 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

During regular pxar archive encoding, the payload of regular files is
written as part of the archive.

This patch introduces functionality to attach an optional, dedicated
writer instance to redirect the payload to a different output.
The intention for this change is to separate data and metadata streams,
so payload data can be reused by referencing the payload writer byte
offset, without having to re-encode it.

Whenever the payload of regular files is redirected to a dedicated
output writer, encode a payload reference header followed by the data
required to locate the payload, instead of adding the regular payload
header followed by the encoded payload to the archive.

This is in preparation for reusing payload chunks for unchanged files
of backups created via the proxmox-backup-client.
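
As a usage sketch (file names and metadata are illustrative; based on
the new synchronous `Encoder::new` signature below):
```rust
use pxar::encoder::sync::{Encoder, StandardWriter};
use pxar::Metadata;

fn create_split_archive() -> std::io::Result<()> {
    let metadata_out = std::fs::File::create("archive.pxar")?;
    let payload_out = std::fs::File::create("archive.pxar.payload")?;
    // Passing `Some(..)` redirects all regular file payloads to the
    // dedicated output; `None` keeps the previous single-stream behavior.
    let mut encoder = Encoder::new(
        StandardWriter::new(metadata_out),
        &Metadata::dir_builder(0o755).build(),
        Some(StandardWriter::new(payload_out)),
    )?;
    // ... encode entries ...
    encoder.finish()?; // mandatory, writes the goodbye table
    Ok(())
}
```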

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- pass optional payload output on instantiation, so it cannot be
  attached/detached during encoding.
- major refactoring

 src/encoder/aio.rs  | 24 +++++++++----
 src/encoder/mod.rs  | 85 +++++++++++++++++++++++++++++++++++++++------
 src/encoder/sync.rs | 13 +++++--
 3 files changed, 103 insertions(+), 19 deletions(-)

diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index ad25fea..31a1a2f 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -24,8 +24,14 @@ impl<'a, T: tokio::io::AsyncWrite + 'a> Encoder<'a, TokioWriter<T>> {
     pub async fn from_tokio(
         output: T,
         metadata: &Metadata,
+        payload_output: Option<T>,
     ) -> io::Result<Encoder<'a, TokioWriter<T>>> {
-        Encoder::new(TokioWriter::new(output), metadata).await
+        Encoder::new(
+            TokioWriter::new(output),
+            metadata,
+            payload_output.map(|payload_output| TokioWriter::new(payload_output)),
+        )
+        .await
     }
 }
 
@@ -39,6 +45,7 @@ impl<'a> Encoder<'a, TokioWriter<tokio::fs::File>> {
         Encoder::new(
             TokioWriter::new(tokio::fs::File::create(path.as_ref()).await?),
             metadata,
+            None,
         )
         .await
     }
@@ -46,9 +53,13 @@ impl<'a> Encoder<'a, TokioWriter<tokio::fs::File>> {
 
 impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
     /// Create an asynchronous encoder for an output implementing our internal write interface.
-    pub async fn new(output: T, metadata: &Metadata) -> io::Result<Encoder<'a, T>> {
+    pub async fn new(
+        output: T,
+        metadata: &Metadata,
+        payload_output: Option<T>,
+    ) -> io::Result<Encoder<'a, T>> {
         Ok(Self {
-            inner: encoder::EncoderImpl::new(output.into(), metadata).await?,
+            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output).await?,
         })
     }
 
@@ -291,9 +302,10 @@ mod test {
     /// Assert that `Encoder` is `Send`
     fn send_test() {
         let test = async {
-            let mut encoder = Encoder::new(DummyOutput, &Metadata::dir_builder(0o700).build())
-                .await
-                .unwrap();
+            let mut encoder =
+                Encoder::new(DummyOutput, &Metadata::dir_builder(0o700).build(), None)
+                    .await
+                    .unwrap();
             {
                 let mut dir = encoder
                     .create_directory("baba", &Metadata::dir_builder(0o700).build())
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index c93f13b..bff6acf 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -17,7 +17,7 @@ use endian_trait::Endian;
 
 use crate::binary_tree_array;
 use crate::decoder::{self, SeqRead};
-use crate::format::{self, GoodbyeItem};
+use crate::format::{self, GoodbyeItem, PayloadRef};
 use crate::Metadata;
 
 pub mod aio;
@@ -221,6 +221,9 @@ struct EncoderState {
 
     /// We need to keep track how much we have written to get offsets.
     write_position: u64,
+
+    /// Track the bytes written to the payload writer
+    payload_write_position: u64,
 }
 
 impl EncoderState {
@@ -278,6 +281,7 @@ impl<'a, T> std::convert::From<&'a mut T> for EncoderOutput<'a, T> {
 /// synchronous or `async` I/O objects in as output.
 pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
     output: EncoderOutput<'a, T>,
+    payload_output: EncoderOutput<'a, Option<T>>,
     state: EncoderState,
     parent: Option<&'a mut EncoderState>,
     finished: bool,
@@ -306,12 +310,14 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     pub async fn new(
         output: EncoderOutput<'a, T>,
         metadata: &Metadata,
+        payload_output: Option<T>,
     ) -> io::Result<EncoderImpl<'a, T>> {
         if !metadata.is_dir() {
             io_bail!("directory metadata must contain the directory mode flag");
         }
         let mut this = Self {
             output,
+            payload_output: EncoderOutput::Owned(None),
             state: EncoderState::default(),
             parent: None,
             finished: false,
@@ -323,6 +329,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         this.encode_metadata(metadata).await?;
         this.state.files_offset = this.position();
 
+        if let Some(payload_output) = payload_output {
+            this.payload_output = EncoderOutput::Owned(Some(payload_output));
+        }
+
         Ok(this)
     }
 
@@ -361,10 +371,33 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         let file_offset = self.position();
         self.start_file_do(Some(metadata), file_name).await?;
 
-        let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
-        header.check_header_size()?;
+        if let Some(payload_output) = self.payload_output.as_mut() {
+            // Position prior to the payload header
+            let payload_position = self.state.payload_write_position;
+
+            // Separate payloads in payload archive PXAR_PAYLOAD markers
+            let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
+            header.check_header_size()?;
+            seq_write_struct(payload_output, header, &mut self.state.payload_write_position).await?;
 
-        seq_write_struct(self.output.as_mut(), header, &mut self.state.write_position).await?;
+            let payload_ref = PayloadRef {
+                offset: payload_position,
+                size: file_size,
+            };
+
+            // Write ref to metadata archive
+            seq_write_pxar_entry(
+                self.output.as_mut(),
+                format::PXAR_PAYLOAD_REF,
+                &payload_ref.data(),
+                &mut self.state.write_position,
+            )
+            .await?;
+        } else {
+            let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
+            header.check_header_size()?;
+            seq_write_struct(self.output.as_mut(), header, &mut self.state.write_position).await?;
+        }
 
         let payload_data_offset = self.position();
 
@@ -372,6 +405,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
 
         Ok(FileImpl {
             output: self.output.as_mut(),
+            payload_output: self.payload_output.as_mut().as_mut(),
             goodbye_item: GoodbyeItem {
                 hash: format::hash_filename(file_name),
                 offset: file_offset,
@@ -564,6 +598,11 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         self.state.write_position
     }
 
+    #[inline]
+    fn payload_position(&mut self) -> u64 {
+        self.state.payload_write_position
+    }
+
     pub async fn create_directory(
         &mut self,
         file_name: &Path,
@@ -588,18 +627,21 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
 
         // the child will write to OUR state now:
         let write_position = self.position();
+        let payload_write_position = self.payload_position();
 
         let file_copy_buffer = Arc::clone(&self.file_copy_buffer);
 
         Ok(EncoderImpl {
             // always forward as Borrowed(), to avoid stacking references on nested calls
             output: self.output.to_borrowed_mut(),
+            payload_output: self.payload_output.to_borrowed_mut(),
             state: EncoderState {
                 entry_offset,
                 files_offset,
                 file_offset: Some(file_offset),
                 file_hash,
                 write_position,
+                payload_write_position,
                 ..Default::default()
             },
             parent: Some(&mut self.state),
@@ -764,15 +806,21 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         )
         .await?;
 
+        if let EncoderOutput::Owned(Some(output)) = &mut self.payload_output {
+            flush(output).await?;
+        }
+
         if let EncoderOutput::Owned(output) = &mut self.output {
             flush(output).await?;
         }
 
         // done up here because of the self-borrow and to propagate
         let end_offset = self.position();
+        let payload_end_offset = self.payload_position();
 
         if let Some(parent) = &mut self.parent {
             parent.write_position = end_offset;
+            parent.payload_write_position = payload_end_offset;
 
             let file_offset = self
                 .state
@@ -837,6 +885,9 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
 pub(crate) struct FileImpl<'a, S: SeqWrite> {
     output: &'a mut S,
 
+    /// Optional write redirection of file payloads to this sequential stream
+    payload_output: Option<&'a mut S>,
+
     /// This file's `GoodbyeItem`. FIXME: We currently don't touch this, can we just push it
     /// directly instead of on Drop of FileImpl?
     goodbye_item: GoodbyeItem,
@@ -916,19 +967,33 @@ impl<'a, S: SeqWrite> FileImpl<'a, S> {
     /// for convenience.
     pub async fn write(&mut self, data: &[u8]) -> io::Result<usize> {
         self.check_remaining(data.len())?;
-        let put =
-            poll_fn(|cx| unsafe { Pin::new_unchecked(&mut self.output).poll_seq_write(cx, data) })
-                .await?;
-        //let put = seq_write(self.output.as_mut().unwrap(), data).await?;
+        let put = if let Some(mut output) = self.payload_output.as_mut() {
+            let put =
+                poll_fn(|cx| unsafe { Pin::new_unchecked(&mut output).poll_seq_write(cx, data) })
+                    .await?;
+            self.parent.payload_write_position += put as u64;
+            put
+        } else {
+            let put = poll_fn(|cx| unsafe {
+                Pin::new_unchecked(&mut self.output).poll_seq_write(cx, data)
+            })
+            .await?;
+            self.parent.write_position += put as u64;
+            put
+        };
+
         self.remaining_size -= put as u64;
-        self.parent.write_position += put as u64;
         Ok(put)
     }
 
     /// Completely write file data for the current file entry in a pxar archive.
     pub async fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
         self.check_remaining(data.len())?;
-        seq_write_all(self.output, data, &mut self.parent.write_position).await?;
+        if let Some(ref mut output) = self.payload_output {
+            seq_write_all(output, data, &mut self.parent.payload_write_position).await?;
+        } else {
+            seq_write_all(self.output, data, &mut self.parent.write_position).await?;
+        }
         self.remaining_size -= data.len() as u64;
         Ok(())
     }
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index 1ec91b8..96d056d 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -28,7 +28,7 @@ impl<'a, T: io::Write + 'a> Encoder<'a, StandardWriter<T>> {
     /// Encode a `pxar` archive into a regular `std::io::Write` output.
     #[inline]
     pub fn from_std(output: T, metadata: &Metadata) -> io::Result<Encoder<'a, StandardWriter<T>>> {
-        Encoder::new(StandardWriter::new(output), metadata)
+        Encoder::new(StandardWriter::new(output), metadata, None)
     }
 }
 
@@ -41,6 +41,7 @@ impl<'a> Encoder<'a, StandardWriter<std::fs::File>> {
         Encoder::new(
             StandardWriter::new(std::fs::File::create(path.as_ref())?),
             metadata,
+            None,
         )
     }
 }
@@ -50,9 +51,15 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
     ///
     /// Note that the `output`'s `SeqWrite` implementation must always return `Poll::Ready` and is
     /// not allowed to use the `Waker`, as this will cause a `panic!`.
-    pub fn new(output: T, metadata: &Metadata) -> io::Result<Self> {
+    // Optionally attach a dedicated writer to redirect the payloads of regular files to a separate
+    // output.
+    pub fn new(output: T, metadata: &Metadata, payload_output: Option<T>) -> io::Result<Self> {
         Ok(Self {
-            inner: poll_result_once(encoder::EncoderImpl::new(output.into(), metadata))?,
+            inner: poll_result_once(encoder::EncoderImpl::new(
+                output.into(),
+                metadata,
+                payload_output,
+            ))?,
         })
     }
 
-- 
2.39.2


* [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (4 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 05/58] encoder: add optional output writer for file payloads Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03  9:54   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream Christian Ebner
                   ` (53 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In preparation for the proxmox-backup-client look-ahead caching,
where passing around different encoder instances with internal
references is not feasible.

Instead of creating a new encoder instance for each directory level
and keeping references to the parent state, use an internal stack.

This is a breaking change in the pxar library API.
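
A sketch of the resulting calling convention (compare the pxarcmd
example changes below): directory levels are closed via `finish()` on
the one encoder instance, and the encoder itself is consumed by the
new `close()`:
```rust
use pxar::encoder::sync::{Encoder, StandardWriter};
use pxar::Metadata;

fn encode_nested() -> std::io::Result<()> {
    let file = std::fs::File::create("archive.pxar")?;
    let dir_meta = Metadata::dir_builder(0o755).build();
    let mut encoder = Encoder::new(StandardWriter::new(file), &dir_meta, None)?;

    encoder.create_directory("subdir", &dir_meta)?; // pushes a new state
    // ... encode entries inside "subdir" ...
    encoder.finish()?; // pops and finalizes "subdir"

    encoder.finish()?; // finalizes the root directory
    encoder.close()?;  // consumes the encoder, flushes the output(s)
    Ok(())
}
```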

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- consume encoder with new `close` method to finalize
- new output_state helper usable when for both a mut borrow is required
- double checked use of state/state_mut usage
- major refactoring

 examples/pxarcmd.rs  |   7 +-
 src/encoder/aio.rs   |  26 ++--
 src/encoder/mod.rs   | 285 +++++++++++++++++++++++--------------------
 src/encoder/sync.rs  |  16 ++-
 tests/simple/main.rs |   3 +
 5 files changed, 187 insertions(+), 150 deletions(-)

diff --git a/examples/pxarcmd.rs b/examples/pxarcmd.rs
index e0c779d..0294eba 100644
--- a/examples/pxarcmd.rs
+++ b/examples/pxarcmd.rs
@@ -106,6 +106,7 @@ fn cmd_create(mut args: std::env::ArgsOs) -> Result<(), Error> {
     let mut encoder = Encoder::create(file, &meta)?;
     add_directory(&mut encoder, dir, &dir_path, &mut HashMap::new())?;
     encoder.finish()?;
+    encoder.close()?;
 
     Ok(())
 }
@@ -138,14 +139,14 @@ fn add_directory<'a, T: SeqWrite + 'a>(
 
         let meta = Metadata::from(&file_meta);
         if file_type.is_dir() {
-            let mut dir = encoder.create_directory(file_name, &meta)?;
+            encoder.create_directory(file_name, &meta)?;
             add_directory(
-                &mut dir,
+                encoder,
                 std::fs::read_dir(file_path)?,
                 root_path,
                 &mut *hardlinks,
             )?;
-            dir.finish()?;
+            encoder.finish()?;
         } else if file_type.is_symlink() {
             todo!("symlink handling");
         } else if file_type.is_file() {
diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index 31a1a2f..635e550 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -109,20 +109,23 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         &mut self,
         file_name: P,
         metadata: &Metadata,
-    ) -> io::Result<Encoder<'_, T>> {
-        Ok(Encoder {
-            inner: self
-                .inner
-                .create_directory(file_name.as_ref(), metadata)
-                .await?,
-        })
+    ) -> io::Result<()> {
+        self.inner
+            .create_directory(file_name.as_ref(), metadata)
+            .await
     }
 
-    /// Finish this directory. This is mandatory, otherwise the `Drop` handler will `panic!`.
-    pub async fn finish(self) -> io::Result<()> {
+    /// Finish this directory. This is mandatory, encodes the end for the current directory.
+    pub async fn finish(&mut self) -> io::Result<()> {
         self.inner.finish().await
     }
 
+    /// Close the encoder instance. This is mandatory, encodes the end for the optional payload
+    /// output stream, if some is given
+    pub async fn close(self) -> io::Result<()> {
+        self.inner.close().await
+    }
+
     /// Add a symbolic link to the archive.
     pub async fn add_symlink<PF: AsRef<Path>, PT: AsRef<Path>>(
         &mut self,
@@ -307,11 +310,12 @@ mod test {
                     .await
                     .unwrap();
             {
-                let mut dir = encoder
+                encoder
                     .create_directory("baba", &Metadata::dir_builder(0o700).build())
                     .await
                     .unwrap();
-                dir.create_file(&Metadata::file_builder(0o755).build(), "abab", 1024)
+                encoder
+                    .create_file(&Metadata::file_builder(0o755).build(), "abab", 1024)
                     .await
                     .unwrap();
             }
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index bff6acf..31bb0fa 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -227,6 +227,16 @@ struct EncoderState {
 }
 
 impl EncoderState {
+    #[inline]
+    fn position(&self) -> u64 {
+        self.write_position
+    }
+
+    #[inline]
+    fn payload_position(&self) -> u64 {
+        self.payload_write_position
+    }
+
     fn merge_error(&mut self, error: Option<EncodeError>) {
         // one error is enough:
         if self.encode_error.is_none() {
@@ -244,16 +254,6 @@ pub(crate) enum EncoderOutput<'a, T> {
     Borrowed(&'a mut T),
 }
 
-impl<'a, T> EncoderOutput<'a, T> {
-    #[inline]
-    fn to_borrowed_mut<'s>(&'s mut self) -> EncoderOutput<'s, T>
-    where
-        'a: 's,
-    {
-        EncoderOutput::Borrowed(self.as_mut())
-    }
-}
-
 impl<'a, T> std::convert::AsMut<T> for EncoderOutput<'a, T> {
     fn as_mut(&mut self) -> &mut T {
         match self {
@@ -282,8 +282,8 @@ impl<'a, T> std::convert::From<&'a mut T> for EncoderOutput<'a, T> {
 pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
     output: EncoderOutput<'a, T>,
     payload_output: EncoderOutput<'a, Option<T>>,
-    state: EncoderState,
-    parent: Option<&'a mut EncoderState>,
+    /// EncoderState stack storing the state for each directory level
+    state: Vec<EncoderState>,
     finished: bool,
 
     /// Since only the "current" entry can be actively writing files, we share the file copy
@@ -291,21 +291,6 @@ pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
     file_copy_buffer: Arc<Mutex<Vec<u8>>>,
 }
 
-impl<'a, T: SeqWrite + 'a> Drop for EncoderImpl<'a, T> {
-    fn drop(&mut self) {
-        if let Some(ref mut parent) = self.parent {
-            // propagate errors:
-            parent.merge_error(self.state.encode_error);
-            if !self.finished {
-                parent.add_error(EncodeError::IncompleteDirectory);
-            }
-        } else if !self.finished {
-            // FIXME: how do we deal with this?
-            // eprintln!("Encoder dropped without finishing!");
-        }
-    }
-}
-
 impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     pub async fn new(
         output: EncoderOutput<'a, T>,
@@ -318,8 +303,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         let mut this = Self {
             output,
             payload_output: EncoderOutput::Owned(None),
-            state: EncoderState::default(),
-            parent: None,
+            state: vec![EncoderState::default()],
             finished: false,
             file_copy_buffer: Arc::new(Mutex::new(unsafe {
                 crate::util::vec_new_uninitialized(1024 * 1024)
@@ -327,7 +311,8 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         };
 
         this.encode_metadata(metadata).await?;
-        this.state.files_offset = this.position();
+        let state = this.state_mut()?;
+        state.files_offset = state.position();
 
         if let Some(payload_output) = payload_output {
             this.payload_output = EncoderOutput::Owned(Some(payload_output));
@@ -337,13 +322,38 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     }
 
     fn check(&self) -> io::Result<()> {
-        match self.state.encode_error {
+        if self.finished {
+            io_bail!("unexpected encoder finished state");
+        }
+        let state = self.state()?;
+        match state.encode_error {
             Some(EncodeError::IncompleteFile) => io_bail!("incomplete file"),
             Some(EncodeError::IncompleteDirectory) => io_bail!("directory not finalized"),
             None => Ok(()),
         }
     }
 
+    fn state(&self) -> io::Result<&EncoderState> {
+        self.state
+            .last()
+            .ok_or_else(|| io_format_err!("encoder state stack underflow"))
+    }
+
+    fn state_mut(&mut self) -> io::Result<&mut EncoderState> {
+        self.state
+            .last_mut()
+            .ok_or_else(|| io_format_err!("encoder state stack underflow"))
+    }
+
+    fn output_state(&mut self) -> io::Result<(&mut T, &mut EncoderState)> {
+        Ok((
+            self.output.as_mut(),
+            self.state
+                .last_mut()
+                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?,
+        ))
+    }
+
     pub async fn create_file<'b>(
         &'b mut self,
         metadata: &Metadata,
@@ -368,17 +378,22 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     {
         self.check()?;
 
-        let file_offset = self.position();
+        let file_offset = self.state()?.position();
         self.start_file_do(Some(metadata), file_name).await?;
 
         if let Some(payload_output) = self.payload_output.as_mut() {
+            let state = self
+                .state
+                .last_mut()
+                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?;
+
             // Position prior to the payload header
-            let payload_position = self.state.payload_write_position;
+            let payload_position = state.payload_position();
 
             // Separate payloads in payload archive PXAR_PAYLOAD markers
             let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
             header.check_header_size()?;
-            seq_write_struct(payload_output, header, &mut self.state.payload_write_position).await?;
+            seq_write_struct(payload_output, header, &mut state.payload_write_position).await?;
 
             let payload_ref = PayloadRef {
                 offset: payload_position,
@@ -390,16 +405,21 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
                 self.output.as_mut(),
                 format::PXAR_PAYLOAD_REF,
                 &payload_ref.data(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         } else {
             let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
             header.check_header_size()?;
-            seq_write_struct(self.output.as_mut(), header, &mut self.state.write_position).await?;
+            let (output, state) = self.output_state()?;
+            seq_write_struct(output, header, &mut state.write_position).await?;
         }
 
-        let payload_data_offset = self.position();
+        let state = self
+            .state
+            .last_mut()
+            .ok_or_else(|| io_format_err!("encoder state stack underflow"))?;
+        let payload_data_offset = state.position();
 
         let meta_size = payload_data_offset - file_offset;
 
@@ -412,7 +432,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
                 size: file_size + meta_size,
             },
             remaining_size: file_size,
-            parent: &mut self.state,
+            parent: state,
         })
     }
 
@@ -493,7 +513,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         target: &Path,
         target_offset: LinkOffset,
     ) -> io::Result<()> {
-        let current_offset = self.position();
+        let current_offset = self.state()?.position();
         if current_offset <= target_offset.0 {
             io_bail!("invalid hardlink offset, can only point to prior files");
         }
@@ -567,24 +587,20 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     ) -> io::Result<LinkOffset> {
         self.check()?;
 
-        let file_offset = self.position();
+        let file_offset = self.state()?.position();
 
         let file_name = file_name.as_os_str().as_bytes();
 
         self.start_file_do(metadata, file_name).await?;
+
+        let (output, state) = self.output_state()?;
         if let Some((htype, entry_data)) = entry_htype_data {
-            seq_write_pxar_entry(
-                self.output.as_mut(),
-                htype,
-                entry_data,
-                &mut self.state.write_position,
-            )
-            .await?;
+            seq_write_pxar_entry(output, htype, entry_data, &mut state.write_position).await?;
         }
 
-        let end_offset = self.position();
+        let end_offset = state.position();
 
-        self.state.items.push(GoodbyeItem {
+        state.items.push(GoodbyeItem {
             hash: format::hash_filename(file_name),
             offset: file_offset,
             size: end_offset - file_offset,
@@ -593,21 +609,11 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(LinkOffset(file_offset))
     }
 
-    #[inline]
-    fn position(&mut self) -> u64 {
-        self.state.write_position
-    }
-
-    #[inline]
-    fn payload_position(&mut self) -> u64 {
-        self.state.payload_write_position
-    }
-
     pub async fn create_directory(
         &mut self,
         file_name: &Path,
         metadata: &Metadata,
-    ) -> io::Result<EncoderImpl<'_, T>> {
+    ) -> io::Result<()> {
         self.check()?;
 
         if !metadata.is_dir() {
@@ -617,37 +623,30 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         let file_name = file_name.as_os_str().as_bytes();
         let file_hash = format::hash_filename(file_name);
 
-        let file_offset = self.position();
+        let file_offset = self.state()?.position();
         self.encode_filename(file_name).await?;
 
-        let entry_offset = self.position();
+        let entry_offset = self.state()?.position();
         self.encode_metadata(metadata).await?;
 
-        let files_offset = self.position();
+        let state = self.state_mut()?;
+        let files_offset = state.position();
 
         // the child will write to OUR state now:
-        let write_position = self.position();
-        let payload_write_position = self.payload_position();
-
-        let file_copy_buffer = Arc::clone(&self.file_copy_buffer);
-
-        Ok(EncoderImpl {
-            // always forward as Borrowed(), to avoid stacking references on nested calls
-            output: self.output.to_borrowed_mut(),
-            payload_output: self.payload_output.to_borrowed_mut(),
-            state: EncoderState {
-                entry_offset,
-                files_offset,
-                file_offset: Some(file_offset),
-                file_hash,
-                write_position,
-                payload_write_position,
-                ..Default::default()
-            },
-            parent: Some(&mut self.state),
-            finished: false,
-            file_copy_buffer,
-        })
+        let write_position = state.position();
+        let payload_write_position = state.payload_position();
+
+        self.state.push(EncoderState {
+            entry_offset,
+            files_offset,
+            file_offset: Some(file_offset),
+            file_hash,
+            write_position,
+            payload_write_position,
+            ..Default::default()
+        });
+
+        Ok(())
     }
 
     async fn start_file_do(
@@ -663,11 +662,12 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     }
 
     async fn encode_metadata(&mut self, metadata: &Metadata) -> io::Result<()> {
+        let (output, state) = self.output_state()?;
         seq_write_pxar_struct_entry(
-            self.output.as_mut(),
+            output,
             format::PXAR_ENTRY,
             metadata.stat.clone(),
-            &mut self.state.write_position,
+            &mut state.write_position,
         )
         .await?;
 
@@ -689,72 +689,74 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     }
 
     async fn write_xattr(&mut self, xattr: &format::XAttr) -> io::Result<()> {
+        let (output, state) = self.output_state()?;
         seq_write_pxar_entry(
-            self.output.as_mut(),
+            output,
             format::PXAR_XATTR,
             &xattr.data,
-            &mut self.state.write_position,
+            &mut state.write_position,
         )
         .await
     }
 
     async fn write_acls(&mut self, acl: &crate::Acl) -> io::Result<()> {
+        let (output, state) = self.output_state()?;
         for acl in &acl.users {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_USER,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
 
         for acl in &acl.groups {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_GROUP,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
 
         if let Some(acl) = &acl.group_obj {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_GROUP_OBJ,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
 
         if let Some(acl) = &acl.default {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_DEFAULT,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
 
         for acl in &acl.default_users {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_DEFAULT_USER,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
 
         for acl in &acl.default_groups {
             seq_write_pxar_struct_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_ACL_DEFAULT_GROUP,
                 acl.clone(),
-                &mut self.state.write_position,
+                &mut state.write_position,
             )
             .await?;
         }
@@ -763,11 +765,12 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     }
 
     async fn write_file_capabilities(&mut self, fcaps: &format::FCaps) -> io::Result<()> {
+        let (output, state) = self.output_state()?;
         seq_write_pxar_entry(
-            self.output.as_mut(),
+            output,
             format::PXAR_FCAPS,
             &fcaps.data,
-            &mut self.state.write_position,
+            &mut state.write_position,
         )
         .await
     }
@@ -776,35 +779,32 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         &mut self,
         quota_project_id: &format::QuotaProjectId,
     ) -> io::Result<()> {
+        let (output, state) = self.output_state()?;
         seq_write_pxar_struct_entry(
-            self.output.as_mut(),
+            output,
             format::PXAR_QUOTA_PROJID,
             *quota_project_id,
-            &mut self.state.write_position,
+            &mut state.write_position,
         )
         .await
     }
 
     async fn encode_filename(&mut self, file_name: &[u8]) -> io::Result<()> {
         crate::util::validate_filename(file_name)?;
+        let (output, state) = self.output_state()?;
         seq_write_pxar_entry_zero(
-            self.output.as_mut(),
+            output,
             format::PXAR_FILENAME,
             file_name,
-            &mut self.state.write_position,
+            &mut state.write_position,
         )
         .await
     }
 
-    pub async fn finish(mut self) -> io::Result<()> {
-        let tail_bytes = self.finish_goodbye_table().await?;
-        seq_write_pxar_entry(
-            self.output.as_mut(),
-            format::PXAR_GOODBYE,
-            &tail_bytes,
-            &mut self.state.write_position,
-        )
-        .await?;
+    pub async fn close(mut self) -> io::Result<()> {
+        if !self.state.is_empty() {
+            io_bail!("unexpected state on encoder close");
+        }
 
         if let EncoderOutput::Owned(Some(output)) = &mut self.payload_output {
             flush(output).await?;
@@ -814,34 +814,59 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
             flush(output).await?;
         }
 
-        // done up here because of the self-borrow and to propagate
-        let end_offset = self.position();
-        let payload_end_offset = self.payload_position();
+        self.finished = true;
+
+        Ok(())
+    }
+
+    pub async fn finish(&mut self) -> io::Result<()> {
+        let tail_bytes = self.finish_goodbye_table().await?;
+        let mut state = self
+            .state
+            .pop()
+            .ok_or_else(|| io_format_err!("encoder state stack underflow"))?;
+        seq_write_pxar_entry(
+            self.output.as_mut(),
+            format::PXAR_GOODBYE,
+            &tail_bytes,
+            &mut state.write_position,
+        )
+        .await?;
+
+        let end_offset = state.position();
+        let payload_end_offset = state.payload_position();
 
-        if let Some(parent) = &mut self.parent {
+        if let Some(parent) = self.state.last_mut() {
             parent.write_position = end_offset;
             parent.payload_write_position = payload_end_offset;
 
-            let file_offset = self
-                .state
+            let file_offset = state
                 .file_offset
                 .expect("internal error: parent set but no file_offset?");
 
             parent.items.push(GoodbyeItem {
-                hash: self.state.file_hash,
+                hash: state.file_hash,
                 offset: file_offset,
                 size: end_offset - file_offset,
             });
+            // propagate errors
+            parent.merge_error(state.encode_error);
+            Ok(())
+        } else {
+            match state.encode_error {
+                Some(EncodeError::IncompleteFile) => io_bail!("incomplete file"),
+                Some(EncodeError::IncompleteDirectory) => io_bail!("directory not finalized"),
+                None => Ok(()),
+            }
         }
-        self.finished = true;
-        Ok(())
     }
 
     async fn finish_goodbye_table(&mut self) -> io::Result<Vec<u8>> {
-        let goodbye_offset = self.position();
+        let state = self.state_mut()?;
+        let goodbye_offset = state.position();
 
         // "take" out the tail (to not leave an array of endian-swapped structs in `self`)
-        let mut tail = take(&mut self.state.items);
+        let mut tail = take(&mut state.items);
         let tail_size = (tail.len() + 1) * size_of::<GoodbyeItem>();
         let goodbye_size = tail_size as u64 + size_of::<format::Header>() as u64;
 
@@ -866,7 +891,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         bst.push(
             GoodbyeItem {
                 hash: format::PXAR_GOODBYE_TAIL_MARKER,
-                offset: goodbye_offset - self.state.entry_offset,
+                offset: goodbye_offset - state.entry_offset,
                 size: goodbye_size,
             }
             .to_le(),
@@ -896,8 +921,8 @@ pub(crate) struct FileImpl<'a, S: SeqWrite> {
     /// exactly zero.
     remaining_size: u64,
 
-    /// The directory containing this file. This is where we propagate the `IncompleteFile` error
-    /// to, and where we insert our `GoodbyeItem`.
+    /// The directory stack with the last item being the directory containing this file. This is
+    /// where we propagate the `IncompleteFile` error to, and where we insert our `GoodbyeItem`.
     parent: &'a mut EncoderState,
 }
 
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index 96d056d..d0d62ba 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -106,17 +106,21 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         &mut self,
         file_name: P,
         metadata: &Metadata,
-    ) -> io::Result<Encoder<'_, T>> {
-        Ok(Encoder {
-            inner: poll_result_once(self.inner.create_directory(file_name.as_ref(), metadata))?,
-        })
+    ) -> io::Result<()> {
+        poll_result_once(self.inner.create_directory(file_name.as_ref(), metadata))
     }
 
-    /// Finish this directory. This is mandatory, otherwise the `Drop` handler will `panic!`.
-    pub fn finish(self) -> io::Result<()> {
+    /// Finish this directory. This is mandatory and encodes the end of the current directory.
+    pub fn finish(&mut self) -> io::Result<()> {
         poll_result_once(self.inner.finish())
     }
 
+    /// Close the encoder instance. This is mandatory and encodes the end of the optional
+    /// payload output stream, if one is given.
+    pub fn close(self) -> io::Result<()> {
+        poll_result_once(self.inner.close())
+    }
+
     /// Add a symbolic link to the archive.
     pub fn add_symlink<PF: AsRef<Path>, PT: AsRef<Path>>(
         &mut self,
diff --git a/tests/simple/main.rs b/tests/simple/main.rs
index d661c7d..e55457f 100644
--- a/tests/simple/main.rs
+++ b/tests/simple/main.rs
@@ -51,6 +51,9 @@ fn test1() {
     encoder
         .finish()
         .expect("failed to finish encoding the pxar archive");
+    encoder
+        .close()
+        .expect("failed to close the encoder instance");
 
     assert!(!file.is_empty(), "encoder did not write any data");
 
-- 
2.39.2
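
For reference, a minimal sketch (not part of the patch) of the calling
convention resulting from the stack based state tracking: every directory,
including the root, is finalized with `finish()`, and the encoder instance
itself is consumed by `close()`, as the test change above shows.

```rust
use std::io;
use std::path::Path;
use pxar::encoder::sync::Encoder;
use pxar::encoder::SeqWrite;
use pxar::Metadata;

// Hedged sketch: `dir_metadata` is assumed to carry the directory mode flag,
// as required by the encoder.
fn encode_subdir<T: SeqWrite>(
    encoder: &mut Encoder<'_, T>,
    dir_metadata: &Metadata,
) -> io::Result<()> {
    // pushes a new state onto the encoder's directory stack
    encoder.create_directory(Path::new("subdir"), dir_metadata)?;
    // ... add entries inside "subdir" here ...
    encoder.finish() // pops and finalizes "subdir"
}
```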





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (5 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03 10:38   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 08/58] encoder: add payload reference capability Christian Ebner
                   ` (52 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Implement an optional redirection to read the payload for regular files
from a different input stream.

This allows decoding split stream archives.
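
A hedged usage sketch for the sequential decoder (the split archive file
names are placeholders, chosen for illustration only):

```rust
use pxar::decoder::sync::Decoder;

fn list_entries() -> std::io::Result<()> {
    let metadata = std::fs::File::open("archive.mpxar")?; // metadata stream
    let payload = std::fs::File::open("archive.ppxar")?; // payload stream
    // passing `None` instead keeps the previous single-stream behavior
    let decoder = Decoder::from_std(metadata, Some(payload))?;
    for entry in decoder {
        println!("{:?}", entry?.path());
    }
    Ok(())
}
```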

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- pass the payload input on decoder/accessor instantiation, so it cannot
  be added or removed during decoding/accessing.
- major refactoring
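
The random access counterpart would look roughly like this (again a sketch
under assumed file names; `open_root_ref` borrows the accessor):

```rust
use pxar::accessor::sync::{Accessor, FileReader};

fn open_split_archive() -> std::io::Result<()> {
    let meta_file = std::fs::File::open("archive.mpxar")?;
    let size = meta_file.metadata()?.len();
    let payload_file = std::fs::File::open("archive.ppxar")?;
    let accessor = Accessor::new(
        FileReader::new(meta_file),
        size,
        Some(FileReader::new(payload_file)),
    )?;
    // file contents opened below the root are now read from the payload file
    let _root = accessor.open_root_ref()?;
    Ok(())
}
```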

 examples/apxar.rs    |   2 +-
 src/accessor/aio.rs  |  10 ++--
 src/accessor/mod.rs  |  61 ++++++++++++++++++++++---
 src/accessor/sync.rs |   8 ++--
 src/decoder/aio.rs   |  14 ++++--
 src/decoder/mod.rs   | 106 +++++++++++++++++++++++++++++++++++++++----
 src/decoder/sync.rs  |  15 ++++--
 src/lib.rs           |   3 ++
 8 files changed, 184 insertions(+), 35 deletions(-)

diff --git a/examples/apxar.rs b/examples/apxar.rs
index 0c62242..d5eb04e 100644
--- a/examples/apxar.rs
+++ b/examples/apxar.rs
@@ -9,7 +9,7 @@ async fn main() {
         .await
         .expect("failed to open file");
 
-    let mut reader = Decoder::from_tokio(file)
+    let mut reader = Decoder::from_tokio(file, None)
         .await
         .expect("failed to open pxar archive contents");
 
diff --git a/src/accessor/aio.rs b/src/accessor/aio.rs
index 98d7755..0ebb921 100644
--- a/src/accessor/aio.rs
+++ b/src/accessor/aio.rs
@@ -39,7 +39,7 @@ impl<T: FileExt> Accessor<FileReader<T>> {
     /// by a blocking file.
     #[inline]
     pub async fn from_file_and_size(input: T, size: u64) -> io::Result<Self> {
-        Accessor::new(FileReader::new(input), size).await
+        Accessor::new(FileReader::new(input), size, None).await
     }
 }
 
@@ -75,7 +75,7 @@ where
         input: T,
         size: u64,
     ) -> io::Result<Accessor<FileRefReader<T>>> {
-        Accessor::new(FileRefReader::new(input), size).await
+        Accessor::new(FileRefReader::new(input), size, None).await
     }
 }
 
@@ -85,9 +85,11 @@ impl<T: ReadAt> Accessor<T> {
     ///
     /// Note that the `input`'s `SeqRead` implementation must always return `Poll::Ready` and is
     /// not allowed to use the `Waker`, as this will cause a `panic!`.
-    pub async fn new(input: T, size: u64) -> io::Result<Self> {
+    /// Optionally take the file payloads from the provided input stream rather than the regular
+    /// pxar stream.
+    pub async fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
         Ok(Self {
-            inner: accessor::AccessorImpl::new(input, size).await?,
+            inner: accessor::AccessorImpl::new(input, size, payload_input).await?,
         })
     }
 
diff --git a/src/accessor/mod.rs b/src/accessor/mod.rs
index 6a2de73..4789595 100644
--- a/src/accessor/mod.rs
+++ b/src/accessor/mod.rs
@@ -182,10 +182,11 @@ pub(crate) struct AccessorImpl<T> {
     input: T,
     size: u64,
     caches: Arc<Caches>,
+    payload_input: Option<T>,
 }
 
 impl<T: ReadAt> AccessorImpl<T> {
-    pub async fn new(input: T, size: u64) -> io::Result<Self> {
+    pub async fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
         if size < (size_of::<GoodbyeItem>() as u64) {
             io_bail!("too small to contain a pxar archive");
         }
@@ -194,6 +195,7 @@ impl<T: ReadAt> AccessorImpl<T> {
             input,
             size,
             caches: Arc::new(Caches::default()),
+            payload_input,
         })
     }
 
@@ -207,6 +209,9 @@ impl<T: ReadAt> AccessorImpl<T> {
             self.size,
             "/".into(),
             Arc::clone(&self.caches),
+            self.payload_input
+                .as_ref()
+                .map(|input| input as &dyn ReadAt),
         )
         .await
     }
@@ -228,7 +233,13 @@ async fn get_decoder<T: ReadAt>(
     entry_range: Range<u64>,
     path: PathBuf,
 ) -> io::Result<DecoderImpl<SeqReadAtAdapter<T>>> {
-    DecoderImpl::new_full(SeqReadAtAdapter::new(input, entry_range), path, true).await
+    DecoderImpl::new_full(
+        SeqReadAtAdapter::new(input, entry_range.clone()),
+        path,
+        true,
+        None,
+    )
+    .await
 }
 
 // NOTE: This performs the Decoder::read_next_item() behavior! Keep in mind when changing!
@@ -263,6 +274,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
             self.size,
             "/".into(),
             Arc::clone(&self.caches),
+            self.payload_input.clone(),
         )
         .await
     }
@@ -274,6 +286,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
             offset,
             "/".into(),
             Arc::clone(&self.caches),
+            self.payload_input.clone(),
         )
         .await
     }
@@ -293,17 +306,23 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
             .next()
             .await
             .ok_or_else(|| io_format_err!("unexpected EOF while decoding file entry"))??;
+
         Ok(FileEntryImpl {
             input: self.input.clone(),
             entry,
             entry_range_info: entry_range_info.clone(),
             caches: Arc::clone(&self.caches),
+            payload_input: self.payload_input.clone(),
         })
     }
 
     /// Allow opening arbitrary contents from a specific range.
     pub unsafe fn open_contents_at_range(&self, range: Range<u64>) -> FileContentsImpl<T> {
-        FileContentsImpl::new(self.input.clone(), range)
+        if let Some(payload_input) = &self.payload_input {
+            FileContentsImpl::new(payload_input.clone(), range)
+        } else {
+            FileContentsImpl::new(self.input.clone(), range)
+        }
     }
 
     /// Following a hardlink breaks a couple of conventions we otherwise have, particularly we will
@@ -326,9 +345,12 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
 
         let link_offset = entry_file_offset - link_offset;
 
-        let (mut decoder, entry_offset) =
-            get_decoder_at_filename(self.input.clone(), link_offset..self.size, PathBuf::new())
-                .await?;
+        let (mut decoder, entry_offset) = get_decoder_at_filename(
+            self.input.clone(),
+            link_offset..self.size,
+            PathBuf::new(),
+        )
+        .await?;
 
         let entry = decoder
             .next()
@@ -342,6 +364,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
             EntryKind::File {
                 offset: Some(offset),
                 size,
+                ..
             } => {
                 let meta_size = offset - link_offset;
                 let entry_end = link_offset + meta_size + size;
@@ -353,6 +376,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
                         entry_range: entry_offset..entry_end,
                     },
                     caches: Arc::clone(&self.caches),
+                    payload_input: self.payload_input.clone(),
                 })
             }
             _ => io_bail!("hardlink does not point to a regular file"),
@@ -369,6 +393,7 @@ pub(crate) struct DirectoryImpl<T> {
     table: Arc<[GoodbyeItem]>,
     path: PathBuf,
     caches: Arc<Caches>,
+    payload_input: Option<T>,
 }
 
 impl<T: Clone + ReadAt> DirectoryImpl<T> {
@@ -378,6 +403,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
         end_offset: u64,
         path: PathBuf,
         caches: Arc<Caches>,
+        payload_input: Option<T>,
     ) -> io::Result<DirectoryImpl<T>> {
         let tail = Self::read_tail_entry(&input, end_offset).await?;
 
@@ -407,6 +433,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
             table: table.as_ref().map_or_else(|| Arc::new([]), Arc::clone),
             path,
             caches,
+            payload_input,
         };
 
         // sanity check:
@@ -533,6 +560,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
                 entry_range: self.entry_range(),
             },
             caches: Arc::clone(&self.caches),
+            payload_input: self.payload_input.clone(),
         })
     }
 
@@ -685,6 +713,7 @@ pub(crate) struct FileEntryImpl<T: Clone + ReadAt> {
     entry: Entry,
     entry_range_info: EntryRangeInfo,
     caches: Arc<Caches>,
+    payload_input: Option<T>,
 }
 
 impl<T: Clone + ReadAt> FileEntryImpl<T> {
@@ -698,6 +727,7 @@ impl<T: Clone + ReadAt> FileEntryImpl<T> {
             self.entry_range_info.entry_range.end,
             self.entry.path.clone(),
             Arc::clone(&self.caches),
+            self.payload_input.clone(),
         )
         .await
     }
@@ -711,14 +741,30 @@ impl<T: Clone + ReadAt> FileEntryImpl<T> {
             EntryKind::File {
                 size,
                 offset: Some(offset),
+                payload_offset: None,
             } => Ok(Some(offset..(offset + size))),
+            // The payload offset takes precedence over the regular offset, if present
+            EntryKind::File {
+                size,
+                offset: Some(_offset),
+                payload_offset: Some(payload_offset),
+            } => {
+                let start_offset = payload_offset + size_of::<format::Header>() as u64;
+                Ok(Some(start_offset..start_offset + size))
+            }
             _ => Ok(None),
         }
     }
 
     pub async fn contents(&self) -> io::Result<FileContentsImpl<T>> {
         match self.content_range()? {
-            Some(range) => Ok(FileContentsImpl::new(self.input.clone(), range)),
+            Some(range) => {
+                if let Some(ref payload_input) = self.payload_input {
+                    Ok(FileContentsImpl::new(payload_input.clone(), range))
+                } else {
+                    Ok(FileContentsImpl::new(self.input.clone(), range))
+                }
+            }
             None => io_bail!("not a file"),
         }
     }
@@ -808,6 +854,7 @@ impl<'a, T: Clone + ReadAt> DirEntryImpl<'a, T> {
             entry,
             entry_range_info: self.entry_range_info.clone(),
             caches: Arc::clone(&self.caches),
+            payload_input: self.dir.payload_input.clone(),
         })
     }
 
diff --git a/src/accessor/sync.rs b/src/accessor/sync.rs
index a777152..6150a18 100644
--- a/src/accessor/sync.rs
+++ b/src/accessor/sync.rs
@@ -31,7 +31,7 @@ impl<T: FileExt> Accessor<FileReader<T>> {
     /// Decode a `pxar` archive from a standard file implementing `FileExt`.
     #[inline]
     pub fn from_file_and_size(input: T, size: u64) -> io::Result<Self> {
-        Accessor::new(FileReader::new(input), size)
+        Accessor::new(FileReader::new(input), size, None)
     }
 }
 
@@ -64,7 +64,7 @@ where
 {
     /// Open an `Arc` or `Rc` of `File`.
     pub fn from_file_ref_and_size(input: T, size: u64) -> io::Result<Accessor<FileRefReader<T>>> {
-        Accessor::new(FileRefReader::new(input), size)
+        Accessor::new(FileRefReader::new(input), size, None)
     }
 }
 
@@ -74,9 +74,9 @@ impl<T: ReadAt> Accessor<T> {
     ///
     /// Note that the `input`'s `SeqRead` implementation must always return `Poll::Ready` and is
     /// not allowed to use the `Waker`, as this will cause a `panic!`.
-    pub fn new(input: T, size: u64) -> io::Result<Self> {
+    pub fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
         Ok(Self {
-            inner: poll_result_once(accessor::AccessorImpl::new(input, size))?,
+            inner: poll_result_once(accessor::AccessorImpl::new(input, size, payload_input))?,
         })
     }
 
diff --git a/src/decoder/aio.rs b/src/decoder/aio.rs
index 4de8c6f..bb032cf 100644
--- a/src/decoder/aio.rs
+++ b/src/decoder/aio.rs
@@ -20,8 +20,12 @@ pub struct Decoder<T> {
 impl<T: tokio::io::AsyncRead> Decoder<TokioReader<T>> {
     /// Decode a `pxar` archive from a `tokio::io::AsyncRead` input.
     #[inline]
-    pub async fn from_tokio(input: T) -> io::Result<Self> {
-        Decoder::new(TokioReader::new(input)).await
+    pub async fn from_tokio(input: T, payload_input: Option<T>) -> io::Result<Self> {
+        Decoder::new(
+            TokioReader::new(input),
+            payload_input.map(|payload_input| TokioReader::new(payload_input)),
+        )
+        .await
     }
 }
 
@@ -30,15 +34,15 @@ impl Decoder<TokioReader<tokio::fs::File>> {
     /// Decode a `pxar` archive from a `tokio::io::AsyncRead` input.
     #[inline]
     pub async fn open<P: AsRef<Path>>(path: P) -> io::Result<Self> {
-        Decoder::from_tokio(tokio::fs::File::open(path.as_ref()).await?).await
+        Decoder::from_tokio(tokio::fs::File::open(path.as_ref()).await?, None).await
     }
 }
 
 impl<T: SeqRead> Decoder<T> {
     /// Create an async decoder from an input implementing our internal read interface.
-    pub async fn new(input: T) -> io::Result<Self> {
+    pub async fn new(input: T, payload_input: Option<T>) -> io::Result<Self> {
         Ok(Self {
-            inner: decoder::DecoderImpl::new(input).await?,
+            inner: decoder::DecoderImpl::new(input, payload_input).await?,
         })
     }
 
diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index f439327..8cc4877 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -157,6 +157,10 @@ pub(crate) struct DecoderImpl<T> {
     state: State,
     with_goodbye_tables: bool,
 
+    // Payload of regular files might be provided by a different reader
+    payload_input: Option<T>,
+    payload_consumed: u64,
+
     /// The random access code uses decoders for sub-ranges which may not end in a `PAYLOAD` for
     /// entries like FIFOs or sockets, so there we explicitly allow an item to terminate with EOF.
     eof_after_entry: bool,
@@ -167,6 +171,8 @@ enum State {
     Default,
     InPayload {
         offset: u64,
+        size: u64,
+        payload_ref: bool,
     },
 
     /// file entries with no data (fifo, socket)
@@ -195,8 +201,8 @@ pub(crate) enum ItemResult {
 }
 
 impl<I: SeqRead> DecoderImpl<I> {
-    pub async fn new(input: I) -> io::Result<Self> {
-        Self::new_full(input, "/".into(), false).await
+    pub async fn new(input: I, payload_input: Option<I>) -> io::Result<Self> {
+        Self::new_full(input, "/".into(), false, payload_input).await
     }
 
     pub(crate) fn input(&self) -> &I {
@@ -207,6 +213,7 @@ impl<I: SeqRead> DecoderImpl<I> {
         input: I,
         path: PathBuf,
         eof_after_entry: bool,
+        payload_input: Option<I>,
     ) -> io::Result<Self> {
         let this = DecoderImpl {
             input,
@@ -219,6 +226,8 @@ impl<I: SeqRead> DecoderImpl<I> {
             path_lengths: Vec::new(),
             state: State::Begin,
             with_goodbye_tables: false,
+            payload_input,
+            payload_consumed: 0,
             eof_after_entry,
         };
 
@@ -242,9 +251,18 @@ impl<I: SeqRead> DecoderImpl<I> {
                     // hierarchy and parse the next PXAR_FILENAME or the PXAR_GOODBYE:
                     self.read_next_item().await?;
                 }
-                State::InPayload { offset } => {
-                    // We need to skip the current payload first.
-                    self.skip_entry(offset).await?;
+                State::InPayload {
+                    offset,
+                    payload_ref,
+                    ..
+                } => {
+                    if payload_ref {
+                        // Update consumed payload as given by the offset referenced by the content reader
+                        self.payload_consumed += offset;
+                    } else if self.payload_input.is_none() {
+                        // Skip remaining payload of current entry in regular stream
+                        self.skip_entry(offset).await?;
+                    }
                     self.read_next_item().await?;
                 }
                 State::InGoodbyeTable => {
@@ -308,11 +326,19 @@ impl<I: SeqRead> DecoderImpl<I> {
     }
 
     pub fn content_reader(&mut self) -> Option<Contents<I>> {
-        if let State::InPayload { offset } = &mut self.state {
+        if let State::InPayload {
+            offset,
+            size,
+            payload_ref,
+        } = &mut self.state
+        {
+            if *payload_ref && self.payload_input.is_none() {
+                return None;
+            }
             Some(Contents::new(
-                &mut self.input,
+                self.payload_input.as_mut().unwrap_or(&mut self.input),
                 offset,
-                self.current_header.content_size(),
+                *size,
             ))
         } else {
             None
@@ -531,8 +557,60 @@ impl<I: SeqRead> DecoderImpl<I> {
                 self.entry.kind = EntryKind::File {
                     size: self.current_header.content_size(),
                     offset,
+                    payload_offset: None,
+                };
+                self.state = State::InPayload {
+                    offset: 0,
+                    size: self.current_header.content_size(),
+                    payload_ref: false,
+                };
+                return Ok(ItemResult::Entry);
+            }
+            format::PXAR_PAYLOAD_REF => {
+                let offset = seq_read_position(&mut self.input).await.transpose()?;
+                let payload_ref = self.read_payload_ref().await?;
+
+                if let Some(payload_input) = self.payload_input.as_mut() {
+                    if seq_read_position(payload_input)
+                        .await
+                        .transpose()?
+                        .is_none()
+                    {
+                        // Skip payload padding for injected chunks in sequential decoder
+                        let to_skip = payload_ref.offset - self.payload_consumed;
+                        self.skip_payload(to_skip).await?;
+                    }
+                }
+
+                if let Some(payload_input) = self.payload_input.as_mut() {
+                    let header: u64 = seq_read_entry(payload_input).await?;
+                    if header != format::PXAR_PAYLOAD {
+                        io_bail!(
+                            "unexpected header in payload input: expected {} , got {header}",
+                            format::PXAR_PAYLOAD,
+                        );
+                    }
+                    let size: u64 = seq_read_entry(payload_input).await?;
+                    self.payload_consumed += size_of::<Header>() as u64;
+
+                    if size != payload_ref.size + size_of::<Header>() as u64 {
+                        io_bail!(
+                            "encountered payload size mismatch: got {}, expected {size}",
+                            payload_ref.size
+                        );
+                    }
+                }
+
+                self.entry.kind = EntryKind::File {
+                    size: payload_ref.size,
+                    offset,
+                    payload_offset: Some(payload_ref.offset),
+                };
+                self.state = State::InPayload {
+                    offset: 0,
+                    size: payload_ref.size,
+                    payload_ref: true,
                 };
-                self.state = State::InPayload { offset: 0 };
                 return Ok(ItemResult::Entry);
             }
             format::PXAR_FILENAME | format::PXAR_GOODBYE => {
@@ -567,6 +645,16 @@ impl<I: SeqRead> DecoderImpl<I> {
         Self::skip(&mut self.input, len).await
     }
 
+    async fn skip_payload(&mut self, length: u64) -> io::Result<()> {
+        if let Some(payload_input) = self.payload_input.as_mut() {
+            Self::skip(payload_input, length as usize).await?;
+            self.payload_consumed += length;
+        } else {
+            io_bail!("skip payload called, but got no payload input");
+        }
+        Ok(())
+    }
+
     async fn skip(input: &mut I, len: usize) -> io::Result<()> {
         let mut len = len;
         let scratch = scratch_buffer();
diff --git a/src/decoder/sync.rs b/src/decoder/sync.rs
index 5597a03..caa8bcd 100644
--- a/src/decoder/sync.rs
+++ b/src/decoder/sync.rs
@@ -25,8 +25,11 @@ pub struct Decoder<T> {
 impl<T: io::Read> Decoder<StandardReader<T>> {
     /// Decode a `pxar` archive from a regular `std::io::Read` input.
     #[inline]
-    pub fn from_std(input: T) -> io::Result<Self> {
-        Decoder::new(StandardReader::new(input))
+    pub fn from_std(input: T, payload_input: Option<T>) -> io::Result<Self> {
+        Decoder::new(
+            StandardReader::new(input),
+            payload_input.map(|payload_input| StandardReader::new(payload_input)),
+        )
     }
 
     /// Get a direct reference to the reader contained inside the contained [`StandardReader`].
@@ -38,7 +41,7 @@ impl<T: io::Read> Decoder<StandardReader<T>> {
 impl Decoder<StandardReader<std::fs::File>> {
     /// Convenience shortcut for `File::open` followed by `Accessor::from_file`.
     pub fn open<P: AsRef<Path>>(path: P) -> io::Result<Self> {
-        Self::from_std(std::fs::File::open(path.as_ref())?)
+        Self::from_std(std::fs::File::open(path.as_ref())?, None)
     }
 }
 
@@ -47,9 +50,11 @@ impl<T: SeqRead> Decoder<T> {
     ///
     /// Note that the `input`'s `SeqRead` implementation must always return `Poll::Ready` and is
     /// not allowed to use the `Waker`, as this will cause a `panic!`.
-    pub fn new(input: T) -> io::Result<Self> {
+    /// The optional payload input is used to read regular file payloads for payload
+    /// references encountered within the archive.
+    pub fn new(input: T, payload_input: Option<T>) -> io::Result<Self> {
         Ok(Self {
-            inner: poll_result_once(decoder::DecoderImpl::new(input))?,
+            inner: poll_result_once(decoder::DecoderImpl::new(input, payload_input))?,
         })
     }
 
diff --git a/src/lib.rs b/src/lib.rs
index 210c4b1..ef81a85 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -364,6 +364,9 @@ pub enum EntryKind {
 
         /// The file's byte offset inside the archive, if available.
         offset: Option<u64>,
+
+        /// The file's byte offset inside the payload stream, if available.
+        payload_offset: Option<u64>,
     },
 
     /// Directory entry. When iterating through an archive, the contents follow next.
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 08/58] encoder: add payload reference capability
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (6 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 09/58] encoder: add payload position capability Christian Ebner
                   ` (51 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow encoding regular files with a payload reference into a
separate payload archive rather than encoding the payload within the
regular archive.

Following the header marked as PXAR_PAYLOAD_REF, the payload offset
and size are encoded.
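
A minimal usage sketch; the `PayloadOffset` value is assumed to come from
the encoder (see the payload position capability added in a later patch):

```rust
use std::io;
use std::path::Path;
use pxar::encoder::sync::Encoder;
use pxar::encoder::{PayloadOffset, SeqWrite};
use pxar::Metadata;

// Hedged sketch: encode a file as a payload reference instead of inlining
// its contents; the file name is a placeholder.
fn reference_file<T: SeqWrite>(
    encoder: &mut Encoder<'_, T>,
    metadata: &Metadata,
    file_size: u64,
    offset: PayloadOffset,
) -> io::Result<()> {
    encoder.add_payload_ref(metadata, Path::new("file.bin"), file_size, offset)
}
```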

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- Add additional consistency check for current payload position
- Use PayloadRefs `data` for encoding
- Fixed incorrect comments

 src/encoder/aio.rs  | 18 ++++++++++++++-
 src/encoder/mod.rs  | 54 +++++++++++++++++++++++++++++++++++++++++++++
 src/encoder/sync.rs | 21 +++++++++++++++++-
 3 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index 635e550..17ade7d 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -5,7 +5,7 @@ use std::path::Path;
 use std::pin::Pin;
 use std::task::{Context, Poll};
 
-use crate::encoder::{self, LinkOffset, SeqWrite};
+use crate::encoder::{self, LinkOffset, PayloadOffset, SeqWrite};
 use crate::format;
 use crate::Metadata;
 
@@ -103,6 +103,22 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
     //     ).await
     // }
 
+    /// Encode a payload reference pointing to the given offset in the separate payload output.
+    ///
+    /// Returns an error if the encoder instance has no separate payload output or if
+    /// encoding failed.
+    pub async fn add_payload_ref(
+        &mut self,
+        metadata: &Metadata,
+        file_name: &Path,
+        file_size: u64,
+        payload_offset: PayloadOffset,
+    ) -> io::Result<()> {
+        self.inner
+            .add_payload_ref(metadata, file_name.as_ref(), file_size, payload_offset)
+            .await
+    }
+
     /// Create a new subdirectory. Note that the subdirectory has to be finished by calling the
     /// `finish()` method, otherwise the entire archive will be in an error state.
     pub async fn create_directory<P: AsRef<Path>>(
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 31bb0fa..fcd636d 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -38,6 +38,24 @@ impl LinkOffset {
     }
 }
 
+/// Offset into the separate payload output, used to create payload references.
+#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Ord, PartialOrd)]
+pub struct PayloadOffset(u64);
+
+impl PayloadOffset {
+    /// Get the raw byte offset into the payload stream.
+    #[inline]
+    pub fn raw(self) -> u64 {
+        self.0
+    }
+
+    /// Return a new PayloadOffset, shifted forward by `offset`.
+    #[inline]
+    pub fn add(&self, offset: u64) -> Self {
+        Self(self.0 + offset)
+    }
+}
+
 /// Sequential write interface used by the encoder's state machine.
 ///
 /// This is our internal writer trait which is available for `std::io::Write` types in the
@@ -485,6 +503,42 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(offset)
     }
 
+    /// Encode a payload reference pointing to the given offset in the separate payload output.
+    ///
+    /// Returns an error if the encoder instance has no separate payload output or if
+    /// encoding failed.
+    pub async fn add_payload_ref(
+        &mut self,
+        metadata: &Metadata,
+        file_name: &Path,
+        file_size: u64,
+        payload_offset: PayloadOffset,
+    ) -> io::Result<()> {
+        if self.payload_output.as_mut().is_none() {
+            io_bail!("unable to add payload reference");
+        }
+
+        let offset = payload_offset.raw();
+        let payload_position = self.state()?.payload_position();
+        if offset < payload_position {
+            io_bail!("offset smaller than current position: {offset} < {payload_position}");
+        }
+
+        let payload_ref = PayloadRef {
+            offset,
+            size: file_size,
+        };
+        let _this_offset: LinkOffset = self
+            .add_file_entry(
+                Some(metadata),
+                file_name,
+                Some((format::PXAR_PAYLOAD_REF, &payload_ref.data())),
+            )
+            .await?;
+
+        Ok(())
+    }
+
     /// Return a file offset usable with `add_hardlink`.
     pub async fn add_symlink(
         &mut self,
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index d0d62ba..48942b9 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -6,7 +6,7 @@ use std::pin::Pin;
 use std::task::{Context, Poll};
 
 use crate::decoder::sync::StandardReader;
-use crate::encoder::{self, LinkOffset, SeqWrite};
+use crate::encoder::{self, LinkOffset, PayloadOffset, SeqWrite};
 use crate::format;
 use crate::util::poll_result_once;
 use crate::Metadata;
@@ -100,6 +100,25 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         ))
     }
 
+    /// Encode a payload reference pointing to the given offset in the separate payload output.
+    ///
+    /// Returns an error if the encoder instance has no separate payload output or if
+    /// encoding failed.
+    pub async fn add_payload_ref(
+        &mut self,
+        metadata: &Metadata,
+        file_name: &Path,
+        file_size: u64,
+        payload_offset: PayloadOffset,
+    ) -> io::Result<()> {
+        poll_result_once(self.inner.add_payload_ref(
+            metadata,
+            file_name.as_ref(),
+            file_size,
+            payload_offset,
+        ))
+    }
+
     /// Create a new subdirectory. Note that the subdirectory has to be finished by calling the
     /// `finish()` method, otherwise the entire archive will be in an error state.
     pub fn create_directory<P: AsRef<Path>>(
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 09/58] encoder: add payload position capability
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (7 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 08/58] encoder: add payload reference capability Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 10/58] encoder: add payload advance capability Christian Ebner
                   ` (50 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow reading the current offset of the dedicated payload output
stream. This is required to get the current offset for the calculation
of forced boundaries in the proxmox-backup-client when injecting reused
payload chunks into the payload stream.
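
A short sketch of the intended use; the surrounding chunk injection logic
lives in the proxmox-backup-client and is only hinted at here:

```rust
use pxar::encoder::sync::Encoder;
use pxar::encoder::{PayloadOffset, SeqWrite};

// Hedged sketch: query the offset at which the next reused chunk would be
// injected, so the matching payload reference can point exactly there.
fn next_injection_offset<T: SeqWrite>(
    encoder: &Encoder<'_, T>,
) -> std::io::Result<PayloadOffset> {
    encoder.payload_position()
}
```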

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes

 src/encoder/aio.rs  | 5 +++++
 src/encoder/mod.rs  | 4 ++++
 src/encoder/sync.rs | 5 +++++
 3 files changed, 14 insertions(+)

diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index 17ade7d..b9cbe11 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -83,6 +83,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         })
     }
 
+    /// Get the current position in the payload stream.
+    pub fn payload_position(&self) -> io::Result<PayloadOffset> {
+        self.inner.payload_position()
+    }
+
     // /// Convenience shortcut to add a *regular* file by path including its contents to the archive.
     // pub async fn add_file<P, F>(
     //     &mut self,
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index fcd636d..607cea5 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -503,6 +503,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(offset)
     }
 
+    pub fn payload_position(&self) -> io::Result<PayloadOffset> {
+        Ok(PayloadOffset(self.state()?.payload_position()))
+    }
+
     /// Encode a payload reference pointing to the given offset in the separate payload output.
     ///
     /// Returns an error if the encoder instance has no separate payload output or if
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index 48942b9..a63a3cd 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -100,6 +100,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         ))
     }
 
+    /// Get the current position in the payload stream.
+    pub fn payload_position(&self) -> io::Result<PayloadOffset> {
+        self.inner.payload_position()
+    }
+
     /// Encode a payload reference pointing to the given offset in the separate payload output.
     ///
     /// Returns an error if the encoder instance has no separate payload output or if
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 10/58] encoder: add payload advance capability
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (8 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 09/58] encoder: add payload position capability Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 11/58] encoder/format: finish payload stream with marker Christian Ebner
                   ` (49 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow advancing the payload writer position by a given size.
This is used to update the encoder's payload output position when
injecting reused chunks for files with unchanged metadata.
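
A hedged sketch of the bookkeeping this enables; since `PayloadOffset` has
no public constructor, a zero offset shifted by `add()` expresses the size:

```rust
use pxar::encoder::sync::Encoder;
use pxar::encoder::{PayloadOffset, SeqWrite};

// Sketch: after `injected_len` bytes of reused chunks were written to the
// payload output out-of-band, bring the encoder's position back in sync.
fn account_for_injected_chunks<T: SeqWrite>(
    encoder: &mut Encoder<'_, T>,
    injected_len: u64,
) -> std::io::Result<()> {
    encoder.advance(PayloadOffset::default().add(injected_len))
}
```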

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- wrap size in PayloadOffset to have additional type check

 src/encoder/aio.rs  | 5 +++++
 src/encoder/mod.rs  | 6 ++++++
 src/encoder/sync.rs | 5 +++++
 3 files changed, 16 insertions(+)

diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index b9cbe11..6da32bd 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -124,6 +124,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
             .await
     }
 
+    /// Advance the payload stream position by the given size.
+    pub fn advance(&mut self, size: PayloadOffset) -> io::Result<()> {
+        self.inner.advance(size)
+    }
+
     /// Create a new subdirectory. Note that the subdirectory has to be finished by calling the
     /// `finish()` method, otherwise the entire archive will be in an error state.
     pub async fn create_directory<P: AsRef<Path>>(
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 607cea5..6458bc0 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -543,6 +543,12 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(())
     }
 
+    /// Advance the payload stream position by the given size.
+    pub fn advance(&mut self, size: PayloadOffset) -> io::Result<()> {
+        self.state_mut()?.payload_write_position += size.raw();
+        Ok(())
+    }
+
     /// Return a file offset usable with `add_hardlink`.
     pub async fn add_symlink(
         &mut self,
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index a63a3cd..a6e16f4 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -124,6 +124,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         ))
     }
 
+    /// Advance the payload stream position by the given size.
+    pub fn advance(&mut self, size: PayloadOffset) -> io::Result<()> {
+        self.inner.advance(size)
+    }
+
     /// Create a new subdirectory. Note that the subdirectory has to be finished by calling the
     /// `finish()` method, otherwise the entire archive will be in an error state.
     pub fn create_directory<P: AsRef<Path>>(
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 11/58] encoder/format: finish payload stream with marker
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (9 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 10/58] encoder: add payload advance capability Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 12/58] format: add payload stream start marker Christian Ebner
                   ` (48 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Mark the end of the optional payload stream. This makes sure that at
least some bytes are written to the stream (empty archives are not
allowed by the Proxmox Backup Server) and that possibly injected
chunks are consumed.
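
For illustration, a small sketch (not part of the patch) checking that a
payload stream ends with the tail marker, assuming the usual 16 byte
little-endian header layout of [htype: u64, size: u64] with no content:

```rust
fn ends_with_tail_marker(last_16_bytes: &[u8; 16]) -> bool {
    let htype = u64::from_le_bytes(last_16_bytes[0..8].try_into().unwrap());
    htype == pxar::format::PXAR_PAYLOAD_TAIL_MARKER
}
```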

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- add missing Display and size_of impl for this header type

 examples/mk-format-hashes.rs | 5 +++++
 src/encoder/mod.rs           | 8 ++++++++
 src/format/mod.rs            | 4 ++++
 3 files changed, 17 insertions(+)

diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
index 83adb38..de73df0 100644
--- a/examples/mk-format-hashes.rs
+++ b/examples/mk-format-hashes.rs
@@ -56,6 +56,11 @@ const CONSTANTS: &[(&str, &str, &str)] = &[
         "PXAR_GOODBYE_TAIL_MARKER",
         "__PROXMOX_FORMAT_PXAR_GOODBYE_TAIL_MARKER__",
     ),
+    (
+        "The end marker used in the separate payload stream",
+        "PXAR_PAYLOAD_TAIL_MARKER",
+        "__PROXMOX_FORMAT_PXAR_PAYLOAD_TAIL_MARKER__",
+    ),
 ];
 
 fn main() {
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 6458bc0..24dcbc9 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -871,6 +871,14 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         }
 
         if let EncoderOutput::Owned(Some(output)) = &mut self.payload_output {
+            let mut dummy_position = 0; // write position is irrelevant for the tail marker
+            seq_write_pxar_entry(
+                output,
+                format::PXAR_PAYLOAD_TAIL_MARKER,
+                &[],
+                &mut dummy_position,
+            )
+            .await?;
             flush(output).await?;
         }
 
diff --git a/src/format/mod.rs b/src/format/mod.rs
index 1fda535..10192e7 100644
--- a/src/format/mod.rs
+++ b/src/format/mod.rs
@@ -106,6 +106,8 @@ pub const PXAR_PAYLOAD_REF: u64 = 0x419d3d6bc4ba977e;
 pub const PXAR_GOODBYE: u64 = 0x2fec4fa642d5731d;
 /// The end marker used in the GOODBYE object
 pub const PXAR_GOODBYE_TAIL_MARKER: u64 = 0xef5eed5b753e1555;
+/// The end marker used in the separate payload stream
+pub const PXAR_PAYLOAD_TAIL_MARKER: u64 = 0x6c72b78b984c81b5;
 
 #[derive(Debug, Endian)]
 #[repr(C)]
@@ -156,6 +158,7 @@ impl Header {
             PXAR_ENTRY => size_of::<Stat>() as u64,
             PXAR_PAYLOAD | PXAR_GOODBYE => u64::MAX - (size_of::<Self>() as u64),
             PXAR_PAYLOAD_REF => size_of::<PayloadRef>() as u64,
+            PXAR_PAYLOAD_TAIL_MARKER => size_of::<Header>() as u64,
             _ => u64::MAX - (size_of::<Self>() as u64),
         }
     }
@@ -197,6 +200,7 @@ impl Display for Header {
             PXAR_ENTRY => "ENTRY",
             PXAR_PAYLOAD => "PAYLOAD",
             PXAR_PAYLOAD_REF => "PAYLOAD_REF",
+            PXAR_PAYLOAD_TAIL_MARKER => "PAYLOAD_TAIL_MARKER",
             PXAR_GOODBYE => "GOODBYE",
             _ => "UNKNOWN",
         };
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 12/58] format: add payload stream start marker
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (10 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 11/58] encoder/format: finish payload stream with marker Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry Christian Ebner
                   ` (47 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Mark the beginning of the payload stream with a magic number. Allows for
version and file type detection.
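
A sketch of the detection this enables, mirroring the decoder change below
(error handling simplified, helper name made up):

```rust
use std::io::{Error, ErrorKind, Read};

// Hedged sketch: read and validate the start marker of a raw payload file,
// returning the number of bytes consumed (analogous to `payload_consumed`).
fn consume_start_marker(stream: &mut impl Read) -> std::io::Result<u64> {
    let mut header = [0u8; 16];
    stream.read_exact(&mut header)?;
    let htype = u64::from_le_bytes(header[0..8].try_into().unwrap());
    if htype != pxar::format::PXAR_PAYLOAD_START_MARKER {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "unexpected header in payload input",
        ));
    }
    Ok(header.len() as u64)
}
```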

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 examples/mk-format-hashes.rs |  5 +++++
 src/decoder/mod.rs           | 17 +++++++++++++++--
 src/encoder/mod.rs           | 18 +++++++++++-------
 src/format/mod.rs            |  2 ++
 4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
index de73df0..35cff99 100644
--- a/examples/mk-format-hashes.rs
+++ b/examples/mk-format-hashes.rs
@@ -56,6 +56,11 @@ const CONSTANTS: &[(&str, &str, &str)] = &[
         "PXAR_GOODBYE_TAIL_MARKER",
         "__PROXMOX_FORMAT_PXAR_GOODBYE_TAIL_MARKER__",
     ),
+    (
+        "The start marker used in the separate payload stream",
+        "PXAR_PAYLOAD_START_MARKER",
+        "__PROXMOX_FORMAT_PXAR_PAYLOAD_START_MARKER__",
+    ),
     (
         "The end marker used in the separate payload stream",
         "PXAR_PAYLOAD_TAIL_MARKER",
diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index 8cc4877..00d9abf 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -213,8 +213,21 @@ impl<I: SeqRead> DecoderImpl<I> {
         input: I,
         path: PathBuf,
         eof_after_entry: bool,
-        payload_input: Option<I>,
+        mut payload_input: Option<I>,
     ) -> io::Result<Self> {
+        let payload_consumed = if let Some(payload_input) = payload_input.as_mut() {
+            let header: Header = seq_read_entry(payload_input).await?;
+            if header.htype != format::PXAR_PAYLOAD_START_MARKER {
+                io_bail!(
+                    "unexpected header in payload input: expected {:#x?} , got {header:#x?}",
+                    format::PXAR_PAYLOAD_START_MARKER,
+                );
+            }
+            header.full_size()
+        } else {
+            0
+        };
+
         let this = DecoderImpl {
             input,
             current_header: unsafe { mem::zeroed() },
@@ -227,7 +240,7 @@ impl<I: SeqRead> DecoderImpl<I> {
             state: State::Begin,
             with_goodbye_tables: false,
             payload_input,
-            payload_consumed: 0,
+            payload_consumed,
             eof_after_entry,
         };
 
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 24dcbc9..88c0ed5 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -313,15 +313,23 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
     pub async fn new(
         output: EncoderOutput<'a, T>,
         metadata: &Metadata,
-        payload_output: Option<T>,
+        mut payload_output: Option<T>,
     ) -> io::Result<EncoderImpl<'a, T>> {
         if !metadata.is_dir() {
             io_bail!("directory metadata must contain the directory mode flag");
         }
+
+        let mut state = EncoderState::default();
+        if let Some(payload_output) = payload_output.as_mut() {
+            let header = format::Header::with_content_size(format::PXAR_PAYLOAD_START_MARKER, 0);
+            header.check_header_size()?;
+            seq_write_struct(payload_output, header, &mut state.payload_write_position).await?;
+        }
+
         let mut this = Self {
             output,
-            payload_output: EncoderOutput::Owned(None),
-            state: vec![EncoderState::default()],
+            payload_output: EncoderOutput::Owned(payload_output),
+            state: vec![state],
             finished: false,
             file_copy_buffer: Arc::new(Mutex::new(unsafe {
                 crate::util::vec_new_uninitialized(1024 * 1024)
@@ -332,10 +340,6 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         let state = this.state_mut()?;
         state.files_offset = state.position();
 
-        if let Some(payload_output) = payload_output {
-            this.payload_output = EncoderOutput::Owned(Some(payload_output));
-        }
-
         Ok(this)
     }
 
diff --git a/src/format/mod.rs b/src/format/mod.rs
index 10192e7..a672d19 100644
--- a/src/format/mod.rs
+++ b/src/format/mod.rs
@@ -106,6 +106,8 @@ pub const PXAR_PAYLOAD_REF: u64 = 0x419d3d6bc4ba977e;
 pub const PXAR_GOODBYE: u64 = 0x2fec4fa642d5731d;
 /// The end marker used in the GOODBYE object
 pub const PXAR_GOODBYE_TAIL_MARKER: u64 = 0xef5eed5b753e1555;
+/// The start marker used in the separate payload stream
+pub const PXAR_PAYLOAD_START_MARKER: u64 = 0x834c68c2194a4ed2;
 /// The end marker used in the separate payload stream
 pub const PXAR_PAYLOAD_TAIL_MARKER: u64 = 0x6c72b78b984c81b5;
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (11 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 12/58] format: add payload stream start marker Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03 11:41   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params Christian Ebner
                   ` (46 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Adds an additional entry type at the start of each pxar archive
signaling the encoding format version. If not present, the default
version 1 is assumed.

This allows detecting the pxar encoding version early on, so tools can
switch modes or bail out on incompatible encoder/decoder functionality.
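
A sketch of the resulting fallback logic; the helper below is hypothetical,
the decoder performs this internally when reading the first entry:

```rust
use pxar::format::{self, FormatVersion};

// Hedged sketch: an archive whose first header is not PXAR_FORMAT_VERSION
// falls back to format version 1; the entry content is a little-endian u64.
fn detect_version(first_htype: u64, content: &[u8; 8]) -> std::io::Result<FormatVersion> {
    if first_htype != format::PXAR_FORMAT_VERSION {
        return Ok(FormatVersion::Version1);
    }
    match u64::from_le_bytes(*content) {
        1 => Ok(FormatVersion::Version1),
        2 => Ok(FormatVersion::Version2),
        _ => Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "unexpected pxar format version",
        )),
    }
}
```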

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 examples/mk-format-hashes.rs |  5 +++++
 src/decoder/mod.rs           | 29 ++++++++++++++++++++++++++--
 src/encoder/mod.rs           | 37 +++++++++++++++++++++++++++++++++---
 src/format/mod.rs            | 11 +++++++++++
 src/lib.rs                   |  3 +++
 5 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
index 35cff99..e5d69b1 100644
--- a/examples/mk-format-hashes.rs
+++ b/examples/mk-format-hashes.rs
@@ -1,6 +1,11 @@
 use pxar::format::hash_filename;
 
 const CONSTANTS: &[(&str, &str, &str)] = &[
+    (
+        "Pxar format version entry, fallback to version 1 if not present",
+        "PXAR_FORMAT_VERSION",
+        "__PROXMOX_FORMAT_VERSION__",
+    ),
     (
         "Beginning of an entry (current version).",
         "PXAR_ENTRY",
diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index 00d9abf..5b2fafb 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -17,7 +17,7 @@ use std::task::{Context, Poll};
 
 use endian_trait::Endian;
 
-use crate::format::{self, Header};
+use crate::format::{self, FormatVersion, Header};
 use crate::util::{self, io_err_other};
 use crate::{Entry, EntryKind, Metadata};
 
@@ -164,6 +164,8 @@ pub(crate) struct DecoderImpl<T> {
     /// The random access code uses decoders for sub-ranges which may not end in a `PAYLOAD` for
     /// entries like FIFOs or sockets, so there we explicitly allow an item to terminate with EOF.
     eof_after_entry: bool,
+    /// The format version as determined by the format version header
+    version: format::FormatVersion,
 }
 
 enum State {
@@ -242,6 +244,7 @@ impl<I: SeqRead> DecoderImpl<I> {
             payload_input,
             payload_consumed,
             eof_after_entry,
+            version: FormatVersion::default(),
         };
 
         // this.read_next_entry().await?;
@@ -258,7 +261,16 @@ impl<I: SeqRead> DecoderImpl<I> {
         loop {
             match self.state {
                 State::Eof => return Ok(None),
-                State::Begin => return self.read_next_entry().await.map(Some),
+                State::Begin => {
+                    let entry = self.read_next_entry().await.map(Some);
+                    if let Ok(Some(ref entry)) = entry {
+                        if let EntryKind::Version(version) = entry.kind() {
+                            self.version = version.clone();
+                            return self.read_next_entry().await.map(Some);
+                        }
+                    }
+                    return entry;
+                }
                 State::Default => {
                     // we completely finished an entry, so now we're going "up" in the directory
                     // hierarchy and parse the next PXAR_FILENAME or the PXAR_GOODBYE:
@@ -412,6 +424,11 @@ impl<I: SeqRead> DecoderImpl<I> {
             self.entry.metadata = Metadata::default();
             self.entry.kind = EntryKind::Hardlink(self.read_hardlink().await?);
 
+            Ok(Some(self.entry.take()))
+        } else if header.htype == format::PXAR_FORMAT_VERSION {
+            self.current_header = header;
+            self.entry.kind = EntryKind::Version(self.read_format_version().await?);
+
             Ok(Some(self.entry.take()))
         } else if header.htype == format::PXAR_ENTRY || header.htype == format::PXAR_ENTRY_V1 {
             if header.htype == format::PXAR_ENTRY {
@@ -777,6 +794,14 @@ impl<I: SeqRead> DecoderImpl<I> {
 
         seq_read_entry(&mut self.input).await
     }
+
+    async fn read_format_version(&mut self) -> io::Result<format::FormatVersion> {
+        match seq_read_entry(&mut self.input).await? {
+            1u64 => Ok(format::FormatVersion::Version1),
+            2u64 => Ok(format::FormatVersion::Version2),
+            _ => io_bail!("unexpected pxar format version"),
+        }
+    }
 }
 
 /// Reader for file contents inside a pxar archive.
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 88c0ed5..9270153 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -17,7 +17,7 @@ use endian_trait::Endian;
 
 use crate::binary_tree_array;
 use crate::decoder::{self, SeqRead};
-use crate::format::{self, GoodbyeItem, PayloadRef};
+use crate::format::{self, FormatVersion, GoodbyeItem, PayloadRef};
 use crate::Metadata;
 
 pub mod aio;
@@ -307,6 +307,8 @@ pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
     /// Since only the "current" entry can be actively writing files, we share the file copy
     /// buffer.
     file_copy_buffer: Arc<Mutex<Vec<u8>>>,
+    /// Pxar format version to encode
+    version: format::FormatVersion,
 }
 
 impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
@@ -320,11 +322,14 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         }
 
         let mut state = EncoderState::default();
-        if let Some(payload_output) = payload_output.as_mut() {
+        let version = if let Some(payload_output) = payload_output.as_mut() {
             let header = format::Header::with_content_size(format::PXAR_PAYLOAD_START_MARKER, 0);
             header.check_header_size()?;
             seq_write_struct(payload_output, header, &mut state.payload_write_position).await?;
-        }
+            format::FormatVersion::Version2
+        } else {
+            format::FormatVersion::default()
+        };
 
         let mut this = Self {
             output,
@@ -334,8 +339,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
             file_copy_buffer: Arc::new(Mutex::new(unsafe {
                 crate::util::vec_new_uninitialized(1024 * 1024)
             })),
+            version,
         };
 
+        this.encode_format_version().await?;
         this.encode_metadata(metadata).await?;
         let state = this.state_mut()?;
         state.files_offset = state.position();
@@ -522,6 +529,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         file_size: u64,
         payload_offset: PayloadOffset,
     ) -> io::Result<()> {
+        if self.version == FormatVersion::Version1 {
+            io_bail!("payload references not supported pxar format version 1");
+        }
+
         if self.payload_output.as_mut().is_none() {
             io_bail!("unable to add payload reference");
         }
@@ -729,6 +740,26 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(())
     }
 
+    async fn encode_format_version(&mut self) -> io::Result<()> {
+        let version_bytes = match self.version {
+            format::FormatVersion::Version1 => return Ok(()),
+            format::FormatVersion::Version2 => 2u64.to_le_bytes(),
+        };
+
+        let (output, state) = self.output_state()?;
+        if state.write_position != 0 {
+            io_bail!("pxar format version must be encoded at the beginning of an archive");
+        }
+
+        seq_write_pxar_entry(
+            output,
+            format::PXAR_FORMAT_VERSION,
+            &version_bytes,
+            &mut state.write_position,
+        )
+        .await
+    }
+
     async fn encode_metadata(&mut self, metadata: &Metadata) -> io::Result<()> {
         let (output, state) = self.output_state()?;
         seq_write_pxar_struct_entry(
diff --git a/src/format/mod.rs b/src/format/mod.rs
index a672d19..2bf33c9 100644
--- a/src/format/mod.rs
+++ b/src/format/mod.rs
@@ -6,6 +6,7 @@
 //! item data.
 //!
 //! An archive contains items in the following order:
+//!  * `FORMAT_VERSION`     -- (optional for v1), version of encoding format
 //!  * `ENTRY`              -- containing general stat() data and related bits
 //!   * `XATTR`             -- one extended attribute
 //!   * ...                 -- more of these when there are multiple defined
@@ -80,6 +81,8 @@ pub mod mode {
 }
 
 // Generated by `cargo run --example mk-format-hashes`
+/// Pxar format version entry, fallback to version 1 if not present
+pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
 /// Beginning of an entry (current version).
 pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
 /// Previous version of the entry struct
@@ -186,6 +189,7 @@ impl Header {
 impl Display for Header {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         let readable = match self.htype {
+            PXAR_FORMAT_VERSION => "FORMAT_VERSION",
             PXAR_FILENAME => "FILENAME",
             PXAR_SYMLINK => "SYMLINK",
             PXAR_HARDLINK => "HARDLINK",
@@ -551,6 +555,13 @@ impl From<&std::fs::Metadata> for Stat {
     }
 }
 
+#[derive(Clone, Debug, Default, PartialEq)]
+pub enum FormatVersion {
+    #[default]
+    Version1,
+    Version2,
+}
+
 #[derive(Clone, Debug)]
 pub struct Filename {
     pub name: Vec<u8>,
diff --git a/src/lib.rs b/src/lib.rs
index ef81a85..a87b5ac 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -342,6 +342,9 @@ impl Acl {
 /// Identifies whether the entry is a file, symlink, directory, etc.
 #[derive(Clone, Debug)]
 pub enum EntryKind {
+    /// Pxar file format version
+    Version(format::FormatVersion),
+
     /// Symbolic links.
     Symlink(format::Symlink),
 
-- 
2.39.2

* [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (12 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-03 12:01   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 15/58] client: pxar: switch to stack based encoder state Christian Ebner
                   ` (45 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Add an additional entry type PXAR_CLI_PARAMS which is used to store
additional metadata passed via the cli arguments, such as the pxar cli
exclude patterns.

The content is encoded as an arbitrary byte slice. The entry must be
encoded right after the pxar format version entry; it is not possible
to encode it with the previous format version 1.
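
For illustration, a minimal sketch of the new call (variable names and
the parameter value are made up; a payload output must be attached,
since the entry can only be encoded with format version 2):

```rust
// sketch only: the given bytes are stored verbatim as PXAR_CLI_PARAMS,
// directly after the format version entry
let mut encoder = Encoder::new(
    output,                               // any SeqWrite implementation
    &Metadata::dir_builder(0o700).build(),
    Some(payload_output),                 // selects FormatVersion::Version2
    Some(b"--exclude=*.tmp".as_slice()),  // example value
)
.await?;
```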

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 examples/mk-format-hashes.rs |  1 +
 src/accessor/mod.rs          |  9 +++-----
 src/decoder/mod.rs           | 18 +++++++++++++++-
 src/encoder/aio.rs           | 19 ++++++++++++-----
 src/encoder/mod.rs           | 40 +++++++++++++++++++++++++++++-------
 src/encoder/sync.rs          | 11 ++++++++--
 src/format/mod.rs            | 26 +++++++++++++++++++++++
 src/lib.rs                   |  3 +++
 8 files changed, 106 insertions(+), 21 deletions(-)

diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
index e5d69b1..12394f3 100644
--- a/examples/mk-format-hashes.rs
+++ b/examples/mk-format-hashes.rs
@@ -16,6 +16,7 @@ const CONSTANTS: &[(&str, &str, &str)] = &[
         "PXAR_ENTRY_V1",
         "__PROXMOX_FORMAT_ENTRY__",
     ),
+    ("", "PXAR_CLI_PARAMS", "__PROXMOX_FORMAT_CLI_PARAMS__"),
     ("", "PXAR_FILENAME", "__PROXMOX_FORMAT_FILENAME__"),
     ("", "PXAR_SYMLINK", "__PROXMOX_FORMAT_SYMLINK__"),
     ("", "PXAR_DEVICE", "__PROXMOX_FORMAT_DEVICE__"),
diff --git a/src/accessor/mod.rs b/src/accessor/mod.rs
index 4789595..3b6ae44 100644
--- a/src/accessor/mod.rs
+++ b/src/accessor/mod.rs
@@ -345,12 +345,9 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
 
         let link_offset = entry_file_offset - link_offset;
 
-        let (mut decoder, entry_offset) = get_decoder_at_filename(
-            self.input.clone(),
-            link_offset..self.size,
-            PathBuf::new(),
-        )
-        .await?;
+        let (mut decoder, entry_offset) =
+            get_decoder_at_filename(self.input.clone(), link_offset..self.size, PathBuf::new())
+                .await?;
 
         let entry = decoder
             .next()
diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
index 5b2fafb..4170b2f 100644
--- a/src/decoder/mod.rs
+++ b/src/decoder/mod.rs
@@ -266,7 +266,13 @@ impl<I: SeqRead> DecoderImpl<I> {
                     if let Ok(Some(ref entry)) = entry {
                         if let EntryKind::Version(version) = entry.kind() {
                             self.version = version.clone();
-                            return self.read_next_entry().await.map(Some);
+                            let entry = self.read_next_entry().await.map(Some);
+                            if let Ok(Some(ref entry)) = entry {
+                                if let EntryKind::CliParams(_) = entry.kind() {
+                                    return self.read_next_entry().await.map(Some);
+                                }
+                            }
+                            return entry;
                         }
                     }
                     return entry;
@@ -429,6 +435,11 @@ impl<I: SeqRead> DecoderImpl<I> {
             self.current_header = header;
             self.entry.kind = EntryKind::Version(self.read_format_version().await?);
 
+            Ok(Some(self.entry.take()))
+        } else if header.htype == format::PXAR_CLI_PARAMS {
+            self.current_header = header;
+            self.entry.kind = EntryKind::CliParams(self.read_cli_params().await?);
+
             Ok(Some(self.entry.take()))
         } else if header.htype == format::PXAR_ENTRY || header.htype == format::PXAR_ENTRY_V1 {
             if header.htype == format::PXAR_ENTRY {
@@ -802,6 +813,11 @@ impl<I: SeqRead> DecoderImpl<I> {
             _ => io_bail!("unexpected pxar format version"),
         }
     }
+
+    async fn read_cli_params(&mut self) -> io::Result<format::CliParams> {
+        let data = self.read_entry_as_bytes().await?;
+        Ok(format::CliParams { data })
+    }
 }
 
 /// Reader for file contents inside a pxar archive.
diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
index 6da32bd..956b2a3 100644
--- a/src/encoder/aio.rs
+++ b/src/encoder/aio.rs
@@ -25,11 +25,13 @@ impl<'a, T: tokio::io::AsyncWrite + 'a> Encoder<'a, TokioWriter<T>> {
         output: T,
         metadata: &Metadata,
         payload_output: Option<T>,
+        cli_params: Option<&[u8]>,
     ) -> io::Result<Encoder<'a, TokioWriter<T>>> {
         Encoder::new(
             TokioWriter::new(output),
             metadata,
             payload_output.map(|payload_output| TokioWriter::new(payload_output)),
+            cli_params,
         )
         .await
     }
@@ -46,6 +48,7 @@ impl<'a> Encoder<'a, TokioWriter<tokio::fs::File>> {
             TokioWriter::new(tokio::fs::File::create(path.as_ref()).await?),
             metadata,
             None,
+            None,
         )
         .await
     }
@@ -57,9 +60,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
         output: T,
         metadata: &Metadata,
         payload_output: Option<T>,
+        cli_params: Option<&[u8]>,
     ) -> io::Result<Encoder<'a, T>> {
         Ok(Self {
-            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output).await?,
+            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output, cli_params)
+                .await?,
         })
     }
 
@@ -331,10 +336,14 @@ mod test {
     /// Assert that `Encoder` is `Send`
     fn send_test() {
         let test = async {
-            let mut encoder =
-                Encoder::new(DummyOutput, &Metadata::dir_builder(0o700).build(), None)
-                    .await
-                    .unwrap();
+            let mut encoder = Encoder::new(
+                DummyOutput,
+                &Metadata::dir_builder(0o700).build(),
+                None,
+                None,
+            )
+            .await
+            .unwrap();
             {
                 encoder
                     .create_directory("baba", &Metadata::dir_builder(0o700).build())
diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index 9270153..b0ec877 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -316,6 +316,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         output: EncoderOutput<'a, T>,
         metadata: &Metadata,
         mut payload_output: Option<T>,
+        cli_params: Option<&[u8]>,
     ) -> io::Result<EncoderImpl<'a, T>> {
         if !metadata.is_dir() {
             io_bail!("directory metadata must contain the directory mode flag");
@@ -343,6 +344,9 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         };
 
         this.encode_format_version().await?;
+        if let Some(params) = cli_params {
+            this.encode_cli_params(params).await?;
+        }
         this.encode_metadata(metadata).await?;
         let state = this.state_mut()?;
         state.files_offset = state.position();
@@ -740,16 +744,38 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         Ok(())
     }
 
+    async fn encode_cli_params(&mut self, params: &[u8]) -> io::Result<()> {
+        if self.version == FormatVersion::Version1 {
+            io_bail!("encoding cli params not supported pxar format version 1");
+        }
+
+        let (output, state) = self.output_state()?;
+        if state.write_position != (size_of::<u64>() + size_of::<format::Header>()) as u64 {
+            io_bail!(
+                "cli params must be encoded following the version header, current position {}",
+                state.write_position,
+            );
+        }
+
+        seq_write_pxar_entry(
+            output,
+            format::PXAR_CLI_PARAMS,
+            params,
+            &mut state.write_position,
+        )
+        .await
+    }
+
     async fn encode_format_version(&mut self) -> io::Result<()> {
-		let version_bytes = match self.version {
-			format::FormatVersion::Version1 => return Ok(()),
-			format::FormatVersion::Version2 => 2u64.to_le_bytes(),
-		};
+        let version_bytes = match self.version {
+            format::FormatVersion::Version1 => return Ok(()),
+            format::FormatVersion::Version2 => 2u64.to_le_bytes(),
+        };
 
         let (output, state) = self.output_state()?;
-		if state.write_position != 0 {
-			io_bail!("pxar format version must be encoded at the beginning of an archive");
-		}
+        if state.write_position != 0 {
+            io_bail!("pxar format version must be encoded at the beginning of an archive");
+        }
 
         seq_write_pxar_entry(
             output,
diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
index a6e16f4..3f706c1 100644
--- a/src/encoder/sync.rs
+++ b/src/encoder/sync.rs
@@ -28,7 +28,7 @@ impl<'a, T: io::Write + 'a> Encoder<'a, StandardWriter<T>> {
     /// Encode a `pxar` archive into a regular `std::io::Write` output.
     #[inline]
     pub fn from_std(output: T, metadata: &Metadata) -> io::Result<Encoder<'a, StandardWriter<T>>> {
-        Encoder::new(StandardWriter::new(output), metadata, None)
+        Encoder::new(StandardWriter::new(output), metadata, None, None)
     }
 }
 
@@ -42,6 +42,7 @@ impl<'a> Encoder<'a, StandardWriter<std::fs::File>> {
             StandardWriter::new(std::fs::File::create(path.as_ref())?),
             metadata,
             None,
+            None,
         )
     }
 }
@@ -53,12 +54,18 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
     /// not allowed to use the `Waker`, as this will cause a `panic!`.
     // Optionally attach a dedicated writer to redirect the payloads of regular files to a separate
     // output.
-    pub fn new(output: T, metadata: &Metadata, payload_output: Option<T>) -> io::Result<Self> {
+    pub fn new(
+        output: T,
+        metadata: &Metadata,
+        payload_output: Option<T>,
+        cli_params: Option<&[u8]>,
+    ) -> io::Result<Self> {
         Ok(Self {
             inner: poll_result_once(encoder::EncoderImpl::new(
                 output.into(),
                 metadata,
                 payload_output,
+                cli_params,
             ))?,
         })
     }
diff --git a/src/format/mod.rs b/src/format/mod.rs
index 2bf33c9..82ef196 100644
--- a/src/format/mod.rs
+++ b/src/format/mod.rs
@@ -87,6 +87,7 @@ pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
 pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
 /// Previous version of the entry struct
 pub const PXAR_ENTRY_V1: u64 = 0x11da850a1c1cceff;
+pub const PXAR_CLI_PARAMS: u64 = 0xcf58b7dd627f604a;
 pub const PXAR_FILENAME: u64 = 0x16701121063917b3;
 pub const PXAR_SYMLINK: u64 = 0x27f971e7dbf5dc5f;
 pub const PXAR_DEVICE: u64 = 0x9fc9e906586d5ce9;
@@ -147,6 +148,7 @@ impl Header {
     #[inline]
     pub fn max_content_size(&self) -> u64 {
         match self.htype {
+            PXAR_CLI_PARAMS => u64::MAX - (size_of::<Self>() as u64),
             // + null-termination
             PXAR_FILENAME => crate::util::MAX_FILENAME_LEN + 1,
             // + null-termination
@@ -190,6 +192,7 @@ impl Display for Header {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         let readable = match self.htype {
             PXAR_FORMAT_VERSION => "FORMAT_VERSION",
+            PXAR_CLI_PARAMS => "CLI_PARAMS",
             PXAR_FILENAME => "FILENAME",
             PXAR_SYMLINK => "SYMLINK",
             PXAR_HARDLINK => "HARDLINK",
@@ -694,6 +697,29 @@ impl Device {
     }
 }
 
+#[derive(Clone, Debug)]
+pub struct CliParams {
+    pub data: Vec<u8>,
+}
+
+impl CliParams {
+    pub fn as_os_str(&self) -> &OsStr {
+        self.as_ref()
+    }
+}
+
+impl AsRef<[u8]> for CliParams {
+    fn as_ref(&self) -> &[u8] {
+        &self.data
+    }
+}
+
+impl AsRef<OsStr> for CliParams {
+    fn as_ref(&self) -> &OsStr {
+        OsStr::from_bytes(&self.data[..self.data.len().max(1) - 1])
+    }
+}
+
 #[cfg(all(test, target_os = "linux"))]
 #[test]
 fn test_linux_devices() {
diff --git a/src/lib.rs b/src/lib.rs
index a87b5ac..cc85759 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -345,6 +345,9 @@ pub enum EntryKind {
     /// Pxar file format version
     Version(format::FormatVersion),
 
+    /// Cli parameter.
+    CliParams(format::CliParams),
+
     /// Symbolic links.
     Symlink(format::Symlink),
 
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 15/58] client: pxar: switch to stack based encoder state
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (13 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client Christian Ebner
                   ` (44 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In preparation for look-ahead caching, where passing around different
encoder instances with internal references would not be feasible.

Previously, a new encoder instance was generated for each directory
level, reducing possible implementation errors. These encoder
instances were internally linked by references to keep track of the
state changes in a parent-child relationship.

This is however not feasible when the encoder has to be passed by
mutable reference, as is the case for the look-ahead cache
implementation. The encoder has therefore been adapted to use a
single object implementation with an internal stack keeping track of
the state.

Depends on the updated pxar library version.
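
The changed call pattern, sketched from the call sites in the diff
below (comments added for illustration):

```rust
// before: each directory level re-bound a nested child encoder
// let mut encoder = encoder.create_directory(dir_name, metadata).await?;

// after: one encoder object, state tracked on an internal stack
encoder.create_directory(dir_name, metadata).await?;   // push state
self.archive_dir_contents(encoder, dir, false).await?; // same &mut encoder
// at the archive root:
encoder.finish().await?; // finish the current (root) level
encoder.close().await?;  // finalize the encoder
```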

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- pass optional parameter to constructor for encoder/decoder/accessor
- finalize encoder via close method call

 pbs-client/src/pxar/create.rs             | 8 +++++---
 src/api2/admin/datastore.rs               | 2 +-
 src/api2/tape/restore.rs                  | 4 ++--
 src/tape/file_formats/snapshot_archive.rs | 3 ++-
 4 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 60efb0ce5..c9bf6df85 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -170,7 +170,7 @@ where
         set.insert(stat.st_dev);
     }
 
-    let mut encoder = Encoder::new(&mut writer, &metadata).await?;
+    let mut encoder = Encoder::new(&mut writer, &metadata, None).await?;
 
     let mut patterns = options.patterns;
 
@@ -203,6 +203,8 @@ where
         .archive_dir_contents(&mut encoder, source_dir, true)
         .await?;
     encoder.finish().await?;
+    encoder.close().await?;
+
     Ok(())
 }
 
@@ -663,7 +665,7 @@ impl Archiver {
     ) -> Result<(), Error> {
         let dir_name = OsStr::from_bytes(dir_name.to_bytes());
 
-        let mut encoder = encoder.create_directory(dir_name, metadata).await?;
+        encoder.create_directory(dir_name, metadata).await?;
 
         let old_fs_magic = self.fs_magic;
         let old_fs_feature_flags = self.fs_feature_flags;
@@ -686,7 +688,7 @@ impl Archiver {
             log::info!("skipping mount point: {:?}", self.path);
             Ok(())
         } else {
-            self.archive_dir_contents(&mut encoder, dir, false).await
+            self.archive_dir_contents(encoder, dir, false).await
         };
 
         self.fs_magic = old_fs_magic;
diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index f7164b877..10e3185b6 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -1713,7 +1713,7 @@ pub fn pxar_file_download(
         let archive_size = reader.archive_size();
         let reader = LocalDynamicReadAt::new(reader);
 
-        let decoder = Accessor::new(reader, archive_size).await?;
+        let decoder = Accessor::new(reader, archive_size, None).await?;
         let root = decoder.open_root().await?;
         let path = OsStr::from_bytes(file_path).to_os_string();
         let file = root
diff --git a/src/api2/tape/restore.rs b/src/api2/tape/restore.rs
index 8273c867a..50ea8da1c 100644
--- a/src/api2/tape/restore.rs
+++ b/src/api2/tape/restore.rs
@@ -1066,7 +1066,7 @@ fn restore_snapshots_to_tmpdir(
                     "File {file_num}: snapshot archive {source_datastore}:{snapshot}",
                 );
 
-                let mut decoder = pxar::decoder::sync::Decoder::from_std(reader)?;
+                let mut decoder = pxar::decoder::sync::Decoder::from_std(reader, None)?;
 
                 let target_datastore = match store_map.target_store(&source_datastore) {
                     Some(datastore) => datastore,
@@ -1677,7 +1677,7 @@ fn restore_snapshot_archive<'a>(
     reader: Box<dyn 'a + TapeRead>,
     snapshot_path: &Path,
 ) -> Result<bool, Error> {
-    let mut decoder = pxar::decoder::sync::Decoder::from_std(reader)?;
+    let mut decoder = pxar::decoder::sync::Decoder::from_std(reader, None)?;
     match try_restore_snapshot_archive(worker, &mut decoder, snapshot_path) {
         Ok(_) => Ok(true),
         Err(err) => {
diff --git a/src/tape/file_formats/snapshot_archive.rs b/src/tape/file_formats/snapshot_archive.rs
index 252384b50..43d1cf9c3 100644
--- a/src/tape/file_formats/snapshot_archive.rs
+++ b/src/tape/file_formats/snapshot_archive.rs
@@ -59,7 +59,7 @@ pub fn tape_write_snapshot_archive<'a>(
         }
 
         let mut encoder =
-            pxar::encoder::sync::Encoder::new(PxarTapeWriter::new(writer), &root_metadata)?;
+            pxar::encoder::sync::Encoder::new(PxarTapeWriter::new(writer), &root_metadata, None)?;
 
         for filename in file_list.iter() {
             let mut file = snapshot_reader.open_file(filename).map_err(|err| {
@@ -89,6 +89,7 @@ pub fn tape_write_snapshot_archive<'a>(
             }
         }
         encoder.finish()?;
+        encoder.close()?;
         Ok(())
     });
 
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (14 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 15/58] client: pxar: switch to stack based encoder state Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-08  9:04   ` [pbs-devel] applied: " Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 17/58] client: backup: factor out extension from backup target Christian Ebner
                   ` (43 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Instead of taking ownership of the http client when starting a new
BackupWriter instance, only borrow the client.

This allows the http client to be reused to later also start a
BackupReader instance, as required for backup runs with metadata
based file change detection mode, where both must use the same http
client.
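
A short sketch of the resulting usage; the reader part is only hinted
at here, as it lands in a later patch:

```rust
let http_client = connect_rate_limited(&repo, rate_limit)?;
// the writer now only borrows the client (remaining arguments elided) ...
let client = BackupWriter::start(&http_client, crypt_config.clone(), /* ... */).await?;
// ... leaving http_client available to also start a BackupReader later
```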

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 examples/upload-speed.rs               | 2 +-
 pbs-client/src/backup_writer.rs        | 2 +-
 proxmox-backup-client/src/benchmark.rs | 2 +-
 proxmox-backup-client/src/main.rs      | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/examples/upload-speed.rs b/examples/upload-speed.rs
index f9fc52a85..e4b570ec5 100644
--- a/examples/upload-speed.rs
+++ b/examples/upload-speed.rs
@@ -18,7 +18,7 @@ async fn upload_speed() -> Result<f64, Error> {
     let backup_time = proxmox_time::epoch_i64();
 
     let client = BackupWriter::start(
-        client,
+        &client,
         None,
         datastore,
         &BackupNamespace::root(),
diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
index 8a03d8ea6..8bd0e4f36 100644
--- a/pbs-client/src/backup_writer.rs
+++ b/pbs-client/src/backup_writer.rs
@@ -78,7 +78,7 @@ impl BackupWriter {
     // FIXME: extract into (flattened) parameter struct?
     #[allow(clippy::too_many_arguments)]
     pub async fn start(
-        client: HttpClient,
+        client: &HttpClient,
         crypt_config: Option<Arc<CryptConfig>>,
         datastore: &str,
         ns: &BackupNamespace,
diff --git a/proxmox-backup-client/src/benchmark.rs b/proxmox-backup-client/src/benchmark.rs
index b3047308c..1262fb46d 100644
--- a/proxmox-backup-client/src/benchmark.rs
+++ b/proxmox-backup-client/src/benchmark.rs
@@ -229,7 +229,7 @@ async fn test_upload_speed(
 
     log::debug!("Connecting to backup server");
     let client = BackupWriter::start(
-        client,
+        &client,
         crypt_config.clone(),
         repo.store(),
         &BackupNamespace::root(),
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 546275cb1..148708976 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -834,7 +834,7 @@ async fn create_backup(
 
     let backup_time = backup_time_opt.unwrap_or_else(epoch_i64);
 
-    let client = connect_rate_limited(&repo, rate_limit)?;
+    let http_client = connect_rate_limited(&repo, rate_limit)?;
     record_repository(&repo);
 
     let snapshot = BackupDir::from((backup_type, backup_id.to_owned(), backup_time));
@@ -886,7 +886,7 @@ async fn create_backup(
     };
 
     let client = BackupWriter::start(
-        client,
+        &http_client,
         crypt_config.clone(),
         repo.store(),
         &backup_ns,
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 17/58] client: backup: factor out extension from backup target
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (15 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type Christian Ebner
                   ` (42 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Instead of composing the backup target name and pushing it to the
backup list, push the archive name and extension separately, only
constructing it while iterating the list later.

This keeps it possible to additionally prefix the extension, as
required for the separate pxar metadata and payload indexes.
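
Sketch of the assembly step while iterating (as in the diff below):

```rust
for (backup_type, filename, target_base, extension, size) in upload_list {
    // assembled late, so a later patch can splice in e.g. "mpxar"/"ppxar"
    let target = format!("{target_base}.{extension}"); // e.g. "root.pxar" + "didx"
    // ...
}
```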

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes

 proxmox-backup-client/src/main.rs | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 148708976..74adf1b16 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -785,7 +785,8 @@ async fn create_backup(
                 upload_list.push((
                     BackupSpecificationType::PXAR,
                     filename.to_owned(),
-                    format!("{}.didx", target),
+                    target.to_owned(),
+                    "didx",
                     0,
                 ));
             }
@@ -803,7 +804,8 @@ async fn create_backup(
                 upload_list.push((
                     BackupSpecificationType::IMAGE,
                     filename.to_owned(),
-                    format!("{}.fidx", target),
+                    target.to_owned(),
+                    "fidx",
                     size,
                 ));
             }
@@ -814,7 +816,8 @@ async fn create_backup(
                 upload_list.push((
                     BackupSpecificationType::CONFIG,
                     filename.to_owned(),
-                    format!("{}.blob", target),
+                    target.to_owned(),
+                    "blob",
                     metadata.len(),
                 ));
             }
@@ -825,7 +828,8 @@ async fn create_backup(
                 upload_list.push((
                     BackupSpecificationType::LOGFILE,
                     filename.to_owned(),
-                    format!("{}.blob", target),
+                    target.to_owned(),
+                    "blob",
                     metadata.len(),
                 ));
             }
@@ -944,7 +948,8 @@ async fn create_backup(
         log::info!("{} {} '{}' to '{}' as {}", what, desc, file, repo, target);
     };
 
-    for (backup_type, filename, target, size) in upload_list {
+    for (backup_type, filename, target_base, extension, size) in upload_list {
+        let target = format!("{target_base}.{extension}");
         match (backup_type, dry_run) {
             // dry-run
             (BackupSpecificationType::CONFIG, true) => log_file("config file", &filename, &target),
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (16 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 17/58] client: backup: factor out extension from backup target Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-08  9:05   ` [pbs-devel] applied: " Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 19/58] client: pxar: combine writer params into struct Christian Ebner
                   ` (41 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Early return when the check fails, avoiding construction of unused
object instances.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- move to top of the function as intended

 proxmox-backup-client/src/main.rs | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 74adf1b16..931c841c7 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -192,6 +192,10 @@ async fn backup_directory<P: AsRef<Path>>(
     pxar_create_options: pbs_client::pxar::PxarCreateOptions,
     upload_options: UploadOptions,
 ) -> Result<BackupStats, Error> {
+    if upload_options.fixed_size.is_some() {
+        bail!("cannot backup directory with fixed chunk size!");
+    }
+
     let pxar_stream = PxarBackupStream::open(dir_path.as_ref(), catalog, pxar_create_options)?;
     let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size);
 
@@ -206,9 +210,6 @@ async fn backup_directory<P: AsRef<Path>>(
         }
     });
 
-    if upload_options.fixed_size.is_some() {
-        bail!("cannot backup directory with fixed chunk size!");
-    }
 
     let stats = client
         .upload_stream(archive_name, stream, upload_options)
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 19/58] client: pxar: combine writer params into struct
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (17 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 20/58] client: backup: split payload to dedicated stream Christian Ebner
                   ` (40 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Combine all writer-related parameters for the pxar archive creation
into a dedicated struct to limit the number of function parameters.

This is in preparation for also adding the payload writer as a further
optional member of the struct, reducing the method's call parameters.
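
Call sites then bundle the writer and the optional catalog, as
sketched here:

```rust
// before: create_archive(dir, writer, flags, callback, Some(catalog), options)
create_archive(
    dir,
    PxarWriters::new(writer, Some(catalog)),
    Flags::DEFAULT,
    |_| Ok(()),
    options,
)
.await?;
```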

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/pxar/create.rs                  | 18 ++++++++++++++----
 pbs-client/src/pxar/mod.rs                     |  2 +-
 pbs-client/src/pxar_backup_stream.rs           |  5 +++--
 .../src/proxmox_restore_daemon/api.rs          | 14 +++++++++++---
 pxar-bin/src/main.rs                           |  6 +++---
 tests/catar.rs                                 |  3 +--
 6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index c9bf6df85..82f05889b 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -135,12 +135,22 @@ struct Archiver {
 
 type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
 
+pub struct PxarWriters<T> {
+    writer: T,
+    catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>,
+}
+
+impl<T> PxarWriters<T> {
+    pub fn new(writer: T, catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>) -> Self {
+        Self { writer, catalog }
+    }
+}
+
 pub async fn create_archive<T, F>(
     source_dir: Dir,
-    mut writer: T,
+    mut writers: PxarWriters<T>,
     feature_flags: Flags,
     callback: F,
-    catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>,
     options: PxarCreateOptions,
 ) -> Result<(), Error>
 where
@@ -170,7 +180,7 @@ where
         set.insert(stat.st_dev);
     }
 
-    let mut encoder = Encoder::new(&mut writer, &metadata, None).await?;
+    let mut encoder = Encoder::new(&mut writers.writer, &metadata, None).await?;
 
     let mut patterns = options.patterns;
 
@@ -188,7 +198,7 @@ where
         fs_magic,
         callback: Box::new(callback),
         patterns,
-        catalog,
+        catalog: writers.catalog,
         path: PathBuf::new(),
         entry_counter: 0,
         entry_limit: options.entries_max,
diff --git a/pbs-client/src/pxar/mod.rs b/pbs-client/src/pxar/mod.rs
index 14674b9b9..b7dcf8362 100644
--- a/pbs-client/src/pxar/mod.rs
+++ b/pbs-client/src/pxar/mod.rs
@@ -56,7 +56,7 @@ pub(crate) mod tools;
 mod flags;
 pub use flags::Flags;
 
-pub use create::{create_archive, PxarCreateOptions};
+pub use create::{create_archive, PxarCreateOptions, PxarWriters};
 pub use extract::{
     create_tar, create_zip, extract_archive, extract_sub_dir, extract_sub_dir_seq, ErrorHandler,
     OverwriteFlags, PxarExtractContext, PxarExtractOptions,
diff --git a/pbs-client/src/pxar_backup_stream.rs b/pbs-client/src/pxar_backup_stream.rs
index 22a6ffdc2..bfa108a8b 100644
--- a/pbs-client/src/pxar_backup_stream.rs
+++ b/pbs-client/src/pxar_backup_stream.rs
@@ -17,6 +17,8 @@ use proxmox_io::StdChannelWriter;
 
 use pbs_datastore::catalog::CatalogWriter;
 
+use crate::pxar::create::PxarWriters;
+
 /// Stream implementation to encode and upload .pxar archives.
 ///
 /// The hyper client needs an async Stream for file upload, so we
@@ -56,13 +58,12 @@ impl PxarBackupStream {
             let writer = pxar::encoder::sync::StandardWriter::new(writer);
             if let Err(err) = crate::pxar::create_archive(
                 dir,
-                writer,
+                PxarWriters::new(writer, Some(catalog)),
                 crate::pxar::Flags::DEFAULT,
                 move |path| {
                     log::debug!("{:?}", path);
                     Ok(())
                 },
-                Some(catalog),
                 options,
             )
             .await
diff --git a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
index c20552225..4e63978b7 100644
--- a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
+++ b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
@@ -23,7 +23,9 @@ use proxmox_sortable_macro::sortable;
 use proxmox_sys::fs::read_subdir;
 
 use pbs_api_types::file_restore::{FileRestoreFormat, RestoreDaemonStatus};
-use pbs_client::pxar::{create_archive, Flags, PxarCreateOptions, ENCODER_MAX_ENTRIES};
+use pbs_client::pxar::{
+    create_archive, Flags, PxarCreateOptions, PxarWriters, ENCODER_MAX_ENTRIES,
+};
 use pbs_datastore::catalog::{ArchiveEntry, DirEntryAttribute};
 use pbs_tools::json::required_string_param;
 
@@ -356,8 +358,14 @@ fn extract(
                     };
 
                     let pxar_writer = TokioWriter::new(writer);
-                    create_archive(dir, pxar_writer, Flags::DEFAULT, |_| Ok(()), None, options)
-                        .await
+                    create_archive(
+                        dir,
+                        PxarWriters::new(pxar_writer, None),
+                        Flags::DEFAULT,
+                        |_| Ok(()),
+                        options,
+                    )
+                    .await
                 }
                 .await;
                 if let Err(err) = result {
diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 2bbe90e34..7083a4b82 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -13,7 +13,8 @@ use tokio::signal::unix::{signal, SignalKind};
 
 use pathpatterns::{MatchEntry, MatchType, PatternFlag};
 use pbs_client::pxar::{
-    format_single_line_entry, Flags, OverwriteFlags, PxarExtractOptions, ENCODER_MAX_ENTRIES,
+    format_single_line_entry, Flags, OverwriteFlags, PxarExtractOptions, PxarWriters,
+    ENCODER_MAX_ENTRIES,
 };
 
 use proxmox_router::cli::*;
@@ -376,13 +377,12 @@ async fn create_archive(
     let writer = pxar::encoder::sync::StandardWriter::new(writer);
     pbs_client::pxar::create_archive(
         dir,
-        writer,
+        PxarWriters::new(writer, None),
         feature_flags,
         move |path| {
             log::debug!("{:?}", path);
             Ok(())
         },
-        None,
         options,
     )
     .await?;
diff --git a/tests/catar.rs b/tests/catar.rs
index 36bb4f3bc..f414da8c9 100644
--- a/tests/catar.rs
+++ b/tests/catar.rs
@@ -35,10 +35,9 @@ fn run_test(dir_name: &str) -> Result<(), Error> {
     let rt = tokio::runtime::Runtime::new().unwrap();
     rt.block_on(create_archive(
         dir,
-        writer,
+        PxarWriters::new(writer, None),
         Flags::DEFAULT,
         |_| Ok(()),
-        None,
         options,
     ))?;
 
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 20/58] client: backup: split payload to dedicated stream
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (18 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 19/58] client: pxar: combine writer params into struct Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 21/58] client: helper: add helpers for creating reader instances Christian Ebner
                   ` (39 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

This patch is in preparation for being able to quickly look up
metadata of previous snapshots, by splitting the upload of
a pxar archive into two dedicated streams: one for metadata,
assigned a .mpxar.didx suffix, and one for payload data,
assigned a .ppxar.didx suffix.

The patch constructs the duplicate chunk stream, backup writer
and upload stream instances required for the split archive
uploads.

This not only makes it possible to reuse the payload chunks for
further backup runs, but also keeps the metadata archive small,
with the outlook of even making the currently used catalog
obsolete.
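
The resulting naming scheme, sketched for an assumed target base of
`root` with extension `didx`:

```rust
// metadata stream -> "root.mpxar.didx", payload stream -> "root.ppxar.didx"
let (target, payload_target) = (
    format!("{target_base}.mpxar.{extension}"),
    Some(format!("{target_base}.ppxar.{extension}")),
);
```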

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- pass optional combined writer struct
- s/backup_stream_payload/backup_payload_stream/
- use new mpxar and ppxar file extensions

 pbs-client/src/pxar/create.rs                 | 20 ++++-
 pbs-client/src/pxar_backup_stream.rs          | 49 ++++++++---
 proxmox-backup-client/src/main.rs             | 81 +++++++++++++++++--
 .../src/proxmox_restore_daemon/api.rs         |  2 +-
 pxar-bin/src/main.rs                          |  2 +-
 tests/catar.rs                                |  2 +-
 6 files changed, 129 insertions(+), 27 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 82f05889b..2bb5a6253 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -137,12 +137,21 @@ type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
 
 pub struct PxarWriters<T> {
     writer: T,
+    payload_writer: Option<T>,
     catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>,
 }
 
 impl<T> PxarWriters<T> {
-    pub fn new(writer: T, catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>) -> Self {
-        Self { writer, catalog }
+    pub fn new(
+        writer: T,
+        payload_writer: Option<T>,
+        catalog: Option<Arc<Mutex<dyn BackupCatalogWriter + Send>>>,
+    ) -> Self {
+        Self {
+            writer,
+            payload_writer,
+            catalog,
+        }
     }
 }
 
@@ -180,7 +189,12 @@ where
         set.insert(stat.st_dev);
     }
 
-    let mut encoder = Encoder::new(&mut writers.writer, &metadata, None).await?;
+    let mut encoder = Encoder::new(
+        &mut writers.writer,
+        &metadata,
+        writers.payload_writer.as_mut(),
+    )
+    .await?;
 
     let mut patterns = options.patterns;
 
diff --git a/pbs-client/src/pxar_backup_stream.rs b/pbs-client/src/pxar_backup_stream.rs
index bfa108a8b..95145cb0d 100644
--- a/pbs-client/src/pxar_backup_stream.rs
+++ b/pbs-client/src/pxar_backup_stream.rs
@@ -42,23 +42,37 @@ impl PxarBackupStream {
         dir: Dir,
         catalog: Arc<Mutex<CatalogWriter<W>>>,
         options: crate::pxar::PxarCreateOptions,
-    ) -> Result<Self, Error> {
-        let (tx, rx) = std::sync::mpsc::sync_channel(10);
-
+        separate_payload_stream: bool,
+    ) -> Result<(Self, Option<Self>), Error> {
         let buffer_size = 256 * 1024;
 
-        let error = Arc::new(Mutex::new(None));
-        let error2 = Arc::clone(&error);
-        let handler = async move {
-            let writer = TokioWriterAdapter::new(std::io::BufWriter::with_capacity(
+        let (tx, rx) = std::sync::mpsc::sync_channel(10);
+        let writer = TokioWriterAdapter::new(std::io::BufWriter::with_capacity(
+            buffer_size,
+            StdChannelWriter::new(tx),
+        ));
+        let writer = pxar::encoder::sync::StandardWriter::new(writer);
+
+        let (payload_writer, payload_rx) = if separate_payload_stream {
+            let (tx, rx) = std::sync::mpsc::sync_channel(10);
+            let payload_writer = TokioWriterAdapter::new(std::io::BufWriter::with_capacity(
                 buffer_size,
                 StdChannelWriter::new(tx),
             ));
+            (
+                Some(pxar::encoder::sync::StandardWriter::new(payload_writer)),
+                Some(rx),
+            )
+        } else {
+            (None, None)
+        };
 
-            let writer = pxar::encoder::sync::StandardWriter::new(writer);
+        let error = Arc::new(Mutex::new(None));
+        let error2 = Arc::clone(&error);
+        let handler = async move {
             if let Err(err) = crate::pxar::create_archive(
                 dir,
-                PxarWriters::new(writer, Some(catalog)),
+                PxarWriters::new(writer, payload_writer, Some(catalog)),
                 crate::pxar::Flags::DEFAULT,
                 move |path| {
                     log::debug!("{:?}", path);
@@ -77,21 +91,30 @@ impl PxarBackupStream {
         let future = Abortable::new(handler, registration);
         tokio::spawn(future);
 
-        Ok(Self {
+        let backup_stream = Self {
+            rx: Some(rx),
+            handle: Some(handle.clone()),
+            error: Arc::clone(&error),
+        };
+
+        let backup_payload_stream = payload_rx.map(|rx| Self {
             rx: Some(rx),
             handle: Some(handle),
             error,
-        })
+        });
+
+        Ok((backup_stream, backup_payload_stream))
     }
 
     pub fn open<W: Write + Send + 'static>(
         dirname: &Path,
         catalog: Arc<Mutex<CatalogWriter<W>>>,
         options: crate::pxar::PxarCreateOptions,
-    ) -> Result<Self, Error> {
+        separate_payload_stream: bool,
+    ) -> Result<(Self, Option<Self>), Error> {
         let dir = nix::dir::Dir::open(dirname, OFlag::O_DIRECTORY, Mode::empty())?;
 
-        Self::new(dir, catalog, options)
+        Self::new(dir, catalog, options, separate_payload_stream)
     }
 }
 
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 931c841c7..dc6fe0e8d 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -187,18 +187,24 @@ async fn backup_directory<P: AsRef<Path>>(
     client: &BackupWriter,
     dir_path: P,
     archive_name: &str,
+    payload_target: Option<&str>,
     chunk_size: Option<usize>,
     catalog: Arc<Mutex<CatalogWriter<TokioWriterAdapter<StdChannelWriter<Error>>>>>,
     pxar_create_options: pbs_client::pxar::PxarCreateOptions,
     upload_options: UploadOptions,
-) -> Result<BackupStats, Error> {
+) -> Result<(BackupStats, Option<BackupStats>), Error> {
     if upload_options.fixed_size.is_some() {
         bail!("cannot backup directory with fixed chunk size!");
     }
 
-    let pxar_stream = PxarBackupStream::open(dir_path.as_ref(), catalog, pxar_create_options)?;
-    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size);
+    let (pxar_stream, payload_stream) = PxarBackupStream::open(
+        dir_path.as_ref(),
+        catalog,
+        pxar_create_options,
+        payload_target.is_some(),
+    )?;
 
+    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size);
     let (tx, rx) = mpsc::channel(10); // allow to buffer 10 chunks
 
     let stream = ReceiverStream::new(rx).map_err(Error::from);
@@ -210,12 +216,43 @@ async fn backup_directory<P: AsRef<Path>>(
         }
     });
 
+    let stats = client.upload_stream(archive_name, stream, upload_options.clone());
 
-    let stats = client
-        .upload_stream(archive_name, stream, upload_options)
-        .await?;
+    if let Some(payload_stream) = payload_stream {
+        let payload_target = payload_target
+            .ok_or_else(|| format_err!("got payload stream, but no target archive name"))?;
 
-    Ok(stats)
+        let mut payload_chunk_stream = ChunkStream::new(
+            payload_stream,
+            chunk_size,
+        );
+        let (payload_tx, payload_rx) = mpsc::channel(10); // allow to buffer 10 chunks
+        let stream = ReceiverStream::new(payload_rx).map_err(Error::from);
+
+        // spawn payload chunker inside a separate task so that it can run parallel
+        tokio::spawn(async move {
+            while let Some(v) = payload_chunk_stream.next().await {
+                let _ = payload_tx.send(v).await;
+            }
+        });
+
+        let payload_stats = client.upload_stream(
+            &payload_target,
+            stream,
+            upload_options,
+        );
+
+        match futures::join!(stats, payload_stats) {
+            (Ok(stats), Ok(payload_stats)) => Ok((stats, Some(payload_stats))),
+            (Err(err), Ok(_)) => Err(format_err!("upload failed: {err}")),
+            (Ok(_), Err(err)) => Err(format_err!("upload failed: {err}")),
+            (Err(err), Err(payload_err)) => {
+                Err(format_err!("upload failed: {err} - {payload_err}"))
+            }
+        }
+    } else {
+        Ok((stats.await?, None))
+    }
 }
 
 async fn backup_image<P: AsRef<Path>>(
@@ -986,6 +1023,23 @@ async fn create_backup(
                 manifest.add_file(target, stats.size, stats.csum, crypto.mode)?;
             }
             (BackupSpecificationType::PXAR, false) => {
+                let metadata_mode = false; // Until enabled via param
+
+                let target_base = if let Some(base) = target_base.strip_suffix(".pxar") {
+                    base.to_string()
+                } else {
+                    bail!("unexpected suffix in target: {target_base}");
+                };
+
+                let (target, payload_target) = if metadata_mode {
+                    (
+                        format!("{target_base}.mpxar.{extension}"),
+                        Some(format!("{target_base}.ppxar.{extension}")),
+                    )
+                } else {
+                    (target, None)
+                };
+
                 // start catalog upload on first use
                 if catalog.is_none() {
                     let catalog_upload_res =
@@ -1016,16 +1070,27 @@ async fn create_backup(
                     ..UploadOptions::default()
                 };
 
-                let stats = backup_directory(
+                let (stats, payload_stats) = backup_directory(
                     &client,
                     &filename,
                     &target,
+                    payload_target.as_deref(),
                     chunk_size_opt,
                     catalog.clone(),
                     pxar_options,
                     upload_options,
                 )
                 .await?;
+
+                if let Some(payload_stats) = payload_stats {
+                    manifest.add_file(
+                        payload_target
+                            .ok_or_else(|| format_err!("missing payload target archive"))?,
+                        payload_stats.size,
+                        payload_stats.csum,
+                        crypto.mode,
+                    )?;
+                }
                 manifest.add_file(target, stats.size, stats.csum, crypto.mode)?;
                 catalog.lock().unwrap().end_directory()?;
             }
diff --git a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
index 4e63978b7..ea97976e6 100644
--- a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
+++ b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
@@ -360,7 +360,7 @@ fn extract(
                     let pxar_writer = TokioWriter::new(writer);
                     create_archive(
                         dir,
-                        PxarWriters::new(pxar_writer, None),
+                        PxarWriters::new(pxar_writer, None, None),
                         Flags::DEFAULT,
                         |_| Ok(()),
                         options,
diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 7083a4b82..6c13c3b17 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -377,7 +377,7 @@ async fn create_archive(
     let writer = pxar::encoder::sync::StandardWriter::new(writer);
     pbs_client::pxar::create_archive(
         dir,
-        PxarWriters::new(writer, None),
+        PxarWriters::new(writer, None, None),
         feature_flags,
         move |path| {
             log::debug!("{:?}", path);
diff --git a/tests/catar.rs b/tests/catar.rs
index f414da8c9..9e96a8610 100644
--- a/tests/catar.rs
+++ b/tests/catar.rs
@@ -35,7 +35,7 @@ fn run_test(dir_name: &str) -> Result<(), Error> {
     let rt = tokio::runtime::Runtime::new().unwrap();
     rt.block_on(create_archive(
         dir,
-        PxarWriters::new(writer, None),
+        PxarWriters::new(writer, None, None),
         Flags::DEFAULT,
         |_| Ok(()),
         options,
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 21/58] client: helper: add helpers for creating reader instances
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (19 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 20/58] client: backup: split payload to dedicated stream Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 22/58] client: helper: add method for split archive name mapping Christian Ebner
                   ` (38 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Add a module for helper methods which need to be used in different
submodules of the client.

Add `get_pxar_fuse_reader` and `get_buffered_pxar_reader` to create
reader instances to access pxar archives.
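
Intended usage, sketched for a caller in a client submodule (the
surrounding variables are assumed to exist as in main.rs):

```rust
let (reader, archive_size) =
    helper::get_pxar_fuse_reader(&archive_name, client.clone(), &manifest, crypt_config.clone())
        .await?;
```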

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 proxmox-backup-client/src/helper.rs | 43 +++++++++++++++++++++++++++++
 proxmox-backup-client/src/main.rs   |  2 ++
 2 files changed, 45 insertions(+)
 create mode 100644 proxmox-backup-client/src/helper.rs

diff --git a/proxmox-backup-client/src/helper.rs b/proxmox-backup-client/src/helper.rs
new file mode 100644
index 000000000..cb58db2a6
--- /dev/null
+++ b/proxmox-backup-client/src/helper.rs
@@ -0,0 +1,43 @@
+use std::sync::Arc;
+
+use anyhow::Error;
+use pbs_client::{BackupReader, RemoteChunkReader};
+use pbs_datastore::BackupManifest;
+use pbs_tools::crypt_config::CryptConfig;
+
+use crate::{BufferedDynamicReadAt, BufferedDynamicReader, IndexFile};
+
+pub(crate) async fn get_pxar_fuse_reader(
+    archive_name: &str,
+    client: Arc<BackupReader>,
+    manifest: &BackupManifest,
+    crypt_config: Option<Arc<CryptConfig>>,
+) -> Result<(pbs_pxar_fuse::Reader, u64), Error> {
+    let reader = get_buffered_pxar_reader(archive_name, client, manifest, crypt_config).await?;
+    let archive_size = reader.archive_size();
+    let reader: pbs_pxar_fuse::Reader = Arc::new(BufferedDynamicReadAt::new(reader));
+
+    Ok((reader, archive_size))
+}
+
+pub(crate) async fn get_buffered_pxar_reader(
+    archive_name: &str,
+    client: Arc<BackupReader>,
+    manifest: &BackupManifest,
+    crypt_config: Option<Arc<CryptConfig>>,
+) -> Result<BufferedDynamicReader<RemoteChunkReader>, Error> {
+    let index = client
+        .download_dynamic_index(&manifest, &archive_name)
+        .await?;
+
+    let most_used = index.find_most_used_chunks(8);
+    let file_info = manifest.lookup_file_info(&archive_name)?;
+    let chunk_reader = RemoteChunkReader::new(
+        client.clone(),
+        crypt_config.clone(),
+        file_info.chunk_crypt_mode(),
+        most_used,
+    );
+
+    Ok(BufferedDynamicReader::new(index, chunk_reader))
+}
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index dc6fe0e8d..975f6fdf8 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -72,6 +72,8 @@ mod catalog;
 pub use catalog::*;
 mod snapshot;
 pub use snapshot::*;
+mod helper;
+pub(crate) use helper::*;
 pub mod key;
 pub mod namespace;
 
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 22/58] client: helper: add method for split archive name mapping
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (20 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 21/58] client: helper: add helpers for creating reader instances Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 23/58] client: restore: read payload from dedicated index Christian Ebner
                   ` (37 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Add a helper method that takes either the metadata or the payload
archive name as input and maps it to the correct archive names for
both the metadata and the payload archive.

If neither is matched, fall back to returning the passed in archive
name as the target archive and `None` for the payload archive name.
Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 proxmox-backup-client/src/helper.rs | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/proxmox-backup-client/src/helper.rs b/proxmox-backup-client/src/helper.rs
index cb58db2a6..864bdd737 100644
--- a/proxmox-backup-client/src/helper.rs
+++ b/proxmox-backup-client/src/helper.rs
@@ -41,3 +41,24 @@ pub(crate) async fn get_buffered_pxar_reader(
 
     Ok(BufferedDynamicReader::new(index, chunk_reader))
 }
+
+pub(crate) fn get_pxar_archive_names(archive_name: &str) -> (String, Option<String>) {
+    if let Some(base) = archive_name
+        .strip_suffix(".mpxar.didx")
+        .or_else(|| archive_name.strip_suffix(".ppxar.didx"))
+    {
+        return (
+            format!("{base}.mpxar.didx"),
+            Some(format!("{base}.ppxar.didx")),
+        );
+    }
+
+    if let Some(base) = archive_name
+        .strip_suffix(".mpxar")
+        .or_else(|| archive_name.strip_suffix(".ppxar"))
+    {
+        return (format!("{base}.mpxar"), Some(format!("{base}.ppxar")));
+    }
+
+    (archive_name.to_owned(), None)
+}
-- 
2.39.2

* [pbs-devel] [PATCH v3 proxmox-backup 23/58] client: restore: read payload from dedicated index
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (21 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 22/58] client: helper: add method for split archive name mapping Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives Christian Ebner
                   ` (36 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Whenever a split pxar archive is encountered, instantiate and attach
the required dedicated reader instance to the decoder instance on
restore.

Piping the output to stdout is not possible in this case, as it would
require a decoder instance which can decode the input stream while
maintaining the pxar stream format as output.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use newly introduced helpers for archive name
- use helper to get pxar reader
- refactoring

 pbs-pxar-fuse/src/lib.rs             |  2 +-
 proxmox-backup-client/src/main.rs    | 39 ++++++++++++++++++----------
 proxmox-file-restore/src/main.rs     |  4 +--
 pxar-bin/src/main.rs                 |  2 +-
 src/bin/proxmox_backup_debug/diff.rs |  2 +-
 5 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/pbs-pxar-fuse/src/lib.rs b/pbs-pxar-fuse/src/lib.rs
index bf196b6c4..dff7aac31 100644
--- a/pbs-pxar-fuse/src/lib.rs
+++ b/pbs-pxar-fuse/src/lib.rs
@@ -66,7 +66,7 @@ impl Session {
         let file = std::fs::File::open(archive_path)?;
         let file_size = file.metadata()?.len();
         let reader: Reader = Arc::new(accessor::sync::FileReader::new(file));
-        let accessor = Accessor::new(reader, file_size).await?;
+        let accessor = Accessor::new(reader, file_size, None).await?;
         Self::mount(accessor, options, verbose, mountpoint)
     }
 
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 975f6fdf8..294b52ddb 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -1223,7 +1223,7 @@ async fn dump_image<W: Write>(
 fn parse_archive_type(name: &str) -> (String, ArchiveType) {
     if name.ends_with(".didx") || name.ends_with(".fidx") || name.ends_with(".blob") {
         (name.into(), archive_type(name).unwrap())
-    } else if name.ends_with(".pxar") {
+    } else if name.ends_with(".pxar") || name.ends_with(".mpxar") || name.ends_with(".ppxar") {
         (format!("{}.didx", name), ArchiveType::DynamicIndex)
     } else if name.ends_with(".img") {
         (format!("{}.fidx", name), ArchiveType::FixedIndex)
@@ -1457,20 +1457,15 @@ async fn restore(
                 .map_err(|err| format_err!("unable to pipe data - {}", err))?;
         }
     } else if archive_type == ArchiveType::DynamicIndex {
-        let index = client
-            .download_dynamic_index(&manifest, &archive_name)
-            .await?;
+        let (archive_name, payload_archive_name) = helper::get_pxar_archive_names(&archive_name);
 
-        let most_used = index.find_most_used_chunks(8);
-
-        let chunk_reader = RemoteChunkReader::new(
+        let mut reader = get_buffered_pxar_reader(
+            &archive_name,
             client.clone(),
-            crypt_config,
-            file_info.chunk_crypt_mode(),
-            most_used,
-        );
-
-        let mut reader = BufferedDynamicReader::new(index, chunk_reader);
+            &manifest,
+            crypt_config.clone(),
+        )
+        .await?;
 
         let on_error = if ignore_extract_device_errors {
             let handler: PxarErrorHandler = Box::new(move |err: Error| {
@@ -1525,8 +1520,21 @@ async fn restore(
         }
 
         if let Some(target) = target {
+            let decoder = if let Some(payload_archive_name) = payload_archive_name {
+                let payload_reader = get_buffered_pxar_reader(
+                    &payload_archive_name,
+                    client.clone(),
+                    &manifest,
+                    crypt_config.clone(),
+                )
+                .await?;
+                pxar::decoder::Decoder::from_std(reader, Some(payload_reader))?
+            } else {
+                pxar::decoder::Decoder::from_std(reader, None)?
+            };
+
             pbs_client::pxar::extract_archive(
-                pxar::decoder::Decoder::from_std(reader)?,
+                decoder,
                 Path::new(target),
                 feature_flags,
                 |path| {
@@ -1536,6 +1544,9 @@ async fn restore(
             )
             .map_err(|err| format_err!("error extracting archive - {:#}", err))?;
         } else {
+            if archive_name.ends_with(".mpxar.didx") || archive_name.ends_with(".ppxar.didx") {
+                bail!("unable to pipe split archive");
+            }
             let mut writer = std::fs::OpenOptions::new()
                 .write(true)
                 .open("/dev/stdout")
diff --git a/proxmox-file-restore/src/main.rs b/proxmox-file-restore/src/main.rs
index 50875a636..dbab69942 100644
--- a/proxmox-file-restore/src/main.rs
+++ b/proxmox-file-restore/src/main.rs
@@ -457,7 +457,7 @@ async fn extract(
 
             let archive_size = reader.archive_size();
             let reader = LocalDynamicReadAt::new(reader);
-            let decoder = Accessor::new(reader, archive_size).await?;
+            let decoder = Accessor::new(reader, archive_size, None).await?;
             extract_to_target(decoder, &path, target, format, zstd).await?;
         }
         ExtractPath::VM(file, path) => {
@@ -483,7 +483,7 @@ async fn extract(
                     false,
                 )
                 .await?;
-                let decoder = Decoder::from_tokio(reader).await?;
+                let decoder = Decoder::from_tokio(reader, None).await?;
                 extract_sub_dir_seq(&target, decoder).await?;
 
                 // we extracted a .pxarexclude-cli file auto-generated by the VM when encoding the
diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 6c13c3b17..34944cf16 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -27,7 +27,7 @@ fn extract_archive_from_reader<R: std::io::Read>(
     options: PxarExtractOptions,
 ) -> Result<(), Error> {
     pbs_client::pxar::extract_archive(
-        pxar::decoder::Decoder::from_std(reader)?,
+        pxar::decoder::Decoder::from_std(reader, None)?,
         Path::new(target),
         feature_flags,
         |path| {
diff --git a/src/bin/proxmox_backup_debug/diff.rs b/src/bin/proxmox_backup_debug/diff.rs
index 5b68941a4..140c573c1 100644
--- a/src/bin/proxmox_backup_debug/diff.rs
+++ b/src/bin/proxmox_backup_debug/diff.rs
@@ -277,7 +277,7 @@ async fn open_dynamic_index(
     let reader = BufferedDynamicReader::new(index, chunk_reader);
     let archive_size = reader.archive_size();
     let reader: Arc<dyn ReadAt + Send + Sync> = Arc::new(LocalDynamicReadAt::new(reader));
-    let accessor = Accessor::new(reader, archive_size).await?;
+    let accessor = Accessor::new(reader, archive_size, None).await?;
 
     Ok((lookup_index, accessor))
 }
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (22 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 23/58] client: restore: read payload from dedicated index Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04  9:01   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: " Christian Ebner
                   ` (35 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use mpxar and ppxar file extensions

 pbs-client/src/tools/mod.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
index 1b0123a39..08986fc5e 100644
--- a/pbs-client/src/tools/mod.rs
+++ b/pbs-client/src/tools/mod.rs
@@ -337,7 +337,7 @@ pub fn complete_pxar_archive_name(arg: &str, param: &HashMap<String, String>) ->
     complete_server_file_name(arg, param)
         .iter()
         .filter_map(|name| {
-            if name.ends_with(".pxar.didx") {
+            if name.ends_with(".pxar.didx") || name.ends_with(".pxar.meta.didx") {
                 Some(pbs_tools::format::strip_server_file_extension(name).to_owned())
             } else {
                 None
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: cover meta extension for pxar archives
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (23 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04  9:02   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable Christian Ebner
                   ` (34 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use mpxar and ppxar file extensions
- merged with file restore patch into one

 pbs-client/src/tools/mod.rs      |  5 ++++-
 proxmox-file-restore/src/main.rs | 16 +++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
index 08986fc5e..f8d3102d1 100644
--- a/pbs-client/src/tools/mod.rs
+++ b/pbs-client/src/tools/mod.rs
@@ -337,7 +337,10 @@ pub fn complete_pxar_archive_name(arg: &str, param: &HashMap<String, String>) ->
     complete_server_file_name(arg, param)
         .iter()
         .filter_map(|name| {
-            if name.ends_with(".pxar.didx") || name.ends_with(".pxar.meta.didx") {
+            if name.ends_with(".pxar.didx")
+                || name.ends_with(".mpxar.didx")
+                || name.ends_with(".ppxar.didx")
+            {
                 Some(pbs_tools::format::strip_server_file_extension(name).to_owned())
             } else {
                 None
diff --git a/proxmox-file-restore/src/main.rs b/proxmox-file-restore/src/main.rs
index dbab69942..685ce34d9 100644
--- a/proxmox-file-restore/src/main.rs
+++ b/proxmox-file-restore/src/main.rs
@@ -75,7 +75,10 @@ fn parse_path(path: String, base64: bool) -> Result<ExtractPath, Error> {
         (file, path)
     };
 
-    if file.ends_with(".pxar.didx") {
+    if file.ends_with(".pxar.didx")
+        || file.ends_with(".mpxar.didx")
+        || file.ends_with(".ppxar.didx")
+    {
         Ok(ExtractPath::Pxar(file, path))
     } else if file.ends_with(".img.fidx") {
         Ok(ExtractPath::VM(file, path))
@@ -123,11 +126,18 @@ async fn list_files(
         ExtractPath::ListArchives => {
             let mut entries = vec![];
             for file in manifest.files() {
-                if !file.filename.ends_with(".pxar.didx") && !file.filename.ends_with(".img.fidx") {
+                if !file.filename.ends_with(".pxar.didx")
+                    && !file.filename.ends_with(".img.fidx")
+                    && !file.filename.ends_with(".mpxar.didx")
+                    && !file.filename.ends_with(".ppxar.didx")
+                {
                     continue;
                 }
                 let path = format!("/{}", file.filename);
-                let attr = if file.filename.ends_with(".pxar.didx") {
+                let attr = if file.filename.ends_with(".pxar.didx")
+                    || file.filename.ends_with(".mpxar.didx")
+                    || file.filename.ends_with(".ppxar.didx")
+                {
                     // a pxar file is a file archive, so it's root is also a directory root
                     Some(&DirEntryAttribute::Directory { start: 0 })
                 } else {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (24 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: " Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04  9:43   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 27/58] api: datastore: refactor getting local chunk reader Christian Ebner
                   ` (33 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Cover the cases where the pxar archive was uploaded as split payload
data and metadata streams. Instantiate the required reader and
decoder instances to access the metadata and payload data archives.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use mpxar and ppxar file extensions
- use helpers for archive names and fuse reader

 proxmox-backup-client/src/mount.rs | 54 ++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/proxmox-backup-client/src/mount.rs b/proxmox-backup-client/src/mount.rs
index 4a2f83357..7bcd581be 100644
--- a/proxmox-backup-client/src/mount.rs
+++ b/proxmox-backup-client/src/mount.rs
@@ -21,17 +21,16 @@ use pbs_api_types::BackupNamespace;
 use pbs_client::tools::key_source::get_encryption_key_password;
 use pbs_client::{BackupReader, RemoteChunkReader};
 use pbs_datastore::cached_chunk_reader::CachedChunkReader;
-use pbs_datastore::dynamic_index::BufferedDynamicReader;
 use pbs_datastore::index::IndexFile;
 use pbs_key_config::load_and_decrypt_key;
 use pbs_tools::crypt_config::CryptConfig;
 use pbs_tools::json::required_string_param;
 
+use crate::helper;
 use crate::{
     complete_group_or_snapshot, complete_img_archive_name, complete_namespace,
     complete_pxar_archive_name, complete_repository, connect, dir_or_last_from_group,
-    extract_repository_from_value, optional_ns_param, record_repository, BufferedDynamicReadAt,
-    REPO_URL_SCHEMA,
+    extract_repository_from_value, optional_ns_param, record_repository, REPO_URL_SCHEMA,
 };
 
 #[sortable]
@@ -219,7 +218,10 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
         }
     };
 
-    let server_archive_name = if archive_name.ends_with(".pxar") {
+    let server_archive_name = if archive_name.ends_with(".pxar")
+        || archive_name.ends_with(".mpxar")
+        || archive_name.ends_with(".ppxar")
+    {
         if target.is_none() {
             bail!("use the 'mount' command to mount pxar archives");
         }
@@ -246,6 +248,16 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
     let (manifest, _) = client.download_manifest().await?;
     manifest.check_fingerprint(crypt_config.as_ref().map(Arc::as_ref))?;
 
+    let base_name = server_archive_name
+        .strip_suffix(".mpxar.didx")
+        .or_else(|| server_archive_name.strip_suffix(".ppxar.didx"));
+
+    let server_archive_name = if let Some(base) = base_name {
+        format!("{base}.mpxar.didx")
+    } else {
+        server_archive_name.to_owned()
+    };
+
     let file_info = manifest.lookup_file_info(&server_archive_name)?;
 
     let daemonize = || -> Result<(), Error> {
@@ -283,20 +295,28 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
         futures::future::select(interrupt_int.recv().boxed(), interrupt_term.recv().boxed());
 
     if server_archive_name.ends_with(".didx") {
-        let index = client
-            .download_dynamic_index(&manifest, &server_archive_name)
-            .await?;
-        let most_used = index.find_most_used_chunks(8);
-        let chunk_reader = RemoteChunkReader::new(
+        let (archive_name, payload_archive_name) =
+            helper::get_pxar_archive_names(&server_archive_name);
+        let (reader, archive_size) = helper::get_pxar_fuse_reader(
+            &archive_name,
             client.clone(),
-            crypt_config,
-            file_info.chunk_crypt_mode(),
-            most_used,
-        );
-        let reader = BufferedDynamicReader::new(index, chunk_reader);
-        let archive_size = reader.archive_size();
-        let reader: pbs_pxar_fuse::Reader = Arc::new(BufferedDynamicReadAt::new(reader));
-        let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
+            &manifest,
+            crypt_config.clone(),
+        )
+        .await?;
+
+        let decoder = if let Some(payload_archive_name) = payload_archive_name {
+            let (payload_reader, _) = helper::get_pxar_fuse_reader(
+                &payload_archive_name,
+                client.clone(),
+                &manifest,
+                crypt_config.clone(),
+            )
+            .await?;
+            pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
+        } else {
+            pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
+        };
 
         let session =
             pbs_pxar_fuse::Session::mount(decoder, options, false, Path::new(target.unwrap()))
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 27/58] api: datastore: refactor getting local chunk reader
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (25 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 28/58] api: datastore: attach optional payload " Christian Ebner
                   ` (32 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Move the code to get the local chunk reader to a dedicated function
to make it reusable. The same code is required to get the local chunk
reader for the payload stream for split stream archives.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes

 src/api2/admin/datastore.rs | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index 10e3185b6..555ab88ae 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -1654,6 +1654,29 @@ pub const API_METHOD_PXAR_FILE_DOWNLOAD: ApiMethod = ApiMethod::new(
     &Permission::Anybody,
 );
 
+fn get_local_pxar_reader(
+    datastore: Arc<DataStore>,
+    manifest: &BackupManifest,
+    backup_dir: &BackupDir,
+    pxar_name: &str,
+) -> Result<(LocalDynamicReadAt<LocalChunkReader>, u64), Error> {
+    let mut path = datastore.base_path();
+    path.push(backup_dir.relative_path());
+    path.push(pxar_name);
+
+    let index = DynamicIndexReader::open(&path)
+        .map_err(|err| format_err!("unable to read dynamic index '{:?}' - {}", &path, err))?;
+
+    let (csum, size) = index.compute_csum();
+    manifest.verify_file(pxar_name, &csum, size)?;
+
+    let chunk_reader = LocalChunkReader::new(datastore, None, CryptMode::None);
+    let reader = BufferedDynamicReader::new(index, chunk_reader);
+    let archive_size = reader.archive_size();
+
+    Ok((LocalDynamicReadAt::new(reader), archive_size))
+}
+
 pub fn pxar_file_download(
     _parts: Parts,
     _req_body: Body,
@@ -1698,20 +1721,10 @@ pub fn pxar_file_download(
             }
         }
 
-        let mut path = datastore.base_path();
-        path.push(backup_dir.relative_path());
-        path.push(pxar_name);
+        let (reader, archive_size) =
+            get_local_pxar_reader(datastore.clone(), &manifest, &backup_dir, pxar_name)?;
 
-        let index = DynamicIndexReader::open(&path)
-            .map_err(|err| format_err!("unable to read dynamic index '{:?}' - {}", &path, err))?;
 
-        let (csum, size) = index.compute_csum();
-        manifest.verify_file(pxar_name, &csum, size)?;
-
-        let chunk_reader = LocalChunkReader::new(datastore, None, CryptMode::None);
-        let reader = BufferedDynamicReader::new(index, chunk_reader);
-        let archive_size = reader.archive_size();
-        let reader = LocalDynamicReadAt::new(reader);
 
         let decoder = Accessor::new(reader, archive_size, None).await?;
         let root = decoder.open_root().await?;
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 28/58] api: datastore: attach optional payload chunk reader
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (26 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 27/58] api: datastore: refactor getting local chunk reader Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 29/58] catalog: shell: factor out pxar fuse reader instantiation Christian Ebner
                   ` (31 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Attach the payload chunk reader for pxar archives which have been
uploaded using split streams for metadata and payload data.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use new mpxar and ppxar file extensions

 src/api2/admin/datastore.rs | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index 555ab88ae..76cff0864 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -1724,9 +1724,15 @@ pub fn pxar_file_download(
         let (reader, archive_size) =
             get_local_pxar_reader(datastore.clone(), &manifest, &backup_dir, pxar_name)?;
 
+        let decoder = if let Some(archive_base_name) = pxar_name.strip_suffix(".mpxar.didx") {
+            let payload_archive_name = format!("{archive_base_name}.ppxar.didx");
+            let (payload_reader, _) =
+                get_local_pxar_reader(datastore, &manifest, &backup_dir, &payload_archive_name)?;
+            Accessor::new(reader, archive_size, Some(payload_reader)).await?
+        } else {
+            Accessor::new(reader, archive_size, None).await?
+        };
 
-
-        let decoder = Accessor::new(reader, archive_size, None).await?;
         let root = decoder.open_root().await?;
         let path = OsStr::from_bytes(file_path).to_os_string();
         let file = root
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 29/58] catalog: shell: factor out pxar fuse reader instantiation
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (27 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 28/58] api: datastore: attach optional payload " Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams Christian Ebner
                   ` (30 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

This prepares for restoring split metadata and payload stream pxar
archives via the catalog shell.

Make the pxar fuse reader instantiation reusable.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use archive name and pxar fuse reader helpers

 proxmox-backup-client/src/catalog.rs | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/proxmox-backup-client/src/catalog.rs b/proxmox-backup-client/src/catalog.rs
index 72b22e67f..2073e058d 100644
--- a/proxmox-backup-client/src/catalog.rs
+++ b/proxmox-backup-client/src/catalog.rs
@@ -14,12 +14,13 @@ use pbs_client::{BackupReader, RemoteChunkReader};
 use pbs_tools::crypt_config::CryptConfig;
 use pbs_tools::json::required_string_param;
 
+use crate::helper;
 use crate::{
     complete_backup_snapshot, complete_group_or_snapshot, complete_namespace,
     complete_pxar_archive_name, complete_repository, connect, crypto_parameters, decrypt_key,
     dir_or_last_from_group, extract_repository_from_value, format_key_source, optional_ns_param,
-    record_repository, BackupDir, BufferedDynamicReadAt, BufferedDynamicReader, CatalogReader,
-    DynamicIndexReader, IndexFile, Shell, CATALOG_NAME, KEYFD_SCHEMA, REPO_URL_SCHEMA,
+    record_repository, BackupDir, BufferedDynamicReader, CatalogReader, DynamicIndexReader,
+    IndexFile, Shell, CATALOG_NAME, KEYFD_SCHEMA, REPO_URL_SCHEMA,
 };
 
 #[api(
@@ -205,21 +206,16 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
     let (manifest, _) = client.download_manifest().await?;
     manifest.check_fingerprint(crypt_config.as_ref().map(Arc::as_ref))?;
 
-    let index = client
-        .download_dynamic_index(&manifest, &server_archive_name)
-        .await?;
-    let most_used = index.find_most_used_chunks(8);
+    let (archive_name, payload_archive_name) = helper::get_pxar_archive_names(&server_archive_name);
 
-    let file_info = manifest.lookup_file_info(&server_archive_name)?;
-    let chunk_reader = RemoteChunkReader::new(
+    let (reader, archive_size) = helper::get_pxar_fuse_reader(
+        &archive_name,
         client.clone(),
+        &manifest,
         crypt_config.clone(),
-        file_info.chunk_crypt_mode(),
-        most_used,
-    );
-    let reader = BufferedDynamicReader::new(index, chunk_reader);
-    let archive_size = reader.archive_size();
-    let reader: pbs_pxar_fuse::Reader = Arc::new(BufferedDynamicReadAt::new(reader));
+    )
+    .await?;
+
     let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
 
     client.download(CATALOG_NAME, &mut tmpfile).await?;
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (28 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 29/58] catalog: shell: factor out pxar fuse reader instantiation Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04  9:49   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives Christian Ebner
                   ` (29 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow attaching to pxar archives with split metadata and payload
streams by redirecting the payload input to a dedicated reader that
accesses the payload index.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use mpxar and ppxar file extensions
- use pxar fuse reader helper

 proxmox-backup-client/src/catalog.rs | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/proxmox-backup-client/src/catalog.rs b/proxmox-backup-client/src/catalog.rs
index 2073e058d..3e52880b9 100644
--- a/proxmox-backup-client/src/catalog.rs
+++ b/proxmox-backup-client/src/catalog.rs
@@ -181,7 +181,10 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
         }
     };
 
-    let server_archive_name = if archive_name.ends_with(".pxar") {
+    let server_archive_name = if archive_name.ends_with(".pxar")
+        || archive_name.ends_with(".mpxar")
+        || archive_name.ends_with(".ppxar")
+    {
         format!("{}.didx", archive_name)
     } else {
         bail!("Can only mount pxar archives.");
@@ -216,7 +219,18 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
     )
     .await?;
 
-    let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
+    let decoder = if let Some(payload_archive_name) = payload_archive_name {
+        let (payload_reader, _) = helper::get_pxar_fuse_reader(
+            &payload_archive_name,
+            client.clone(),
+            &manifest,
+            crypt_config.clone(),
+        )
+        .await?;
+        pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
+    } else {
+        pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
+    };
 
     client.download(CATALOG_NAME, &mut tmpfile).await?;
     let index = DynamicIndexReader::new(tmpfile)
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (29 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 10:01   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 32/58] pxar: add optional payload input for archive restore Christian Ebner
                   ` (28 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow accessing the pxar metadata archives for navigation and download
via the Proxmox Backup Server web UI.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use mpxar and ppxar file extensions

 www/datastore/Content.js | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/www/datastore/Content.js b/www/datastore/Content.js
index c2403ff9c..eb25f6ca4 100644
--- a/www/datastore/Content.js
+++ b/www/datastore/Content.js
@@ -1050,7 +1050,7 @@ Ext.define('PBS.DataStoreContent', {
 		    tooltip: gettext('Browse'),
 		    getClass: (v, m, { data }) => {
 			if (
-			    (data.ty === 'file' && data.filename.endsWith('pxar.didx')) ||
+			    (data.ty === 'file' && (data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx'))) ||
 			    (data.ty === 'ns' && !data.root)
 			) {
 			    return 'fa fa-folder-open-o';
@@ -1058,7 +1058,9 @@ Ext.define('PBS.DataStoreContent', {
 			return 'pmx-hidden';
 		    },
 		    isActionDisabled: (v, r, c, i, { data }) =>
-			!(data.ty === 'file' && data.filename.endsWith('pxar.didx') && data['crypt-mode'] < 3) && data.ty !== 'ns',
+			!(data.ty === 'file' &&
+			(data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx')) &&
+			data['crypt-mode'] < 3) && data.ty !== 'ns',
 		},
 	    ],
 	},
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 32/58] pxar: add optional payload input for archive restore
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (30 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 33/58] pxar: add more context to extraction error Christian Ebner
                   ` (27 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Allow passing an optional payload input for restore, covering cases
where the regular file payloads are stored in a split archive. This
enables an invocation along the lines of
`pxar extract archive.mpxar --payload-input archive.ppxar <target>`.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pxar-bin/src/main.rs | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 34944cf16..ac0acad0e 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -25,9 +25,10 @@ fn extract_archive_from_reader<R: std::io::Read>(
     target: &str,
     feature_flags: Flags,
     options: PxarExtractOptions,
+    payload_reader: Option<&mut R>,
 ) -> Result<(), Error> {
     pbs_client::pxar::extract_archive(
-        pxar::decoder::Decoder::from_std(reader, None)?,
+        pxar::decoder::Decoder::from_std(reader, payload_reader)?,
         Path::new(target),
         feature_flags,
         |path| {
@@ -120,6 +121,10 @@ fn extract_archive_from_reader<R: std::io::Read>(
                 optional: true,
                 default: false,
             },
+            "payload-input": {
+                description: "'ppxar' payload input data file to restore split archive.",
+                optional: true,
+            },
         },
     },
 )]
@@ -142,6 +147,7 @@ fn extract_archive(
     no_fifos: bool,
     no_sockets: bool,
     strict: bool,
+    payload_input: Option<String>,
 ) -> Result<(), Error> {
     let mut feature_flags = Flags::DEFAULT;
     if no_xattrs {
@@ -220,12 +226,24 @@ fn extract_archive(
     if archive == "-" {
         let stdin = std::io::stdin();
         let mut reader = stdin.lock();
-        extract_archive_from_reader(&mut reader, target, feature_flags, options)?;
+        extract_archive_from_reader(&mut reader, target, feature_flags, options, None)?;
     } else {
         log::debug!("PXAR extract: {}", archive);
         let file = std::fs::File::open(archive)?;
         let mut reader = std::io::BufReader::new(file);
-        extract_archive_from_reader(&mut reader, target, feature_flags, options)?;
+        let mut payload_reader = if let Some(payload_input) = payload_input {
+            let file = std::fs::File::open(payload_input)?;
+            Some(std::io::BufReader::new(file))
+        } else {
+            None
+        };
+        extract_archive_from_reader(
+            &mut reader,
+            target,
+            feature_flags,
+            options,
+            payload_reader.as_mut(),
+        )?;
     }
 
     if !was_ok.load(Ordering::Acquire) {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 33/58] pxar: add more context to extraction error
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (31 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 32/58] pxar: add optional payload input for archive restore Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 34/58] client: pxar: include payload offset in output Christian Ebner
                   ` (26 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Show more of the extraction error context provided by the pxar decoder.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pxar-bin/src/main.rs | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index ac0acad0e..44a6fa8a1 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -226,7 +226,8 @@ fn extract_archive(
     if archive == "-" {
         let stdin = std::io::stdin();
         let mut reader = stdin.lock();
-        extract_archive_from_reader(&mut reader, target, feature_flags, options, None)?;
+        extract_archive_from_reader(&mut reader, target, feature_flags, options, None)
+            .map_err(|err| format_err!("error extracting archive - {err:#}"))?;
     } else {
         log::debug!("PXAR extract: {}", archive);
         let file = std::fs::File::open(archive)?;
@@ -243,7 +244,8 @@ fn extract_archive(
             feature_flags,
             options,
             payload_reader.as_mut(),
-        )?;
+        )
+        .map_err(|err| format_err!("error extracting archive - {err:#}"))?
     }
 
     if !was_ok.load(Ordering::Acquire) {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 34/58] client: pxar: include payload offset in output
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (32 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 33/58] pxar: add more context to extraction error Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 35/58] pxar: show padding in debug output on archive list Christian Ebner
                   ` (25 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Also display the payload offset in the listing output when the regular
file entry holds a payload reference rather than the payload being
encoded inline in the archive.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/pxar/tools.rs | 116 ++++++++++++++++++++++++-----------
 1 file changed, 80 insertions(+), 36 deletions(-)

diff --git a/pbs-client/src/pxar/tools.rs b/pbs-client/src/pxar/tools.rs
index 0cfbaf5b9..459951d50 100644
--- a/pbs-client/src/pxar/tools.rs
+++ b/pbs-client/src/pxar/tools.rs
@@ -128,25 +128,42 @@ pub fn format_single_line_entry(entry: &Entry) -> String {
 
     let meta = entry.metadata();
 
-    let (size, link) = match entry.kind() {
-        EntryKind::File { size, .. } => (format!("{}", *size), String::new()),
-        EntryKind::Symlink(link) => ("0".to_string(), format!(" -> {:?}", link.as_os_str())),
-        EntryKind::Hardlink(link) => ("0".to_string(), format!(" -> {:?}", link.as_os_str())),
-        EntryKind::Device(dev) => (format!("{},{}", dev.major, dev.minor), String::new()),
-        _ => ("0".to_string(), String::new()),
+    let (size, link, payload_offset) = match entry.kind() {
+        EntryKind::File {
+            size,
+            payload_offset,
+            ..
+        } => (format!("{}", *size), String::new(), *payload_offset),
+        EntryKind::Symlink(link) => ("0".to_string(), format!(" -> {:?}", link.as_os_str()), None),
+        EntryKind::Hardlink(link) => ("0".to_string(), format!(" -> {:?}", link.as_os_str()), None),
+        EntryKind::Device(dev) => (format!("{},{}", dev.major, dev.minor), String::new(), None),
+        _ => ("0".to_string(), String::new(), None),
     };
 
     let owner_string = format!("{}/{}", meta.stat.uid, meta.stat.gid);
 
-    format!(
-        "{} {:<13} {} {:>8} {:?}{}",
-        mode_string,
-        owner_string,
-        format_mtime(&meta.stat.mtime),
-        size,
-        entry.path(),
-        link,
-    )
+    if let Some(offset) = payload_offset {
+        format!(
+            "{} {:<13} {} {:>8} {:?}{} {}",
+            mode_string,
+            owner_string,
+            format_mtime(&meta.stat.mtime),
+            size,
+            entry.path(),
+            link,
+            offset,
+        )
+    } else {
+        format!(
+            "{} {:<13} {} {:>8} {:?}{}",
+            mode_string,
+            owner_string,
+            format_mtime(&meta.stat.mtime),
+            size,
+            entry.path(),
+            link,
+        )
+    }
 }
 
 pub fn format_multi_line_entry(entry: &Entry) -> String {
@@ -154,17 +171,23 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
 
     let meta = entry.metadata();
 
-    let (size, link, type_name) = match entry.kind() {
-        EntryKind::File { size, .. } => (format!("{}", *size), String::new(), "file"),
+    let (size, link, type_name, payload_offset) = match entry.kind() {
+        EntryKind::File {
+            size,
+            payload_offset,
+            ..
+        } => (format!("{}", *size), String::new(), "file", *payload_offset),
         EntryKind::Symlink(link) => (
             "0".to_string(),
             format!(" -> {:?}", link.as_os_str()),
             "symlink",
+            None,
         ),
         EntryKind::Hardlink(link) => (
             "0".to_string(),
             format!(" -> {:?}", link.as_os_str()),
             "symlink",
+            None,
         ),
         EntryKind::Device(dev) => (
             format!("{},{}", dev.major, dev.minor),
@@ -176,11 +199,12 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
             } else {
                 "device"
             },
+            None,
         ),
-        EntryKind::Socket => ("0".to_string(), String::new(), "socket"),
-        EntryKind::Fifo => ("0".to_string(), String::new(), "fifo"),
-        EntryKind::Directory => ("0".to_string(), String::new(), "directory"),
-        EntryKind::GoodbyeTable => ("0".to_string(), String::new(), "bad entry"),
+        EntryKind::Socket => ("0".to_string(), String::new(), "socket", None),
+        EntryKind::Fifo => ("0".to_string(), String::new(), "fifo", None),
+        EntryKind::Directory => ("0".to_string(), String::new(), "directory", None),
+        EntryKind::GoodbyeTable => ("0".to_string(), String::new(), "bad entry", None),
     };
 
     let file_name = match std::str::from_utf8(entry.path().as_os_str().as_bytes()) {
@@ -188,19 +212,39 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
         Err(_) => std::borrow::Cow::Owned(format!("{:?}", entry.path())),
     };
 
-    format!(
-        "  File: {}{}\n  \
-           Size: {:<13} Type: {}\n\
-         Access: ({:o}/{})  Uid: {:<5} Gid: {:<5}\n\
-         Modify: {}\n",
-        file_name,
-        link,
-        size,
-        type_name,
-        meta.file_mode(),
-        mode_string,
-        meta.stat.uid,
-        meta.stat.gid,
-        format_mtime(&meta.stat.mtime),
-    )
+    if let Some(offset) = payload_offset {
+        format!(
+            "  File: {}{}\n  \
+               Size: {:<13} Type: {}\n\
+             Access: ({:o}/{})  Uid: {:<5} Gid: {:<5}\n\
+             Modify: {}\n
+             PayloadOffset: {}\n",
+            file_name,
+            link,
+            size,
+            type_name,
+            meta.file_mode(),
+            mode_string,
+            meta.stat.uid,
+            meta.stat.gid,
+            format_mtime(&meta.stat.mtime),
+            offset,
+        )
+    } else {
+        format!(
+            "  File: {}{}\n  \
+               Size: {:<13} Type: {}\n\
+             Access: ({:o}/{})  Uid: {:<5} Gid: {:<5}\n\
+             Modify: {}\n",
+            file_name,
+            link,
+            size,
+            type_name,
+            meta.file_mode(),
+            mode_string,
+            meta.stat.uid,
+            meta.stat.gid,
+            format_mtime(&meta.stat.mtime),
+        )
+    }
 }
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 35/58] pxar: show padding in debug output on archive list
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (33 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 34/58] client: pxar: include payload offset in output Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 36/58] datastore: dynamic index: add method to get digest Christian Ebner
                   ` (24 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In addition to the entries, also show the padding encountered in
between referenced payloads.

Example invocation: `PXAR_LOG=debug pxar list archive.mpxar`
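
A hedged numeric sketch of the padding computation performed by the
hunk below (all offsets invented; the 16-byte header size is an
assumption about `pxar::format::Header`):

```rust
fn main() {
    // Invented offsets: previous payload reference at 1000 with size 500,
    // plus the per-entry pxar header accounted for in-between.
    let (prev_offset, prev_size, header_len) = (1000u64, 500u64, 16u64);
    let expected_next = prev_offset + prev_size + header_len; // 1516

    // A following entry whose payload offset is larger leaves a gap that
    // the debug output reports as padding:
    let next_offset = 2000u64;
    assert_eq!(next_offset - expected_next, 484);
}
```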

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pxar-bin/src/main.rs | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 44a6fa8a1..58c9d2cfd 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -9,6 +9,7 @@ use std::sync::Arc;
 use anyhow::{bail, format_err, Error};
 use futures::future::FutureExt;
 use futures::select;
+use pxar::EntryKind;
 use tokio::signal::unix::{signal, SignalKind};
 
 use pathpatterns::{MatchEntry, MatchType, PatternFlag};
@@ -456,10 +457,28 @@ async fn mount_archive(archive: String, mountpoint: String, verbose: bool) -> Re
 )]
 /// List the contents of an archive.
 fn dump_archive(archive: String) -> Result<(), Error> {
+    let mut last = None;
     for entry in pxar::decoder::Decoder::open(archive)? {
         let entry = entry?;
 
         if log::log_enabled!(log::Level::Debug) {
+            match entry.kind() {
+                EntryKind::File {
+                    payload_offset: Some(offset),
+                    size,
+                    ..
+                } => {
+                    if let Some(last) = last {
+                        let skipped = offset - last;
+                        if skipped > 0 {
+                            log::debug!("Encountered padding of {skipped} bytes");
+                        }
+                    }
+                    last = Some(offset + size + std::mem::size_of::<pxar::format::Header>() as u64);
+                }
+                _ => (),
+            }
+
             log::debug!("{}", format_single_line_entry(&entry));
         } else {
             log::info!("{:?}", entry.path());
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 36/58] datastore: dynamic index: add method to get digest
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (34 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 35/58] pxar: show padding in debug output on archive list Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
                   ` (23 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In preparation for injecting reused payload chunks into payload streams
for regular files with unchanged metadata.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-datastore/src/dynamic_index.rs | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/pbs-datastore/src/dynamic_index.rs b/pbs-datastore/src/dynamic_index.rs
index 71a5082e1..b8047b5b1 100644
--- a/pbs-datastore/src/dynamic_index.rs
+++ b/pbs-datastore/src/dynamic_index.rs
@@ -72,6 +72,11 @@ impl DynamicEntry {
     pub fn end(&self) -> u64 {
         u64::from_le(self.end_le)
     }
+
+    #[inline]
+    pub fn digest(&self) -> [u8; 32] {
+        self.digest
+    }
 }
 
 pub struct DynamicIndexReader {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (35 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 36/58] datastore: dynamic index: add method to get digest Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 12:54   ` Fabian Grünbichler
  2024-04-05 11:28   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector Christian Ebner
                   ` (22 subsequent siblings)
  59 siblings, 2 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

The helper method allows looking up the entries of a dynamic index
which fully cover a given offset range. Further, the helper returns the
start padding from the start offset of the first dynamic index entry to
the start offset of the given range, as well as the corresponding end
padding.

This will be used to look up the size and digest of chunks covering the
payload range of a regular file, in order to re-use found chunks by
indexing them in the archive's index file instead of re-encoding the
payload.
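
As a hedged worked example of the lookup (chunk layout invented):

```rust
fn main() {
    // Invented layout: end offsets of three chunks in the dynamic index.
    let chunk_ends = [4096u64, 8192, 12288];
    let range = 5000u64..9000;

    // Chunks two and three fully cover the range; the helper reports the
    // unused bytes at both ends as padding:
    let padding_start = range.start - chunk_ends[0]; // 5000 - 4096
    let padding_end = chunk_ends[2] - range.end; // 12288 - 9000
    assert_eq!((padding_start, padding_end), (904, 3288));
}
```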

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- moved this from the dynamic index to the pxar create as suggested
- refactored and optimized search, going for linear search to find the
  end entry
- reworded commit message

 pbs-client/src/pxar/create.rs | 63 +++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 2bb5a6253..e2d3954ca 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet};
 use std::ffi::{CStr, CString, OsStr};
 use std::fmt;
 use std::io::{self, Read};
+use std::ops::Range;
 use std::os::unix::ffi::OsStrExt;
 use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
 use std::path::{Path, PathBuf};
@@ -16,6 +17,7 @@ use nix::fcntl::OFlag;
 use nix::sys::stat::{FileStat, Mode};
 
 use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
+use pbs_datastore::index::IndexFile;
 use proxmox_sys::error::SysError;
 use pxar::encoder::{LinkOffset, SeqWrite};
 use pxar::Metadata;
@@ -25,6 +27,7 @@ use proxmox_lang::c_str;
 use proxmox_sys::fs::{self, acl, xattr};
 
 use pbs_datastore::catalog::BackupCatalogWriter;
+use pbs_datastore::dynamic_index::DynamicIndexReader;
 
 use crate::pxar::metadata::errno_is_unsupported;
 use crate::pxar::tools::assert_single_path_component;
@@ -791,6 +794,66 @@ impl Archiver {
     }
 }
 
+/// Dynamic Entry reusable by payload references
+#[derive(Clone, Debug)]
+#[repr(C)]
+pub struct ReusableDynamicEntry {
+    size_le: u64,
+    digest: [u8; 32],
+}
+
+impl ReusableDynamicEntry {
+    #[inline]
+    pub fn size(&self) -> u64 {
+        u64::from_le(self.size_le)
+    }
+
+    #[inline]
+    pub fn digest(&self) -> [u8; 32] {
+        self.digest
+    }
+}
+
+/// List of dynamic entries containing the data given by an offset range
+fn lookup_dynamic_entries(
+    index: &DynamicIndexReader,
+    range: Range<u64>,
+) -> Result<(Vec<ReusableDynamicEntry>, u64, u64), Error> {
+    let end_idx = index.index_count() - 1;
+    let chunk_end = index.chunk_end(end_idx);
+    let start = index.binary_search(0, 0, end_idx, chunk_end, range.start)?;
+    let mut end = start;
+    while end < end_idx {
+        if range.end < index.chunk_end(end) {
+            break;
+        }
+        end += 1;
+    }
+
+    let offset_first = if start == 0 {
+        0
+    } else {
+        index.chunk_end(start - 1)
+    };
+
+    let padding_start = range.start - offset_first;
+    let padding_end = index.chunk_end(end) - range.end;
+
+    let mut indices = Vec::new();
+    let mut prev_end = offset_first;
+    for dynamic_entry in &index.index()[start..end + 1] {
+        let size = dynamic_entry.end() - prev_end;
+        let reusable_dynamic_entry = ReusableDynamicEntry {
+            size_le: size.to_le(),
+            digest: dynamic_entry.digest(),
+        };
+        prev_end += size;
+        indices.push(reusable_dynamic_entry);
+    }
+
+    Ok((indices, padding_start, padding_end))
+}
+
 fn get_metadata(
     fd: RawFd,
     stat: &FileStat,
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (36 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 14:24   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 39/58] client: chunk stream: add struct to hold injection state Christian Ebner
                   ` (21 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In order to be included in the backup's index file, reused payload
chunks have to be injected into the payload upload stream.

The chunker forces a chunk boundary and queues the list of chunks to
be uploaded thereafter.

This implements the logic to inject the chunks into the chunk upload
stream after such a boundary is requested, by looping over the queued
chunks and inserting them into the stream.
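
A minimal sketch of the producing side, assuming the `InjectChunks` and
`ReusableDynamicEntry` types from this series are in scope (the helper
itself is illustrative, not part of the patch):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Queue reused chunks so the injector emits them once the chunker has
// flushed a forced boundary at the given stream offset.
fn queue_reused_chunks(
    injection_queue: &Arc<Mutex<VecDeque<InjectChunks>>>,
    boundary: u64,
    chunks: Vec<ReusableDynamicEntry>,
) {
    let size = chunks.iter().map(|e| e.size() as usize).sum();
    injection_queue.lock().unwrap().push_back(InjectChunks {
        boundary, // stream offset where the chunker forced the boundary
        chunks,
        size, // total size of the chunks queued for injection
    });
}
```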

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes

 pbs-client/src/inject_reused_chunks.rs | 152 +++++++++++++++++++++++++
 pbs-client/src/lib.rs                  |   1 +
 2 files changed, 153 insertions(+)
 create mode 100644 pbs-client/src/inject_reused_chunks.rs

diff --git a/pbs-client/src/inject_reused_chunks.rs b/pbs-client/src/inject_reused_chunks.rs
new file mode 100644
index 000000000..5cc19ce5d
--- /dev/null
+++ b/pbs-client/src/inject_reused_chunks.rs
@@ -0,0 +1,152 @@
+use std::collections::VecDeque;
+use std::pin::Pin;
+use std::sync::atomic::{AtomicUsize, Ordering};
+use std::sync::{Arc, Mutex};
+use std::task::{Context, Poll};
+
+use anyhow::{anyhow, Error};
+use futures::{ready, Stream};
+use pin_project_lite::pin_project;
+
+use crate::pxar::create::ReusableDynamicEntry;
+
+pin_project! {
+    pub struct InjectReusedChunksQueue<S> {
+        #[pin]
+        input: S,
+        current: Option<InjectChunks>,
+        buffer: Option<bytes::BytesMut>,
+        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
+        stream_len: Arc<AtomicUsize>,
+        reused_len: Arc<AtomicUsize>,
+        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,
+    }
+}
+
+#[derive(Debug)]
+pub struct InjectChunks {
+    pub boundary: u64,
+    pub chunks: Vec<ReusableDynamicEntry>,
+    pub size: usize,
+}
+
+pub enum InjectedChunksInfo {
+    Known(Vec<(u64, [u8; 32])>),
+    Raw((u64, bytes::BytesMut)),
+}
+
+pub trait InjectReusedChunks: Sized {
+    fn inject_reused_chunks(
+        self,
+        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
+        stream_len: Arc<AtomicUsize>,
+        reused_len: Arc<AtomicUsize>,
+        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,
+    ) -> InjectReusedChunksQueue<Self>;
+}
+
+impl<S> InjectReusedChunks for S
+where
+    S: Stream<Item = Result<bytes::BytesMut, Error>>,
+{
+    fn inject_reused_chunks(
+        self,
+        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
+        stream_len: Arc<AtomicUsize>,
+        reused_len: Arc<AtomicUsize>,
+        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,
+    ) -> InjectReusedChunksQueue<Self> {
+        InjectReusedChunksQueue {
+            input: self,
+            current: None,
+            injection_queue,
+            buffer: None,
+            stream_len,
+            reused_len,
+            index_csum,
+        }
+    }
+}
+
+impl<S> Stream for InjectReusedChunksQueue<S>
+where
+    S: Stream<Item = Result<bytes::BytesMut, Error>>,
+{
+    type Item = Result<InjectedChunksInfo, Error>;
+
+    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
+        let mut this = self.project();
+        loop {
+            let current = this.current.take();
+            if let Some(current) = current {
+                let mut chunks = Vec::new();
+                let mut guard = this.index_csum.lock().unwrap();
+                let csum = guard.as_mut().unwrap();
+
+                for chunk in current.chunks {
+                    let offset = this
+                        .stream_len
+                        .fetch_add(chunk.size() as usize, Ordering::SeqCst)
+                        as u64;
+                    this.reused_len
+                        .fetch_add(chunk.size() as usize, Ordering::SeqCst);
+                    let digest = chunk.digest();
+                    chunks.push((offset, digest));
+                    let end_offset = offset + chunk.size();
+                    csum.update(&end_offset.to_le_bytes());
+                    csum.update(&digest);
+                }
+                let chunk_info = InjectedChunksInfo::Known(chunks);
+                return Poll::Ready(Some(Ok(chunk_info)));
+            }
+
+            let buffer = this.buffer.take();
+            if let Some(buffer) = buffer {
+                let offset = this.stream_len.fetch_add(buffer.len(), Ordering::SeqCst) as u64;
+                let data = InjectedChunksInfo::Raw((offset, buffer));
+                return Poll::Ready(Some(Ok(data)));
+            }
+
+            match ready!(this.input.as_mut().poll_next(cx)) {
+                None => return Poll::Ready(None),
+                Some(Err(err)) => return Poll::Ready(Some(Err(err))),
+                Some(Ok(raw)) => {
+                    let chunk_size = raw.len();
+                    let offset = this.stream_len.load(Ordering::SeqCst) as u64;
+                    let mut injections = this.injection_queue.lock().unwrap();
+
+                    if let Some(inject) = injections.pop_front() {
+                        if inject.boundary == offset {
+                            if this.current.replace(inject).is_some() {
+                                return Poll::Ready(Some(Err(anyhow!(
+                                    "replaced injection queue not empty"
+                                ))));
+                            }
+                            if chunk_size > 0 && this.buffer.replace(raw).is_some() {
+                                return Poll::Ready(Some(Err(anyhow!(
+                                    "replaced buffer not empty"
+                                ))));
+                            }
+                            continue;
+                        } else if inject.boundary == offset + chunk_size as u64 {
+                            let _ = this.current.insert(inject);
+                        } else if inject.boundary < offset + chunk_size as u64 {
+                            return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
+                        } else {
+                            injections.push_front(inject);
+                        }
+                    }
+
+                    if chunk_size == 0 {
+                        return Poll::Ready(Some(Err(anyhow!("unexpected empty raw data"))));
+                    }
+
+                    let offset = this.stream_len.fetch_add(chunk_size, Ordering::SeqCst) as u64;
+                    let data = InjectedChunksInfo::Raw((offset, raw));
+
+                    return Poll::Ready(Some(Ok(data)));
+                }
+            }
+        }
+    }
+}
diff --git a/pbs-client/src/lib.rs b/pbs-client/src/lib.rs
index 21cf8556b..3e7bd2a8b 100644
--- a/pbs-client/src/lib.rs
+++ b/pbs-client/src/lib.rs
@@ -7,6 +7,7 @@ pub mod catalog_shell;
 pub mod pxar;
 pub mod tools;
 
+mod inject_reused_chunks;
 mod merge_known_chunks;
 pub mod pipe_to_stream;
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 39/58] client: chunk stream: add struct to hold injection state
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (37 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues Christian Ebner
                   ` (20 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Adds a dedicated structure to hold the optional queues and state for
the injection of reused dynamic entries into the payload stream of
split pxar archives, used when payload chunks from a previous backup
run are reused for regular files with unchanged metadata.

The queues must only be attached to the payload archive, leaving the
current behaviour unchanged for the metadata archive and for regular
encoding without reuse of payload chunks from previous snapshots.
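
As a rough sketch, the call site added later in this series wires the
queues up along these lines (a fragment, not runnable on its own):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// both queues start out empty: the archiver pushes forced boundaries,
// the chunk stream moves matched entries over to the injections queue,
// and the uploader drains them
let payload_boundaries = Arc::new(Mutex::new(VecDeque::new()));
let payload_injections = Arc::new(Mutex::new(VecDeque::new()));
let injection_data =
    InjectionData::new(payload_boundaries.clone(), payload_injections.clone());
```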

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/chunk_stream.rs | 24 ++++++++++++++++++++++++
 pbs-client/src/lib.rs          |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/pbs-client/src/chunk_stream.rs b/pbs-client/src/chunk_stream.rs
index 895f6eae2..a45420ca0 100644
--- a/pbs-client/src/chunk_stream.rs
+++ b/pbs-client/src/chunk_stream.rs
@@ -1,4 +1,6 @@
+use std::collections::VecDeque;
 use std::pin::Pin;
+use std::sync::{Arc, Mutex};
 use std::task::{Context, Poll};
 
 use anyhow::Error;
@@ -8,6 +10,28 @@ use futures::stream::{Stream, TryStream};
 
 use pbs_datastore::Chunker;
 
+use crate::inject_reused_chunks::InjectChunks;
+
+/// Holds the queues for optional injection of reused dynamic index entries
+pub struct InjectionData {
+    boundaries: Arc<Mutex<VecDeque<InjectChunks>>>,
+    injections: Arc<Mutex<VecDeque<InjectChunks>>>,
+    consumed: u64,
+}
+
+impl InjectionData {
+    pub fn new(
+        boundaries: Arc<Mutex<VecDeque<InjectChunks>>>,
+        injections: Arc<Mutex<VecDeque<InjectChunks>>>,
+    ) -> Self {
+        Self {
+            boundaries,
+            injections,
+            consumed: 0,
+        }
+    }
+}
+
 /// Split input stream into dynamic sized chunks
 pub struct ChunkStream<S: Unpin> {
     input: S,
diff --git a/pbs-client/src/lib.rs b/pbs-client/src/lib.rs
index 3e7bd2a8b..3d2da27b9 100644
--- a/pbs-client/src/lib.rs
+++ b/pbs-client/src/lib.rs
@@ -39,6 +39,6 @@ mod backup_specification;
 pub use backup_specification::*;
 
 mod chunk_stream;
-pub use chunk_stream::{ChunkStream, FixedChunkStream};
+pub use chunk_stream::{ChunkStream, FixedChunkStream, InjectionData};
 
 pub const PROXMOX_BACKUP_TCP_KEEPALIVE_TIME: u32 = 120;
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (38 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 39/58] client: chunk stream: add struct to hold injection state Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 14:52   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification Christian Ebner
                   ` (19 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Adds a queue to the chunk stream to request forced boundaries at a
given offset within the stream and inject reused dynamic entries
after this boundary.

The chunks are then passed along to the uploader stream using the
injection queue, which inserts them during upload.
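
The offset bookkeeping can be sketched as follows (a simplified,
self-contained version, assuming the requested boundary falls within the
currently buffered data):

```rust
use bytes::BytesMut;

/// `consumed` counts the bytes already emitted into the chunk stream.
fn force_boundary(
    buffer: &mut BytesMut,
    consumed: &mut u64,
    boundary: u64,
    injected_size: u64,
) -> BytesMut {
    debug_assert!(*consumed <= boundary && boundary <= *consumed + buffer.len() as u64);
    let chunk_size = (boundary - *consumed) as usize;
    let chunk = buffer.split_to(chunk_size); // forced chunk ends exactly at the boundary
    *consumed += chunk_size as u64;
    // account for the size of the injected reused chunks, so later stream
    // offsets stay in sync with the rest of the archive
    *consumed += injected_size;
    chunk
}
```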

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- combined queues into new optional struct
- refactoring

 examples/test_chunk_speed2.rs                 |  2 +-
 pbs-client/src/backup_writer.rs               | 89 +++++++++++--------
 pbs-client/src/chunk_stream.rs                | 36 +++++++-
 pbs-client/src/pxar/create.rs                 |  6 +-
 pbs-client/src/pxar_backup_stream.rs          |  7 +-
 proxmox-backup-client/src/main.rs             | 31 ++++---
 .../src/proxmox_restore_daemon/api.rs         |  1 +
 pxar-bin/src/main.rs                          |  1 +
 tests/catar.rs                                |  1 +
 9 files changed, 121 insertions(+), 53 deletions(-)

diff --git a/examples/test_chunk_speed2.rs b/examples/test_chunk_speed2.rs
index 3f69b436d..22dd14ce2 100644
--- a/examples/test_chunk_speed2.rs
+++ b/examples/test_chunk_speed2.rs
@@ -26,7 +26,7 @@ async fn run() -> Result<(), Error> {
         .map_err(Error::from);
 
     //let chunk_stream = FixedChunkStream::new(stream, 4*1024*1024);
-    let mut chunk_stream = ChunkStream::new(stream, None);
+    let mut chunk_stream = ChunkStream::new(stream, None, None);
 
     let start_time = std::time::Instant::now();
 
diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
index 8bd0e4f36..032d93da7 100644
--- a/pbs-client/src/backup_writer.rs
+++ b/pbs-client/src/backup_writer.rs
@@ -1,4 +1,4 @@
-use std::collections::HashSet;
+use std::collections::{HashSet, VecDeque};
 use std::future::Future;
 use std::os::unix::fs::OpenOptionsExt;
 use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
@@ -23,6 +23,7 @@ use pbs_tools::crypt_config::CryptConfig;
 
 use proxmox_human_byte::HumanByte;
 
+use super::inject_reused_chunks::{InjectChunks, InjectReusedChunks, InjectedChunksInfo};
 use super::merge_known_chunks::{MergeKnownChunks, MergedChunkInfo};
 
 use super::{H2Client, HttpClient};
@@ -265,6 +266,7 @@ impl BackupWriter {
         archive_name: &str,
         stream: impl Stream<Item = Result<bytes::BytesMut, Error>>,
         options: UploadOptions,
+        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
     ) -> Result<BackupStats, Error> {
         let known_chunks = Arc::new(Mutex::new(HashSet::new()));
 
@@ -341,6 +343,7 @@ impl BackupWriter {
                 None
             },
             options.compress,
+            injection_queue,
         )
         .await?;
 
@@ -637,6 +640,7 @@ impl BackupWriter {
         known_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
         crypt_config: Option<Arc<CryptConfig>>,
         compress: bool,
+        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
     ) -> impl Future<Output = Result<UploadStats, Error>> {
         let total_chunks = Arc::new(AtomicUsize::new(0));
         let total_chunks2 = total_chunks.clone();
@@ -663,48 +667,63 @@ impl BackupWriter {
         let index_csum_2 = index_csum.clone();
 
         stream
-            .and_then(move |data| {
-                let chunk_len = data.len();
+            .inject_reused_chunks(
+                injection_queue.unwrap_or_default(),
+                stream_len,
+                reused_len.clone(),
+                index_csum.clone(),
+            )
+            .and_then(move |chunk_info| match chunk_info {
+                InjectedChunksInfo::Known(chunks) => {
+                    total_chunks.fetch_add(chunks.len(), Ordering::SeqCst);
+                    future::ok(MergedChunkInfo::Known(chunks))
+                }
+                InjectedChunksInfo::Raw((offset, data)) => {
+                    let chunk_len = data.len();
 
-                total_chunks.fetch_add(1, Ordering::SeqCst);
-                let offset = stream_len.fetch_add(chunk_len, Ordering::SeqCst) as u64;
+                    total_chunks.fetch_add(1, Ordering::SeqCst);
 
-                let mut chunk_builder = DataChunkBuilder::new(data.as_ref()).compress(compress);
+                    let mut chunk_builder = DataChunkBuilder::new(data.as_ref()).compress(compress);
 
-                if let Some(ref crypt_config) = crypt_config {
-                    chunk_builder = chunk_builder.crypt_config(crypt_config);
-                }
+                    if let Some(ref crypt_config) = crypt_config {
+                        chunk_builder = chunk_builder.crypt_config(crypt_config);
+                    }
 
-                let mut known_chunks = known_chunks.lock().unwrap();
-                let digest = chunk_builder.digest();
+                    let mut known_chunks = known_chunks.lock().unwrap();
 
-                let mut guard = index_csum.lock().unwrap();
-                let csum = guard.as_mut().unwrap();
+                    let digest = chunk_builder.digest();
 
-                let chunk_end = offset + chunk_len as u64;
+                    let mut guard = index_csum.lock().unwrap();
+                    let csum = guard.as_mut().unwrap();
 
-                if !is_fixed_chunk_size {
-                    csum.update(&chunk_end.to_le_bytes());
-                }
-                csum.update(digest);
+                    let chunk_end = offset + chunk_len as u64;
 
-                let chunk_is_known = known_chunks.contains(digest);
-                if chunk_is_known {
-                    known_chunk_count.fetch_add(1, Ordering::SeqCst);
-                    reused_len.fetch_add(chunk_len, Ordering::SeqCst);
-                    future::ok(MergedChunkInfo::Known(vec![(offset, *digest)]))
-                } else {
-                    let compressed_stream_len2 = compressed_stream_len.clone();
-                    known_chunks.insert(*digest);
-                    future::ready(chunk_builder.build().map(move |(chunk, digest)| {
-                        compressed_stream_len2.fetch_add(chunk.raw_size(), Ordering::SeqCst);
-                        MergedChunkInfo::New(ChunkInfo {
-                            chunk,
-                            digest,
-                            chunk_len: chunk_len as u64,
-                            offset,
-                        })
-                    }))
+                    if !is_fixed_chunk_size {
+                        csum.update(&chunk_end.to_le_bytes());
+                    }
+                    csum.update(digest);
+
+                    let chunk_is_known = known_chunks.contains(digest);
+                    if chunk_is_known {
+                        known_chunk_count.fetch_add(1, Ordering::SeqCst);
+                        reused_len.fetch_add(chunk_len, Ordering::SeqCst);
+
+                        future::ok(MergedChunkInfo::Known(vec![(offset, *digest)]))
+                    } else {
+                        let compressed_stream_len2 = compressed_stream_len.clone();
+                        known_chunks.insert(*digest);
+
+                        future::ready(chunk_builder.build().map(move |(chunk, digest)| {
+                            compressed_stream_len2.fetch_add(chunk.raw_size(), Ordering::SeqCst);
+
+                            MergedChunkInfo::New(ChunkInfo {
+                                chunk,
+                                digest,
+                                chunk_len: chunk_len as u64,
+                                offset,
+                            })
+                        }))
+                    }
                 }
             })
             .merge_known_chunks()
diff --git a/pbs-client/src/chunk_stream.rs b/pbs-client/src/chunk_stream.rs
index a45420ca0..6ac0c638b 100644
--- a/pbs-client/src/chunk_stream.rs
+++ b/pbs-client/src/chunk_stream.rs
@@ -38,15 +38,17 @@ pub struct ChunkStream<S: Unpin> {
     chunker: Chunker,
     buffer: BytesMut,
     scan_pos: usize,
+    injection_data: Option<InjectionData>,
 }
 
 impl<S: Unpin> ChunkStream<S> {
-    pub fn new(input: S, chunk_size: Option<usize>) -> Self {
+    pub fn new(input: S, chunk_size: Option<usize>, injection_data: Option<InjectionData>) -> Self {
         Self {
             input,
             chunker: Chunker::new(chunk_size.unwrap_or(4 * 1024 * 1024)),
             buffer: BytesMut::new(),
             scan_pos: 0,
+            injection_data,
         }
     }
 }
@@ -64,6 +66,34 @@ where
     fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
         let this = self.get_mut();
         loop {
+            if let Some(InjectionData {
+                boundaries,
+                injections,
+                consumed,
+            }) = this.injection_data.as_mut()
+            {
+                // Make sure to release this lock as soon as possible
+                let mut boundaries = boundaries.lock().unwrap();
+                if let Some(inject) = boundaries.pop_front() {
+                    let max = *consumed + this.buffer.len() as u64;
+                    if inject.boundary <= max {
+                        let chunk_size = (inject.boundary - *consumed) as usize;
+                        let result = this.buffer.split_to(chunk_size);
+                        *consumed += chunk_size as u64;
+                        this.scan_pos = 0;
+
+                        // Add the size of the injected chunks to consumed, so chunk stream offsets
+                        // are in sync with the rest of the archive.
+                        *consumed += inject.size as u64;
+
+                        injections.lock().unwrap().push_back(inject);
+
+                        return Poll::Ready(Some(Ok(result)));
+                    }
+                    boundaries.push_front(inject);
+                }
+            }
+
             if this.scan_pos < this.buffer.len() {
                 let boundary = this.chunker.scan(&this.buffer[this.scan_pos..]);
 
@@ -74,7 +104,11 @@ where
                     // continue poll
                 } else if chunk_size <= this.buffer.len() {
                     let result = this.buffer.split_to(chunk_size);
+                    if let Some(InjectionData { consumed, .. }) = this.injection_data.as_mut() {
+                        *consumed += chunk_size as u64;
+                    }
                     this.scan_pos = 0;
+
                     return Poll::Ready(Some(Ok(result)));
                 } else {
                     panic!("got unexpected chunk boundary from chunker");
diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index e2d3954ca..2c7867f22 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -1,4 +1,4 @@
-use std::collections::{HashMap, HashSet};
+use std::collections::{HashMap, HashSet, VecDeque};
 use std::ffi::{CStr, CString, OsStr};
 use std::fmt;
 use std::io::{self, Read};
@@ -29,6 +29,7 @@ use proxmox_sys::fs::{self, acl, xattr};
 use pbs_datastore::catalog::BackupCatalogWriter;
 use pbs_datastore::dynamic_index::DynamicIndexReader;
 
+use crate::inject_reused_chunks::InjectChunks;
 use crate::pxar::metadata::errno_is_unsupported;
 use crate::pxar::tools::assert_single_path_component;
 use crate::pxar::Flags;
@@ -134,6 +135,7 @@ struct Archiver {
     hardlinks: HashMap<HardLinkInfo, (PathBuf, LinkOffset)>,
     file_copy_buffer: Vec<u8>,
     skip_e2big_xattr: bool,
+    forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
 }
 
 type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
@@ -164,6 +166,7 @@ pub async fn create_archive<T, F>(
     feature_flags: Flags,
     callback: F,
     options: PxarCreateOptions,
+    forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
 ) -> Result<(), Error>
 where
     T: SeqWrite + Send,
@@ -224,6 +227,7 @@ where
         hardlinks: HashMap::new(),
         file_copy_buffer: vec::undefined(4 * 1024 * 1024),
         skip_e2big_xattr: options.skip_e2big_xattr,
+        forced_boundaries,
     };
 
     archiver
diff --git a/pbs-client/src/pxar_backup_stream.rs b/pbs-client/src/pxar_backup_stream.rs
index 95145cb0d..4ea084f28 100644
--- a/pbs-client/src/pxar_backup_stream.rs
+++ b/pbs-client/src/pxar_backup_stream.rs
@@ -1,3 +1,4 @@
+use std::collections::VecDeque;
 use std::io::Write;
 //use std::os::unix::io::FromRawFd;
 use std::path::Path;
@@ -17,6 +18,7 @@ use proxmox_io::StdChannelWriter;
 
 use pbs_datastore::catalog::CatalogWriter;
 
+use crate::inject_reused_chunks::InjectChunks;
 use crate::pxar::create::PxarWriters;
 
 /// Stream implementation to encode and upload .pxar archives.
@@ -42,6 +44,7 @@ impl PxarBackupStream {
         dir: Dir,
         catalog: Arc<Mutex<CatalogWriter<W>>>,
         options: crate::pxar::PxarCreateOptions,
+        boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
         separate_payload_stream: bool,
     ) -> Result<(Self, Option<Self>), Error> {
         let buffer_size = 256 * 1024;
@@ -79,6 +82,7 @@ impl PxarBackupStream {
                     Ok(())
                 },
                 options,
+                boundaries,
             )
             .await
             {
@@ -110,11 +114,12 @@ impl PxarBackupStream {
         dirname: &Path,
         catalog: Arc<Mutex<CatalogWriter<W>>>,
         options: crate::pxar::PxarCreateOptions,
+        boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
         separate_payload_stream: bool,
     ) -> Result<(Self, Option<Self>), Error> {
         let dir = nix::dir::Dir::open(dirname, OFlag::O_DIRECTORY, Mode::empty())?;
 
-        Self::new(dir, catalog, options, separate_payload_stream)
+        Self::new(dir, catalog, options, boundaries, separate_payload_stream)
     }
 }
 
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 294b52ddb..215095ee7 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -1,4 +1,4 @@
-use std::collections::HashSet;
+use std::collections::{HashSet, VecDeque};
 use std::io::{self, Read, Seek, SeekFrom, Write};
 use std::path::{Path, PathBuf};
 use std::pin::Pin;
@@ -43,10 +43,10 @@ use pbs_client::tools::{
     CHUNK_SIZE_SCHEMA, REPO_URL_SCHEMA,
 };
 use pbs_client::{
-    delete_ticket_info, parse_backup_specification, view_task_result, BackupReader,
-    BackupRepository, BackupSpecificationType, BackupStats, BackupWriter, ChunkStream,
-    FixedChunkStream, HttpClient, PxarBackupStream, RemoteChunkReader, UploadOptions,
-    BACKUP_SOURCE_SCHEMA,
+    delete_ticket_info, parse_backup_detection_mode_specification, parse_backup_specification,
+    view_task_result, BackupReader, BackupRepository, BackupSpecificationType, BackupStats,
+    BackupWriter, ChunkStream, FixedChunkStream, HttpClient, InjectionData, PxarBackupStream,
+    RemoteChunkReader, UploadOptions, BACKUP_DETECTION_MODE_SPEC, BACKUP_SOURCE_SCHEMA,
 };
 use pbs_datastore::catalog::{BackupCatalogWriter, CatalogReader, CatalogWriter};
 use pbs_datastore::chunk_store::verify_chunk_size;
@@ -199,14 +199,16 @@ async fn backup_directory<P: AsRef<Path>>(
         bail!("cannot backup directory with fixed chunk size!");
     }
 
+    let payload_boundaries = Arc::new(Mutex::new(VecDeque::new()));
     let (pxar_stream, payload_stream) = PxarBackupStream::open(
         dir_path.as_ref(),
         catalog,
         pxar_create_options,
+        Some(payload_boundaries.clone()),
         payload_target.is_some(),
     )?;
 
-    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size);
+    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size, None);
     let (tx, rx) = mpsc::channel(10); // allow to buffer 10 chunks
 
     let stream = ReceiverStream::new(rx).map_err(Error::from);
@@ -218,16 +220,16 @@ async fn backup_directory<P: AsRef<Path>>(
         }
     });
 
-    let stats = client.upload_stream(archive_name, stream, upload_options.clone());
+    let stats = client.upload_stream(archive_name, stream, upload_options.clone(), None);
 
     if let Some(payload_stream) = payload_stream {
         let payload_target = payload_target
             .ok_or_else(|| format_err!("got payload stream, but no target archive name"))?;
 
-        let mut payload_chunk_stream = ChunkStream::new(
-            payload_stream,
-            chunk_size,
-        );
+        let payload_injections = Arc::new(Mutex::new(VecDeque::new()));
+        let injection_data = InjectionData::new(payload_boundaries, payload_injections.clone());
+        let mut payload_chunk_stream =
+            ChunkStream::new(payload_stream, chunk_size, Some(injection_data));
         let (payload_tx, payload_rx) = mpsc::channel(10); // allow to buffer 10 chunks
         let stream = ReceiverStream::new(payload_rx).map_err(Error::from);
 
@@ -242,6 +244,7 @@ async fn backup_directory<P: AsRef<Path>>(
             &payload_target,
             stream,
             upload_options,
+            Some(payload_injections),
         );
 
         match futures::join!(stats, payload_stats) {
@@ -278,7 +281,7 @@ async fn backup_image<P: AsRef<Path>>(
     }
 
     let stats = client
-        .upload_stream(archive_name, stream, upload_options)
+        .upload_stream(archive_name, stream, upload_options, None)
         .await?;
 
     Ok(stats)
@@ -569,7 +572,7 @@ fn spawn_catalog_upload(
     let (catalog_tx, catalog_rx) = std::sync::mpsc::sync_channel(10); // allow to buffer 10 writes
     let catalog_stream = proxmox_async::blocking::StdChannelStream(catalog_rx);
     let catalog_chunk_size = 512 * 1024;
-    let catalog_chunk_stream = ChunkStream::new(catalog_stream, Some(catalog_chunk_size));
+    let catalog_chunk_stream = ChunkStream::new(catalog_stream, Some(catalog_chunk_size), None);
 
     let catalog_writer = Arc::new(Mutex::new(CatalogWriter::new(TokioWriterAdapter::new(
         StdChannelWriter::new(catalog_tx),
@@ -585,7 +588,7 @@ fn spawn_catalog_upload(
 
     tokio::spawn(async move {
         let catalog_upload_result = client
-            .upload_stream(CATALOG_NAME, catalog_chunk_stream, upload_options)
+            .upload_stream(CATALOG_NAME, catalog_chunk_stream, upload_options, None)
             .await;
 
         if let Err(ref err) = catalog_upload_result {
diff --git a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
index ea97976e6..0883d6cda 100644
--- a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
+++ b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
@@ -364,6 +364,7 @@ fn extract(
                         Flags::DEFAULT,
                         |_| Ok(()),
                         options,
+                        None,
                     )
                     .await
                 }
diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index 58c9d2cfd..d46c98d2b 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -405,6 +405,7 @@ async fn create_archive(
             Ok(())
         },
         options,
+        None,
     )
     .await?;
 
diff --git a/tests/catar.rs b/tests/catar.rs
index 9e96a8610..d5ef85ffe 100644
--- a/tests/catar.rs
+++ b/tests/catar.rs
@@ -39,6 +39,7 @@ fn run_test(dir_name: &str) -> Result<(), Error> {
         Flags::DEFAULT,
         |_| Ok(()),
         options,
+        None,
     ))?;
 
     Command::new("cmp")
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (39 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 14:54   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method Christian Ebner
                   ` (18 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Adds the specification for switching the detection mode used to
identify regular files which changed since a reference backup run.
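
Values accepted by the new schema look as follows (a hedged usage
fragment, assuming an enclosing fallible function for the `?` operator):

```rust
// plain modes, matched by the DETECTION_MODE_REGEX added here
let mode = parse_backup_detection_mode_specification("data")?;
assert!(!mode.is_metadata());

let mode = parse_backup_detection_mode_specification("metadata")?;
assert!(mode.is_metadata());

// per-archive suffix forms such as "metadata:root.pxar" also match the regex
```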

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- removed unneeded vector storing archive names for which to enable
  metadata mode, set either for all or none

 pbs-client/src/backup_specification.rs | 40 ++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/pbs-client/src/backup_specification.rs b/pbs-client/src/backup_specification.rs
index 619a3a9da..4b1dbd188 100644
--- a/pbs-client/src/backup_specification.rs
+++ b/pbs-client/src/backup_specification.rs
@@ -4,6 +4,7 @@ use proxmox_schema::*;
 
 const_regex! {
     BACKUPSPEC_REGEX = r"^([a-zA-Z0-9_-]+\.(pxar|img|conf|log)):(.+)$";
+    DETECTION_MODE_REGEX = r"^(data|metadata(:[a-zA-Z0-9_-]+\.pxar)*)$";
 }
 
 pub const BACKUP_SOURCE_SCHEMA: Schema =
@@ -11,6 +12,11 @@ pub const BACKUP_SOURCE_SCHEMA: Schema =
         .format(&ApiStringFormat::Pattern(&BACKUPSPEC_REGEX))
         .schema();
 
+pub const BACKUP_DETECTION_MODE_SPEC: Schema =
+    StringSchema::new("Backup detection mode specification ([data|metadata(:<label>,...)]).")
+        .format(&ApiStringFormat::Pattern(&DETECTION_MODE_REGEX))
+        .schema();
+
 pub enum BackupSpecificationType {
     PXAR,
     IMAGE,
@@ -45,3 +51,37 @@ pub fn parse_backup_specification(value: &str) -> Result<BackupSpecification, Er
 
     bail!("unable to parse backup source specification '{}'", value);
 }
+
+/// Mode to detect file changes since last backup run
+pub enum BackupDetectionMode {
+    /// Regular mode, re-encode payload data
+    Data,
+    /// Compare metadata, reuse payload chunks if metadata unchanged
+    Metadata,
+}
+
+impl BackupDetectionMode {
+    /// Check if the selected mode is metadata based file change detection
+    pub fn is_metadata(&self) -> bool {
+        match self {
+            Self::Data => false,
+            Self::Metadata => true,
+        }
+    }
+}
+
+pub fn parse_backup_detection_mode_specification(
+    value: &str,
+) -> Result<BackupDetectionMode, Error> {
+    match (DETECTION_MODE_REGEX.regex_obj)().captures(value) {
+        Some(caps) => {
+            let mode = match caps.get(1).unwrap().as_str() {
+                "data" => BackupDetectionMode::Data,
+                "metadata" => BackupDetectionMode::Metadata,
+                _ => bail!("invalid backup detection mode"),
+            };
+            Ok(mode)
+        }
+        None => bail!("unable to parse backup detection mode specification '{value}'"),
+    }
+}
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (40 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-05  8:01   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching Christian Ebner
                   ` (17 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Implement a method that prepares the decoder instance to access a
previous snapshot's metadata index and payload index, in order to
pass them to the pxar archiver. The archiver can then utilize these
to compare the metadata of files against the previous state and to
gather reusable chunks.
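
A hypothetical call-site fragment (the actual wiring lands in a later
patch of this series; the variable names here are placeholders):

```rust
// Ok(None) means no usable reference was found and the caller falls
// back to re-encoding everything
let previous_ref = prepare_reference(
    &target,
    previous_manifest.clone(),
    &backup_writer,
    backup_reader.clone(),
    crypt_config.clone(),
)
.await?;
```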

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- moved checks for reader and manifest to call site as suggested
- distinguish between previous manifest not having index and error state

 pbs-client/src/pxar/create.rs     | 14 +++++++-
 pbs-client/src/pxar/mod.rs        |  2 +-
 proxmox-backup-client/src/main.rs | 57 +++++++++++++++++++++++++++++--
 3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 2c7867f22..335e3556f 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -19,6 +19,7 @@ use nix::sys::stat::{FileStat, Mode};
 use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
 use pbs_datastore::index::IndexFile;
 use proxmox_sys::error::SysError;
+use pxar::accessor::aio::Accessor;
 use pxar::encoder::{LinkOffset, SeqWrite};
 use pxar::Metadata;
 
@@ -26,8 +27,9 @@ use proxmox_io::vec;
 use proxmox_lang::c_str;
 use proxmox_sys::fs::{self, acl, xattr};
 
+use crate::RemoteChunkReader;
 use pbs_datastore::catalog::BackupCatalogWriter;
-use pbs_datastore::dynamic_index::DynamicIndexReader;
+use pbs_datastore::dynamic_index::{DynamicIndexReader, LocalDynamicReadAt};
 
 use crate::inject_reused_chunks::InjectChunks;
 use crate::pxar::metadata::errno_is_unsupported;
@@ -49,6 +51,16 @@ pub struct PxarCreateOptions {
     pub skip_e2big_xattr: bool,
 }
 
+/// Stateful information of previous backup snapshots for partial backups
+pub struct PxarPrevRef {
+    /// Reference accessor for metadata comparison
+    pub accessor: Accessor<LocalDynamicReadAt<RemoteChunkReader>>,
+    /// Reference index for reusing payload chunks
+    pub payload_index: DynamicIndexReader,
+    /// Reference archive name for partial backups
+    pub archive_name: String,
+}
+
 fn detect_fs_type(fd: RawFd) -> Result<i64, Error> {
     let mut fs_stat = std::mem::MaybeUninit::uninit();
     let res = unsafe { libc::fstatfs(fd, fs_stat.as_mut_ptr()) };
diff --git a/pbs-client/src/pxar/mod.rs b/pbs-client/src/pxar/mod.rs
index b7dcf8362..76652094e 100644
--- a/pbs-client/src/pxar/mod.rs
+++ b/pbs-client/src/pxar/mod.rs
@@ -56,7 +56,7 @@ pub(crate) mod tools;
 mod flags;
 pub use flags::Flags;
 
-pub use create::{create_archive, PxarCreateOptions, PxarWriters};
+pub use create::{create_archive, PxarCreateOptions, PxarPrevRef, PxarWriters};
 pub use extract::{
     create_tar, create_zip, extract_archive, extract_sub_dir, extract_sub_dir_seq, ErrorHandler,
     OverwriteFlags, PxarExtractContext, PxarExtractOptions,
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 215095ee7..0b747453c 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -21,6 +21,7 @@ use proxmox_router::{cli::*, ApiMethod, RpcEnvironment};
 use proxmox_schema::api;
 use proxmox_sys::fs::{file_get_json, image_size, replace_file, CreateOptions};
 use proxmox_time::{epoch_i64, strftime_local};
+use pxar::accessor::aio::Accessor;
 use pxar::accessor::{MaybeReady, ReadAt, ReadAtOperation};
 
 use pbs_api_types::{
@@ -30,7 +31,7 @@ use pbs_api_types::{
     BACKUP_TYPE_SCHEMA, TRAFFIC_CONTROL_BURST_SCHEMA, TRAFFIC_CONTROL_RATE_SCHEMA,
 };
 use pbs_client::catalog_shell::Shell;
-use pbs_client::pxar::ErrorHandler as PxarErrorHandler;
+use pbs_client::pxar::{ErrorHandler as PxarErrorHandler, PxarPrevRef};
 use pbs_client::tools::{
     complete_archive_name, complete_auth_id, complete_backup_group, complete_backup_snapshot,
     complete_backup_source, complete_chunk_size, complete_group_or_snapshot,
@@ -50,7 +51,7 @@ use pbs_client::{
 };
 use pbs_datastore::catalog::{BackupCatalogWriter, CatalogReader, CatalogWriter};
 use pbs_datastore::chunk_store::verify_chunk_size;
-use pbs_datastore::dynamic_index::{BufferedDynamicReader, DynamicIndexReader};
+use pbs_datastore::dynamic_index::{BufferedDynamicReader, DynamicIndexReader, LocalDynamicReadAt};
 use pbs_datastore::fixed_index::FixedIndexReader;
 use pbs_datastore::index::IndexFile;
 use pbs_datastore::manifest::{
@@ -1177,6 +1178,58 @@ async fn create_backup(
     Ok(Value::Null)
 }
 
+async fn prepare_reference(
+    target: &str,
+    manifest: Arc<BackupManifest>,
+    backup_writer: &BackupWriter,
+    backup_reader: Arc<BackupReader>,
+    crypt_config: Option<Arc<CryptConfig>>,
+) -> Result<Option<PxarPrevRef>, Error> {
+    let (target, payload_target) = helper::get_pxar_archive_names(target);
+    let payload_target = payload_target.unwrap_or_default();
+
+    let metadata_ref_index = if let Ok(index) = backup_reader
+        .download_dynamic_index(&manifest, &target)
+        .await
+    {
+        index
+    } else {
+        log::info!("No previous metadata index, continue without reference");
+        return Ok(None);
+    };
+
+    if let Err(_err) = manifest.lookup_file_info(&payload_target) {
+        log::info!("No previous payload index found in manifest, continue without reference");
+        return Ok(None);
+    }
+
+    let known_payload_chunks = Arc::new(Mutex::new(HashSet::new()));
+    let payload_ref_index = backup_writer
+        .download_previous_dynamic_index(&payload_target, &manifest, known_payload_chunks)
+        .await?;
+
+    log::info!("Using previous index as metadata reference for '{target}'");
+
+    let most_used = metadata_ref_index.find_most_used_chunks(8);
+    let file_info = manifest.lookup_file_info(&target)?;
+    let chunk_reader = RemoteChunkReader::new(
+        backup_reader.clone(),
+        crypt_config.clone(),
+        file_info.chunk_crypt_mode(),
+        most_used,
+    );
+    let reader = BufferedDynamicReader::new(metadata_ref_index, chunk_reader);
+    let archive_size = reader.archive_size();
+    let reader = LocalDynamicReadAt::new(reader);
+    let accessor = Accessor::new(reader, archive_size, None).await?;
+
+    Ok(Some(pbs_client::pxar::PxarPrevRef {
+        accessor,
+        payload_index: payload_ref_index,
+        archive_name: target,
+    }))
+}
+
 async fn dump_image<W: Write>(
     client: Arc<BackupReader>,
     crypt_config: Option<Arc<CryptConfig>>,
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (41 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-05  7:52   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver Christian Ebner
                   ` (16 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

In preparation for the look-ahead caching used to temporarily store
entries before encoding them in the pxar archive, allowing to decide
whether to re-use or re-encode regular file entries.

Allows inserting and storing reused chunks in the archiver,
deduplicating chunks upon insert when possible.
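
The threshold decision itself boils down to a simple ratio check; a
self-contained sketch of the math, not the exact code below:

```rust
const CHUNK_PADDING_THRESHOLD: f64 = 0.1;

/// `total`: summed size of all reusable chunks,
/// `padding`: bytes in those chunks not covered by the reused payloads.
fn padding_acceptable(total: u64, padding: u64) -> bool {
    (padding as f64 / total as f64) < CHUNK_PADDING_THRESHOLD
}

// e.g. three 4 MiB chunks carrying 512 KiB of accumulated padding:
// 524288 / 12582912 ≈ 0.042, below the 10% threshold, so suggest re-use
```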

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- Strongly adapted and refactored: also keep track of the padding
  introduced by reusing the chunks, and suggest whether to re-use,
  re-encode or check the next entry based on a threshold
- completely removed the code which allowed calculating offsets based
  on chunks found in the middle; they must either be a continuation of
  the end or be added after it, otherwise offsets are not monotonically
  increasing, which is required for sequential restore

 pbs-client/src/pxar/create.rs | 126 +++++++++++++++++++++++++++++++++-
 1 file changed, 125 insertions(+), 1 deletion(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 335e3556f..95a91a59b 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -20,7 +20,7 @@ use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
 use pbs_datastore::index::IndexFile;
 use proxmox_sys::error::SysError;
 use pxar::accessor::aio::Accessor;
-use pxar::encoder::{LinkOffset, SeqWrite};
+use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
 use pxar::Metadata;
 
 use proxmox_io::vec;
@@ -36,6 +36,128 @@ use crate::pxar::metadata::errno_is_unsupported;
 use crate::pxar::tools::assert_single_path_component;
 use crate::pxar::Flags;
 
+const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
+
+#[derive(Default)]
+struct ReusedChunks {
+    start_boundary: PayloadOffset,
+    total: PayloadOffset,
+    padding: u64,
+    chunks: Vec<(u64, ReusableDynamicEntry)>,
+    must_flush_first: bool,
+    suggestion: Suggested,
+}
+
+#[derive(Copy, Clone, Default)]
+enum Suggested {
+    #[default]
+    CheckNext,
+    Reuse,
+    Reencode,
+}
+
+impl ReusedChunks {
+    fn new() -> Self {
+        Self::default()
+    }
+
+    fn start_boundary(&self) -> PayloadOffset {
+        self.start_boundary
+    }
+
+    fn is_empty(&self) -> bool {
+        self.chunks.is_empty()
+    }
+
+    fn suggested(&self) -> Suggested {
+        self.suggestion
+    }
+
+    fn insert(
+        &mut self,
+        indices: Vec<ReusableDynamicEntry>,
+        boundary: PayloadOffset,
+        start_padding: u64,
+        end_padding: u64,
+    ) -> PayloadOffset {
+        if self.is_empty() {
+            self.start_boundary = boundary;
+        }
+
+        if let Some(offset) = self.last_digest_matched(&indices) {
+            if let Some((padding, last)) = self.chunks.last_mut() {
+                // Existing chunk, update padding based on pre-existing one
+                // Start padding is expected to be larger than previous padding
+                *padding += start_padding - last.size();
+                self.padding += start_padding - last.size();
+            }
+
+            for chunk in indices.into_iter().skip(1) {
+                self.total = self.total.add(chunk.size());
+                self.chunks.push((0, chunk));
+            }
+
+            if let Some((padding, _last)) = self.chunks.last_mut() {
+                *padding += end_padding;
+                self.padding += end_padding;
+            }
+
+            let padding_ratio = self.padding as f64 / self.total.raw() as f64;
+            if self.chunks.len() > 1 && padding_ratio < CHUNK_PADDING_THRESHOLD {
+                self.suggestion = Suggested::Reuse;
+            }
+
+            self.start_boundary.add(offset + start_padding)
+        } else {
+            let offset = self.total.raw();
+
+            if let Some(first) = indices.first() {
+                self.total = self.total.add(first.size());
+                self.chunks.push((start_padding, first.clone()));
+                // New chunk, all start padding counts
+                self.padding += start_padding;
+            }
+
+            for chunk in indices.into_iter().skip(1) {
+                self.total = self.total.add(chunk.size());
+                self.chunks.push((chunk.size(), chunk));
+            }
+
+            if let Some((padding, _last)) = self.chunks.last_mut() {
+                *padding += end_padding;
+                self.padding += end_padding;
+            }
+
+            if self.chunks.len() > 2 {
+                let padding_ratio = self.padding as f64 / self.total.raw() as f64;
+                if padding_ratio < CHUNK_PADDING_THRESHOLD {
+                    self.suggestion = Suggested::Reuse;
+                } else {
+                    self.suggestion = Suggested::Reencode;
+                }
+            }
+
+            self.start_boundary.add(offset + start_padding)
+        }
+    }
+
+    fn last_digest_matched(&self, indices: &[ReusableDynamicEntry]) -> Option<u64> {
+        let digest = if let Some(first) = indices.first() {
+            first.digest()
+        } else {
+            return None;
+        };
+
+        if let Some(last) = self.chunks.last() {
+            if last.1.digest() == digest {
+                return Some(self.total.raw() - last.1.size());
+            }
+        }
+
+        None
+    }
+}
+
 /// Pxar options for creating a pxar archive/stream
 #[derive(Default, Clone)]
 pub struct PxarCreateOptions {
@@ -147,6 +269,7 @@ struct Archiver {
     hardlinks: HashMap<HardLinkInfo, (PathBuf, LinkOffset)>,
     file_copy_buffer: Vec<u8>,
     skip_e2big_xattr: bool,
+    reused_chunks: ReusedChunks,
     forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
 }
 
@@ -239,6 +362,7 @@ where
         hardlinks: HashMap::new(),
         file_copy_buffer: vec::undefined(4 * 1024 * 1024),
         skip_e2big_xattr: options.skip_e2big_xattr,
+        reused_chunks: ReusedChunks::new(),
         forced_boundaries,
     };
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (42 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-04 15:04   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison Christian Ebner
                   ` (15 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Read the previous snapshot's manifest and check if a split archive
with the same name is given. If so, create the accessor instance to
read the previous archive entries, to be able to look up and compare
the metadata for the entries, allowing to decide whether an entry is
reusable or not.
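
The selection of the reference snapshot can be sketched as follows
(fragment based on the hunk below; BackupWriter::start has already
created the new, still empty snapshot at this point):

```rust
// sort newest first; index 0 is the snapshot just created, so the
// previous snapshot to use as reference sits at index 1
list.sort_unstable_by(|a, b| b.backup.time.cmp(&a.backup.time));
if list.len() > 1 {
    let backup_dir: BackupDir = (snapshot.group.clone(), list[1].backup.time).into();
    // open a BackupReader on backup_dir and hand it to prepare_reference()
}
```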

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- renamed accessor to previous_metadata_accessor
- get backup reader for previous snapshot after creating the writer
  instance for the new snapshot
- adapted to only use metadata mode for all or none of the given archives

 pbs-client/src/pxar/create.rs                 | 55 ++++++++++++++++---
 proxmox-backup-client/src/main.rs             | 51 ++++++++++++++++-
 .../src/proxmox_restore_daemon/api.rs         |  1 +
 pxar-bin/src/main.rs                          |  1 +
 4 files changed, 97 insertions(+), 11 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 95a91a59b..79925bba2 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -19,7 +19,7 @@ use nix::sys::stat::{FileStat, Mode};
 use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
 use pbs_datastore::index::IndexFile;
 use proxmox_sys::error::SysError;
-use pxar::accessor::aio::Accessor;
+use pxar::accessor::aio::{Accessor, Directory};
 use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
 use pxar::Metadata;
 
@@ -159,7 +159,7 @@ impl ReusedChunks {
 }
 
 /// Pxar options for creating a pxar archive/stream
-#[derive(Default, Clone)]
+#[derive(Default)]
 pub struct PxarCreateOptions {
     /// Device/mountpoint st_dev numbers that should be included. None for no limitation.
     pub device_set: Option<HashSet<u64>>,
@@ -171,6 +171,8 @@ pub struct PxarCreateOptions {
     pub skip_lost_and_found: bool,
     /// Skip xattrs of files that return E2BIG error
     pub skip_e2big_xattr: bool,
+    /// Reference state for partial backups
+    pub previous_ref: Option<PxarPrevRef>,
 }
 
 /// Statefull information of previous backups snapshots for partial backups
@@ -270,6 +272,7 @@ struct Archiver {
     file_copy_buffer: Vec<u8>,
     skip_e2big_xattr: bool,
     reused_chunks: ReusedChunks,
+    previous_payload_index: Option<DynamicIndexReader>,
     forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
 }
 
@@ -346,6 +349,15 @@ where
             MatchType::Exclude,
         )?);
     }
+    let (previous_payload_index, previous_metadata_accessor) =
+        if let Some(refs) = options.previous_ref {
+            (
+                Some(refs.payload_index),
+                refs.accessor.open_root().await.ok(),
+            )
+        } else {
+            (None, None)
+        };
 
     let mut archiver = Archiver {
         feature_flags,
@@ -363,11 +375,12 @@ where
         file_copy_buffer: vec::undefined(4 * 1024 * 1024),
         skip_e2big_xattr: options.skip_e2big_xattr,
         reused_chunks: ReusedChunks::new(),
+        previous_payload_index,
         forced_boundaries,
     };
 
     archiver
-        .archive_dir_contents(&mut encoder, source_dir, true)
+        .archive_dir_contents(&mut encoder, previous_metadata_accessor, source_dir, true)
         .await?;
     encoder.finish().await?;
     encoder.close().await?;
@@ -399,6 +412,7 @@ impl Archiver {
     fn archive_dir_contents<'a, T: SeqWrite + Send>(
         &'a mut self,
         encoder: &'a mut Encoder<'_, T>,
+        mut previous_metadata_accessor: Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
         mut dir: Dir,
         is_root: bool,
     ) -> BoxFuture<'a, Result<(), Error>> {
@@ -433,9 +447,15 @@ impl Archiver {
 
                 (self.callback)(&file_entry.path)?;
                 self.path = file_entry.path;
-                self.add_entry(encoder, dir_fd, &file_entry.name, &file_entry.stat)
-                    .await
-                    .map_err(|err| self.wrap_err(err))?;
+                self.add_entry(
+                    encoder,
+                    &mut previous_metadata_accessor,
+                    dir_fd,
+                    &file_entry.name,
+                    &file_entry.stat,
+                )
+                .await
+                .map_err(|err| self.wrap_err(err))?;
             }
             self.path = old_path;
             self.entry_counter = entry_counter;
@@ -683,6 +703,7 @@ impl Archiver {
     async fn add_entry<T: SeqWrite + Send>(
         &mut self,
         encoder: &mut Encoder<'_, T>,
+        previous_metadata: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
         parent: RawFd,
         c_file_name: &CStr,
         stat: &FileStat,
@@ -772,7 +793,14 @@ impl Archiver {
                     catalog.lock().unwrap().start_directory(c_file_name)?;
                 }
                 let result = self
-                    .add_directory(encoder, dir, c_file_name, &metadata, stat)
+                    .add_directory(
+                        encoder,
+                        previous_metadata,
+                        dir,
+                        c_file_name,
+                        &metadata,
+                        stat,
+                    )
                     .await;
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().end_directory()?;
@@ -825,6 +853,7 @@ impl Archiver {
     async fn add_directory<T: SeqWrite + Send>(
         &mut self,
         encoder: &mut Encoder<'_, T>,
+        previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
         dir: Dir,
         dir_name: &CStr,
         metadata: &Metadata,
@@ -855,7 +884,17 @@ impl Archiver {
             log::info!("skipping mount point: {:?}", self.path);
             Ok(())
         } else {
-            self.archive_dir_contents(encoder, dir, false).await
+            let mut dir_accessor = None;
+            if let Some(accessor) = previous_metadata_accessor.as_mut() {
+                if let Some(file_entry) = accessor.lookup(dir_name).await? {
+                    if file_entry.entry().is_dir() {
+                        let dir = file_entry.enter_directory().await?;
+                        dir_accessor = Some(dir);
+                    }
+                }
+            }
+            self.archive_dir_contents(encoder, dir_accessor, dir, false)
+                .await
         };
 
         self.fs_magic = old_fs_magic;
diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 0b747453c..66dcaa63e 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -688,6 +688,10 @@ fn spawn_catalog_upload(
                schema: TRAFFIC_CONTROL_BURST_SCHEMA,
                optional: true,
            },
+           "change-detection-mode": {
+               schema: BACKUP_DETECTION_MODE_SPEC,
+               optional: true,
+           },
            "exclude": {
                type: Array,
                description: "List of paths or patterns for matching files to exclude.",
@@ -882,6 +886,9 @@ async fn create_backup(
 
     let backup_time = backup_time_opt.unwrap_or_else(epoch_i64);
 
+    let detection_mode = param["change-detection-mode"].as_str().unwrap_or("data");
+    let detection_mode = parse_backup_detection_mode_specification(detection_mode)?;
+
     let http_client = connect_rate_limited(&repo, rate_limit)?;
     record_repository(&repo);
 
@@ -982,7 +989,7 @@ async fn create_backup(
         None
     };
 
-    let mut manifest = BackupManifest::new(snapshot);
+    let mut manifest = BackupManifest::new(snapshot.clone());
 
     let mut catalog = None;
     let mut catalog_result_rx = None;
@@ -1029,14 +1036,13 @@ async fn create_backup(
                 manifest.add_file(target, stats.size, stats.csum, crypto.mode)?;
             }
             (BackupSpecificationType::PXAR, false) => {
-                let metadata_mode = false; // Until enabled via param
-
                 let target_base = if let Some(base) = target_base.strip_suffix(".pxar") {
                     base.to_string()
                 } else {
                     bail!("unexpected suffix in target: {target_base}");
                 };
 
+                let metadata_mode = detection_mode.is_metadata();
                 let (target, payload_target) = if metadata_mode {
                     (
                         format!("{target_base}.mpxar.{extension}"),
@@ -1061,12 +1067,51 @@ async fn create_backup(
                     .unwrap()
                     .start_directory(std::ffi::CString::new(target.as_str())?.as_c_str())?;
 
+                let mut previous_ref = None;
+                if metadata_mode {
+                    if let Some(ref manifest) = previous_manifest {
+                        let list = api_datastore_list_snapshots(
+                            &http_client,
+                            repo.store(),
+                            &backup_ns,
+                            Some(&snapshot.group),
+                        )
+                        .await?;
+                        let mut list: Vec<SnapshotListItem> = serde_json::from_value(list)?;
+
+                        // BackupWriter::start created a new snapshot, get the one before
+                        if list.len() > 1 {
+                            list.sort_unstable_by(|a, b| b.backup.time.cmp(&a.backup.time));
+                            let backup_dir: BackupDir =
+                                (snapshot.group.clone(), list[1].backup.time).into();
+                            let backup_reader = BackupReader::start(
+                                &http_client,
+                                crypt_config.clone(),
+                                repo.store(),
+                                &backup_ns,
+                                &backup_dir,
+                                true,
+                            )
+                            .await?;
+                            previous_ref = prepare_reference(
+                                &target,
+                                manifest.clone(),
+                                &client,
+                                backup_reader.clone(),
+                                crypt_config.clone(),
+                            )
+                            .await?
+                        }
+                    }
+                }
+
                 let pxar_options = pbs_client::pxar::PxarCreateOptions {
                     device_set: devices.clone(),
                     patterns: pattern_list.clone(),
                     entries_max: entries_max as usize,
                     skip_lost_and_found,
                     skip_e2big_xattr,
+                    previous_ref,
                 };
 
                 let upload_options = UploadOptions {
diff --git a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
index 0883d6cda..e50cb8184 100644
--- a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
+++ b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
@@ -355,6 +355,7 @@ fn extract(
                         patterns,
                         skip_lost_and_found: false,
                         skip_e2big_xattr: false,
+                        previous_ref: None,
                     };
 
                     let pxar_writer = TokioWriter::new(writer);
diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
index d46c98d2b..c6d3794bb 100644
--- a/pxar-bin/src/main.rs
+++ b/pxar-bin/src/main.rs
@@ -358,6 +358,7 @@ async fn create_archive(
         patterns,
         skip_lost_and_found: false,
         skip_e2big_xattr: false,
+        previous_ref: None,
     };
 
     let source = PathBuf::from(source);
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (43 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-05  8:08   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 46/58] pxar: caching: add look-ahead cache types Christian Ebner
                   ` (14 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Adds a method to compare the metadata of the current file entry
against the metadata of the entry looked up in the previous backup
snapshot.

If the metadata matches, the start offset of the entry's payload
within the payload stream is returned.

This is in preparation for reusing payload chunks for unchanged files.
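
The intended call pattern roughly looks like the following minimal
sketch; the caller side only lands in a later patch of this series,
and `reuse_payload_chunks`/`reencode_file` are hypothetical
placeholders:

```rust
// Sketch: `previous_metadata_accessor` points into the previous
// snapshot's metadata archive, the other arguments describe the
// current file system entry.
if let Some(payload_start) = self
    .is_reusable_entry(previous_metadata_accessor, file_name, stat, metadata)
    .await?
{
    // Metadata unchanged: the offset locates the old payload, so its
    // chunks can be looked up in the previous payload index for re-use.
    reuse_payload_chunks(payload_start)?;
} else {
    // Changed, hardlinked or not found: encode the file regularly.
    reencode_file()?;
}
```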

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- refactored to new padding based threshold

 pbs-client/src/pxar/create.rs | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 79925bba2..c64084a74 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -21,7 +21,7 @@ use pbs_datastore::index::IndexFile;
 use proxmox_sys::error::SysError;
 use pxar::accessor::aio::{Accessor, Directory};
 use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
-use pxar::Metadata;
+use pxar::{EntryKind, Metadata};
 
 use proxmox_io::vec;
 use proxmox_lang::c_str;
@@ -466,6 +466,35 @@ impl Archiver {
         .boxed()
     }
 
+    async fn is_reusable_entry(
+        &mut self,
+        previous_metadata_accessor: &mut Directory<LocalDynamicReadAt<RemoteChunkReader>>,
+        file_name: &Path,
+        stat: &FileStat,
+        metadata: &Metadata,
+    ) -> Result<Option<u64>, Error> {
+        if stat.st_nlink > 1 {
+            log::debug!("re-encode: {file_name:?} has hardlinks.");
+            return Ok(None);
+        }
+
+        if let Some(file_entry) = previous_metadata_accessor.lookup(file_name).await? {
+            if metadata == file_entry.metadata() {
+                if let EntryKind::File { payload_offset, .. } = file_entry.entry().kind() {
+                    log::debug!("possible re-use: {file_name:?} at offset {payload_offset:?} has unchanged metadata.");
+                    return Ok(*payload_offset);
+                }
+                log::debug!("re-encode: {file_name:?} not a regular file.");
+                return Ok(None);
+            }
+            log::debug!("re-encode: {file_name:?} metadata did not match.");
+            return Ok(None);
+        }
+
+        log::debug!("re-encode: {file_name:?} not found in previous archive.");
+        Ok(None)
+    }
+
     /// openat() wrapper which allows but logs `EACCES` and turns `ENOENT` into `None`.
     ///
     /// The `existed` flag is set when iterating through a directory to note that we know the file
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 46/58] pxar: caching: add look-ahead cache types
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (44 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching Christian Ebner
                   ` (13 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

The look-ahead cache is used to cache entries during pxar archive
creation before encoding, in order to decide if regular file payloads
might be re-used rather than re-encoded.

These types allow storing the needed data and keeping track of
directory boundaries while traversing the filesystem tree.
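
To illustrate the boundary tracking: a `DirEntry` opens a directory on
the cache and a later `DirEnd` closes it again, so replaying the
cached entries in order reproduces the tree structure. A simplified,
self-contained sketch with the per-entry data reduced to a name:

```rust
// Simplified stand-in for the cache entry type added below.
enum CacheEntry {
    RegEntry(&'static str),
    DirEntry(&'static str),
    DirEnd,
}

fn main() {
    // Caching a small tree: dir/ containing file_a and file_b.
    let cached = vec![
        CacheEntry::DirEntry("dir"),
        CacheEntry::RegEntry("file_a"),
        CacheEntry::RegEntry("file_b"),
        CacheEntry::DirEnd, // closes "dir" again on flush
    ];

    // Flushing replays the entries in order, recreating the structure.
    for entry in &cached {
        match entry {
            CacheEntry::DirEntry(name) => println!("start directory {name}"),
            CacheEntry::RegEntry(name) => println!("encode file {name}"),
            CacheEntry::DirEnd => println!("end directory"),
        }
    }
}
```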

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- removed caching for cli exclude patterns, these will be stored in own
  entry type within the metadata archive

 pbs-client/src/pxar/look_ahead_cache.rs | 38 +++++++++++++++++++++++++
 pbs-client/src/pxar/mod.rs              |  1 +
 2 files changed, 39 insertions(+)
 create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs

diff --git a/pbs-client/src/pxar/look_ahead_cache.rs b/pbs-client/src/pxar/look_ahead_cache.rs
new file mode 100644
index 000000000..68f3fd1f2
--- /dev/null
+++ b/pbs-client/src/pxar/look_ahead_cache.rs
@@ -0,0 +1,38 @@
+use nix::sys::stat::FileStat;
+use pxar::encoder::PayloadOffset;
+use std::ffi::CString;
+use std::os::unix::io::OwnedFd;
+
+use pxar::Metadata;
+
+pub(crate) struct CacheEntryData {
+    pub(crate) fd: OwnedFd,
+    pub(crate) c_file_name: CString,
+    pub(crate) stat: FileStat,
+    pub(crate) metadata: Metadata,
+    pub(crate) payload_offset: PayloadOffset,
+}
+
+impl CacheEntryData {
+    pub(crate) fn new(
+        fd: OwnedFd,
+        c_file_name: CString,
+        stat: FileStat,
+        metadata: Metadata,
+        payload_offset: PayloadOffset,
+    ) -> Self {
+        Self {
+            fd,
+            c_file_name,
+            stat,
+            metadata,
+            payload_offset,
+        }
+    }
+}
+
+pub(crate) enum CacheEntry {
+    RegEntry(CacheEntryData),
+    DirEntry(CacheEntryData),
+    DirEnd,
+}
diff --git a/pbs-client/src/pxar/mod.rs b/pbs-client/src/pxar/mod.rs
index 76652094e..6567b93ad 100644
--- a/pbs-client/src/pxar/mod.rs
+++ b/pbs-client/src/pxar/mod.rs
@@ -50,6 +50,7 @@
 pub(crate) mod create;
 pub(crate) mod dir_stack;
 pub(crate) mod extract;
+pub(crate) mod look_ahead_cache;
 pub(crate) mod metadata;
 pub(crate) mod tools;
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (45 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 46/58] pxar: caching: add look-ahead cache types Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-05  8:33   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 48/58] fix #3174: client: pxar: enable caching and meta comparison Christian Ebner
                   ` (12 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Implements the methods to cache entries in a look-ahead cache and
flush the entries to the archive, either by re-using and injecting the
payload chunks from the previous backup snapshot and storing the
references to them, or by re-encoding the chunks.

When walking the file system tree, check for each entry if it is
re-usable, meaning that the metadata did not change and the payload
chunks can be re-indexed instead of re-encoding the whole data.
Since the amount of payload data might be small compared to the actual
chunk size, the decision whether to re-use or re-encode is postponed
as long as the padding introduced by the reusable chunks stays within
bounds and the chunks are continuous.
In this case, put the entry's file handle and metadata on the cache,
enable caching mode, and continue with the next entry.
Reusable chunk digests and sizes, as well as reference offsets to the
start of regular file payloads within the payload stream, are stored
in memory, to be injected for re-usable file entries.

If the padding stays below the re-use threshold, the chunks are
injected into the payload stream and the references with the
corresponding offsets are encoded in the metadata stream.
If however a non-reusable (because changed) entry is encountered
before a decision could be made, the entries on the cache are flushed
to the archive by re-encoding them, and the memorized chunks and
payload reference offsets are discarded.

Since multiple files might be contained within a single chunk, chunk
deduplication is also assured when the re-use threshold is reached, by
keeping back the last chunk in the memorized list, so following files
might re-use that chunk as well. That chunk is guaranteed to be
injected into the stream even if the following lookups lead to a cache
clear and re-encoding.

Directory boundaries are cached as well, and written as part of the
encoding when flushing.
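
The padding-based decision can be summarized with a small
self-contained sketch; the real logic lives in
`ReusedChunks::suggested()` and is more involved, so treat the exact
conditions below as illustrative only:

```rust
#[derive(Debug, PartialEq)]
enum Suggested {
    Reuse,     // inject the chunks, encode payload references
    Reencode,  // flush the cache, encode the files regularly
    CheckNext, // postpone the decision, cache the next entry as well
}

// 10% as in the CHUNK_PADDING_THRESHOLD used by this series.
fn suggest(padding: u64, reusable_size: u64, continuous: bool) -> Suggested {
    const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
    if padding as f64 / reusable_size as f64 < CHUNK_PADDING_THRESHOLD {
        Suggested::Reuse
    } else if continuous {
        Suggested::CheckNext // keep gathering entries
    } else {
        Suggested::Reencode
    }
}

fn main() {
    // 256 KiB padding within 4 MiB of reusable chunks: 6.25% < 10%
    assert_eq!(suggest(256 << 10, 4 << 20, true), Suggested::Reuse);
    // 1 MiB padding within 4 MiB: 25%, postpone while chunks continue
    assert_eq!(suggest(1 << 20, 4 << 20, true), Suggested::CheckNext);
}
```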

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- completely reworked
- strongly reduced duplicate code

 pbs-client/src/pxar/create.rs | 259 ++++++++++++++++++++++++++++++++++
 1 file changed, 259 insertions(+)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index c64084a74..07fa17ec4 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet, VecDeque};
 use std::ffi::{CStr, CString, OsStr};
 use std::fmt;
 use std::io::{self, Read};
+use std::mem::size_of;
 use std::ops::Range;
 use std::os::unix::ffi::OsStrExt;
 use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
@@ -23,6 +24,7 @@ use pxar::accessor::aio::{Accessor, Directory};
 use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
 use pxar::{EntryKind, Metadata};
 
+use proxmox_human_byte::HumanByte;
 use proxmox_io::vec;
 use proxmox_lang::c_str;
 use proxmox_sys::fs::{self, acl, xattr};
@@ -32,6 +34,7 @@ use pbs_datastore::catalog::BackupCatalogWriter;
 use pbs_datastore::dynamic_index::{DynamicIndexReader, LocalDynamicReadAt};
 
 use crate::inject_reused_chunks::InjectChunks;
+use crate::pxar::look_ahead_cache::{CacheEntry, CacheEntryData};
 use crate::pxar::metadata::errno_is_unsupported;
 use crate::pxar::tools::assert_single_path_component;
 use crate::pxar::Flags;
@@ -274,6 +277,12 @@ struct Archiver {
     reused_chunks: ReusedChunks,
     previous_payload_index: Option<DynamicIndexReader>,
     forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
+    cached_entries: Vec<CacheEntry>,
+    caching_enabled: bool,
+    total_injected_size: u64,
+    total_injected_count: u64,
+    partial_chunks_count: u64,
+    total_reused_payload_size: u64,
 }
 
 type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
@@ -377,6 +386,12 @@ where
         reused_chunks: ReusedChunks::new(),
         previous_payload_index,
         forced_boundaries,
+        cached_entries: Vec::new(),
+        caching_enabled: false,
+        total_injected_size: 0,
+        total_injected_count: 0,
+        partial_chunks_count: 0,
+        total_reused_payload_size: 0,
     };
 
     archiver
@@ -879,6 +894,250 @@ impl Archiver {
         }
     }
 
+    async fn cache_or_flush_entries<T: SeqWrite + Send>(
+        &mut self,
+        encoder: &mut Encoder<'_, T>,
+        previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
+        c_file_name: &CStr,
+        stat: &FileStat,
+        fd: OwnedFd,
+        metadata: &Metadata,
+    ) -> Result<(), Error> {
+        let file_name: &Path = OsStr::from_bytes(c_file_name.to_bytes()).as_ref();
+        let reusable = if let Some(accessor) = previous_metadata_accessor {
+            self.is_reusable_entry(accessor, file_name, stat, metadata)
+                .await?
+        } else {
+            None
+        };
+
+        let file_size = stat.st_size as u64;
+        if let Some(start_offset) = reusable {
+            if let Some(ref ref_payload_index) = self.previous_payload_index {
+                let payload_size = file_size + size_of::<pxar::format::Header>() as u64;
+                let end_offset = start_offset + payload_size;
+                let (indices, start_padding, end_padding) =
+                    lookup_dynamic_entries(ref_payload_index, start_offset..end_offset)?;
+
+                let boundary = encoder.payload_position()?;
+                let offset =
+                    self.reused_chunks
+                        .insert(indices, boundary, start_padding, end_padding);
+
+                self.caching_enabled = true;
+                let cache_entry = CacheEntry::RegEntry(CacheEntryData::new(
+                    fd,
+                    c_file_name.into(),
+                    *stat,
+                    metadata.clone(),
+                    offset,
+                ));
+                self.cached_entries.push(cache_entry);
+
+                match self.reused_chunks.suggested() {
+                    Suggested::Reuse => self.flush_cached_to_archive(encoder, true, true).await?,
+                    Suggested::Reencode => {
+                        self.flush_cached_to_archive(encoder, false, true).await?
+                    }
+                    Suggested::CheckNext => {}
+                }
+
+                return Ok(());
+            }
+        }
+
+        self.flush_cached_to_archive(encoder, false, true).await?;
+        self.add_entry(encoder, previous_metadata_accessor, fd.as_raw_fd(), c_file_name, stat)
+            .await
+    }
+
+    async fn flush_cached_to_archive<T: SeqWrite + Send>(
+        &mut self,
+        encoder: &mut Encoder<'_, T>,
+        reuse_chunks: bool,
+        keep_back_last_chunk: bool,
+    ) -> Result<(), Error> {
+        let entries = std::mem::take(&mut self.cached_entries);
+
+        if !reuse_chunks {
+            self.clear_cached_chunks(encoder)?;
+        }
+
+        for entry in entries {
+            match entry {
+                CacheEntry::RegEntry(CacheEntryData {
+                    fd,
+                    c_file_name,
+                    stat,
+                    metadata,
+                    payload_offset,
+                }) => {
+                    self.add_entry_to_archive(
+                        encoder,
+                        &mut None,
+                        &c_file_name,
+                        &stat,
+                        fd,
+                        &metadata,
+                        reuse_chunks,
+                        Some(payload_offset),
+                    )
+                    .await?
+                }
+                CacheEntry::DirEntry(CacheEntryData {
+                    c_file_name,
+                    metadata,
+                    ..
+                }) => {
+                    if let Some(ref catalog) = self.catalog {
+                        catalog.lock().unwrap().start_directory(&c_file_name)?;
+                    }
+                    let dir_name = OsStr::from_bytes(c_file_name.to_bytes());
+                    encoder.create_directory(dir_name, &metadata).await?;
+                }
+                CacheEntry::DirEnd => {
+                    encoder.finish().await?;
+                    if let Some(ref catalog) = self.catalog {
+                        catalog.lock().unwrap().end_directory()?;
+                    }
+                }
+            }
+        }
+
+        self.caching_enabled = false;
+
+        if reuse_chunks {
+            self.flush_reused_chunks(encoder, keep_back_last_chunk)?;
+        }
+
+        Ok(())
+    }
+
+    fn flush_reused_chunks<T: SeqWrite + Send>(
+        &mut self,
+        encoder: &mut Encoder<'_, T>,
+        keep_back_last_chunk: bool,
+    ) -> Result<(), Error> {
+        let mut reused_chunks = std::mem::take(&mut self.reused_chunks);
+
+        // Do not inject the last reused chunk directly, but keep it as base for further entries
+        // to reduce chunk duplication. Needs to be flushed even on cache clear!
+        let last_chunk = if keep_back_last_chunk {
+            reused_chunks.chunks.pop()
+        } else {
+            None
+        };
+
+        let mut injection_boundary = reused_chunks.start_boundary();
+        let payload_writer_position = encoder.payload_position()?.raw();
+
+        if !reused_chunks.chunks.is_empty() && injection_boundary.raw() != payload_writer_position {
+            bail!(
+                "encoder payload writer position out of sync: got {payload_writer_position}, expected {}",
+                injection_boundary.raw(),
+            );
+        }
+
+        for chunks in reused_chunks.chunks.chunks(128) {
+            let mut chunk_list = Vec::with_capacity(128);
+            let mut size = PayloadOffset::default();
+            for (padding, chunk) in chunks.iter() {
+                log::debug!(
+                    "Injecting chunk with {} padding (chunk size {})",
+                    HumanByte::from(*padding),
+                    HumanByte::from(chunk.size()),
+                );
+                self.total_injected_size += chunk.size();
+                self.total_injected_count += 1;
+                if *padding > 0 {
+                    self.partial_chunks_count += 1;
+                }
+                size = size.add(chunk.size());
+                chunk_list.push(chunk.clone());
+            }
+
+            let inject_chunks = InjectChunks {
+                boundary: injection_boundary.raw(),
+                chunks: chunk_list,
+                size: size.raw() as usize,
+            };
+
+            if let Some(boundary) = self.forced_boundaries.as_mut() {
+                let mut boundary = boundary.lock().unwrap();
+                boundary.push_back(inject_chunks);
+            } else {
+                bail!("missing injection queue");
+            };
+
+            injection_boundary = injection_boundary.add(size.raw());
+            encoder.advance(size)?;
+        }
+
+        if let Some((padding, chunk)) = last_chunk {
+            // Make sure that we flush this chunk even on clear calls
+            self.reused_chunks.must_flush_first = true;
+            let _offset = self
+                .reused_chunks
+                .insert(vec![chunk], injection_boundary, padding, 0);
+        }
+
+        Ok(())
+    }
+
+    fn clear_cached_chunks<T: SeqWrite + Send>(
+        &mut self,
+        encoder: &mut Encoder<'_, T>,
+    ) -> Result<(), Error> {
+        let reused_chunks = std::mem::take(&mut self.reused_chunks);
+
+        if !reused_chunks.must_flush_first {
+            return Ok(());
+        }
+
+        // First chunk was kept back to avoid duplication but needs to be injected
+        let injection_boundary = reused_chunks.start_boundary();
+        let payload_writer_position = encoder.payload_position()?.raw();
+
+        if !reused_chunks.chunks.is_empty() && injection_boundary.raw() != payload_writer_position {
+            bail!(
+                "encoder payload writer position out of sync: got {payload_writer_position}, expected {}",
+                injection_boundary.raw()
+            );
+        }
+
+        if let Some((padding, chunk)) = reused_chunks.chunks.first() {
+            let size = PayloadOffset::default().add(chunk.size());
+            log::debug!(
+                "Injecting chunk with {} padding (chunk size {})",
+                HumanByte::from(*padding),
+                HumanByte::from(chunk.size()),
+            );
+            let inject_chunks = InjectChunks {
+                boundary: injection_boundary.raw(),
+                chunks: vec![chunk.clone()],
+                size: size.raw() as usize,
+            };
+
+            self.total_injected_size += size.raw();
+            self.total_injected_count += 1;
+            if *padding > 0 {
+                self.partial_chunks_count += 1;
+            }
+
+            if let Some(boundary) = self.forced_boundaries.as_mut() {
+                let mut boundary = boundary.lock().unwrap();
+                boundary.push_back(inject_chunks);
+            } else {
+                bail!("missing injection queue");
+            };
+            encoder.advance(size)?;
+        } else {
+            bail!("missing first chunk");
+        }
+
+        Ok(())
+    }
+
     async fn add_directory<T: SeqWrite + Send>(
         &mut self,
         encoder: &mut Encoder<'_, T>,
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 48/58] fix #3174: client: pxar: enable caching and meta comparison
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (46 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata Christian Ebner
                   ` (11 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Add the final glue logic to enable the look-ahead caching and
metadata comparison introduced in the preparatory patches.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- use reused chunk padding for final decision if entries with unchanged
  metadata should be re-used or re-encoded

 pbs-client/src/pxar/create.rs | 194 +++++++++++++++++++++++++++-------
 1 file changed, 156 insertions(+), 38 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 07fa17ec4..f103127c4 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -40,6 +40,7 @@ use crate::pxar::tools::assert_single_path_component;
 use crate::pxar::Flags;
 
 const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
+const MAX_CACHE_SIZE: usize = 512;
 
 #[derive(Default)]
 struct ReusedChunks {
@@ -397,6 +398,12 @@ where
     archiver
         .archive_dir_contents(&mut encoder, previous_metadata_accessor, source_dir, true)
         .await?;
+
+    // Re-encode all the remaining cached entries
+    archiver
+        .flush_cached_to_archive(&mut encoder, false, false)
+        .await?;
+
     encoder.finish().await?;
     encoder.close().await?;
 
@@ -454,7 +461,10 @@ impl Archiver {
             for file_entry in file_list {
                 let file_name = file_entry.name.to_bytes();
 
-                if is_root && file_name == b".pxarexclude-cli" {
+                if is_root
+                    && file_name == b".pxarexclude-cli"
+                    && previous_metadata_accessor.is_none()
+                {
                     self.encode_pxarexclude_cli(encoder, &file_entry.name, old_patterns_count)
                         .await?;
                     continue;
@@ -472,6 +482,11 @@ impl Archiver {
                 .await
                 .map_err(|err| self.wrap_err(err))?;
             }
+
+            if self.caching_enabled && !is_root {
+                self.cached_entries.push(CacheEntry::DirEnd);
+            }
+
             self.path = old_path;
             self.entry_counter = entry_counter;
             self.patterns.truncate(old_patterns_count);
@@ -752,8 +767,6 @@ impl Archiver {
         c_file_name: &CStr,
         stat: &FileStat,
     ) -> Result<(), Error> {
-        use pxar::format::mode;
-
         let file_mode = stat.st_mode & libc::S_IFMT;
         let open_mode = if file_mode == libc::S_IFREG || file_mode == libc::S_IFDIR {
             OFlag::empty()
@@ -791,6 +804,97 @@ impl Archiver {
             self.skip_e2big_xattr,
         )?;
 
+        if self.previous_payload_index.is_none() {
+            return self
+                .add_entry_to_archive(
+                    encoder,
+                    previous_metadata,
+                    c_file_name,
+                    stat,
+                    fd,
+                    &metadata,
+                    false,
+                    None,
+                )
+                .await;
+        }
+
+        // Avoid having too many open file handles in cached entries
+        if self.cached_entries.len() > MAX_CACHE_SIZE {
+            self.flush_cached_to_archive(encoder, true, true).await?;
+        }
+
+        if metadata.is_regular_file() {
+            self.cache_or_flush_entries(
+                encoder,
+                previous_metadata,
+                c_file_name,
+                stat,
+                fd,
+                &metadata,
+            )
+            .await
+        } else if self.caching_enabled {
+            if stat.st_mode & libc::S_IFMT == libc::S_IFDIR {
+                let fd_clone = fd.try_clone()?;
+                let cache_entry = CacheEntry::DirEntry(CacheEntryData::new(
+                    fd,
+                    c_file_name.into(),
+                    *stat,
+                    metadata.clone(),
+                    PayloadOffset::default(),
+                ));
+                self.cached_entries.push(cache_entry);
+
+                let dir = Dir::from_fd(fd_clone.into_raw_fd())?;
+                self.add_directory(
+                    encoder,
+                    previous_metadata,
+                    dir,
+                    c_file_name,
+                    &metadata,
+                    stat,
+                )
+                .await?;
+            } else {
+                let cache_entry = CacheEntry::RegEntry(CacheEntryData::new(
+                    fd,
+                    c_file_name.into(),
+                    *stat,
+                    metadata,
+                    PayloadOffset::default(),
+                ));
+                self.cached_entries.push(cache_entry);
+            }
+            Ok(())
+        } else {
+            self.add_entry_to_archive(
+                encoder,
+                previous_metadata,
+                c_file_name,
+                stat,
+                fd,
+                &metadata,
+                false,
+                None,
+            )
+            .await
+        }
+    }
+
+    async fn add_entry_to_archive<T: SeqWrite + Send>(
+        &mut self,
+        encoder: &mut Encoder<'_, T>,
+        previous_metadata: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
+        c_file_name: &CStr,
+        stat: &FileStat,
+        fd: OwnedFd,
+        metadata: &Metadata,
+        flush_reused: bool,
+        payload_offset: Option<PayloadOffset>,
+    ) -> Result<(), Error> {
+        use pxar::format::mode;
+
         let file_name: &Path = OsStr::from_bytes(c_file_name.to_bytes()).as_ref();
         match metadata.file_type() {
             mode::IFREG => {
@@ -819,72 +923,64 @@ impl Archiver {
                         .add_file(c_file_name, file_size, stat.st_mtime)?;
                 }
 
-                let offset: LinkOffset = self
-                    .add_regular_file(encoder, fd, file_name, &metadata, file_size)
-                    .await?;
+                if flush_reused {
+                    self.total_reused_payload_size +=
+                        file_size + size_of::<pxar::format::Header>() as u64;
+                    encoder
+                        .add_payload_ref(metadata, file_name, file_size, payload_offset.unwrap())
+                        .await?;
+                } else {
+                    let offset: LinkOffset = self
+                        .add_regular_file(encoder, fd, file_name, metadata, file_size)
+                        .await?;
 
-                if stat.st_nlink > 1 {
-                    self.hardlinks
-                        .insert(link_info, (self.path.clone(), offset));
+                    if stat.st_nlink > 1 {
+                        self.hardlinks
+                            .insert(link_info, (self.path.clone(), offset));
+                    }
                 }
 
                 Ok(())
             }
             mode::IFDIR => {
                 let dir = Dir::from_fd(fd.into_raw_fd())?;
-
-                if let Some(ref catalog) = self.catalog {
-                    catalog.lock().unwrap().start_directory(c_file_name)?;
-                }
-                let result = self
-                    .add_directory(
-                        encoder,
-                        previous_metadata,
-                        dir,
-                        c_file_name,
-                        &metadata,
-                        stat,
-                    )
-                    .await;
-                if let Some(ref catalog) = self.catalog {
-                    catalog.lock().unwrap().end_directory()?;
-                }
-                result
+                self.add_directory(encoder, previous_metadata, dir, c_file_name, metadata, stat)
+                    .await
             }
             mode::IFSOCK => {
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().add_socket(c_file_name)?;
                 }
 
-                Ok(encoder.add_socket(&metadata, file_name).await?)
+                Ok(encoder.add_socket(metadata, file_name).await?)
             }
             mode::IFIFO => {
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().add_fifo(c_file_name)?;
                 }
 
-                Ok(encoder.add_fifo(&metadata, file_name).await?)
+                Ok(encoder.add_fifo(metadata, file_name).await?)
             }
             mode::IFLNK => {
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().add_symlink(c_file_name)?;
                 }
 
-                self.add_symlink(encoder, fd, file_name, &metadata).await
+                self.add_symlink(encoder, fd, file_name, metadata).await
             }
             mode::IFBLK => {
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().add_block_device(c_file_name)?;
                 }
 
-                self.add_device(encoder, file_name, &metadata, stat).await
+                self.add_device(encoder, file_name, metadata, stat).await
             }
             mode::IFCHR => {
                 if let Some(ref catalog) = self.catalog {
                     catalog.lock().unwrap().add_char_device(c_file_name)?;
                 }
 
-                self.add_device(encoder, file_name, &metadata, stat).await
+                self.add_device(encoder, file_name, metadata, stat).await
             }
             other => bail!(
                 "encountered unknown file type: 0x{:x} (0o{:o})",
@@ -947,8 +1043,17 @@ impl Archiver {
         }
 
         self.flush_cached_to_archive(encoder, false, true).await?;
-        self.add_entry(encoder, previous_metadata_accessor, fd.as_raw_fd(), c_file_name, stat)
-            .await
+        self.add_entry_to_archive(
+            encoder,
+            previous_metadata_accessor,
+            c_file_name,
+            stat,
+            fd,
+            metadata,
+            false,
+            None,
+        )
+        .await
     }
 
     async fn flush_cached_to_archive<T: SeqWrite + Send>(
@@ -1143,13 +1248,18 @@ impl Archiver {
         encoder: &mut Encoder<'_, T>,
         previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
         dir: Dir,
-        dir_name: &CStr,
+        c_dir_name: &CStr,
         metadata: &Metadata,
         stat: &FileStat,
     ) -> Result<(), Error> {
-        let dir_name = OsStr::from_bytes(dir_name.to_bytes());
+        let dir_name = OsStr::from_bytes(c_dir_name.to_bytes());
 
-        encoder.create_directory(dir_name, metadata).await?;
+        if !self.caching_enabled {
+            if let Some(ref catalog) = self.catalog {
+                catalog.lock().unwrap().start_directory(c_dir_name)?;
+            }
+            encoder.create_directory(dir_name, metadata).await?;
+        }
 
         let old_fs_magic = self.fs_magic;
         let old_fs_feature_flags = self.fs_feature_flags;
@@ -1189,7 +1299,15 @@ impl Archiver {
         self.fs_feature_flags = old_fs_feature_flags;
         self.current_st_dev = old_st_dev;
 
-        encoder.finish().await?;
+        if !self.caching_enabled {
+            encoder.finish().await?;
+            if let Some(ref catalog) = self.catalog {
+                catalog.lock().unwrap().end_directory()?;
+            }
+        }
+
         result
     }
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (47 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 48/58] fix #3174: client: pxar: enable caching and meta comparison Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-04-05  9:42   ` Fabian Grünbichler
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 50/58] client: backup writer: add injected chunk count to stats Christian Ebner
                   ` (10 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Use double the average chunk size for the metadata archive as compared
to the payload stream. This not only reduces the number of unique
chunks produced by the metadata archive, which chunks rather badly
since changes are mostly many small, localized ones, but also has the
positive side effect of producing larger, well compressible chunks.
The reduced number of chunks further increases access performance,
because of fewer download requests and better cacheability.
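
For example, assuming no chunk size was requested explicitly on the
command line: a split archive then chunks its metadata stream with the
8 MiB average of AVG_METADATA_CHUNK_SIZE, while the payload stream
keeps the chunker default; with an explicit chunk size given, the
metadata stream simply uses double that value.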

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 proxmox-backup-client/src/main.rs | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
index 66dcaa63e..4aad0ff8c 100644
--- a/proxmox-backup-client/src/main.rs
+++ b/proxmox-backup-client/src/main.rs
@@ -78,6 +78,8 @@ pub(crate) use helper::*;
 pub mod key;
 pub mod namespace;
 
+const AVG_METADATA_CHUNK_SIZE: usize = 8 * 1024 * 1024;
+
 fn record_repository(repo: &BackupRepository) {
     let base = match BaseDirectories::with_prefix("proxmox-backup") {
         Ok(v) => v,
@@ -209,7 +211,15 @@ async fn backup_directory<P: AsRef<Path>>(
         payload_target.is_some(),
     )?;
 
-    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size, None);
+    let avg_chunk_size = if payload_stream.is_none() {
+        chunk_size
+    } else {
+        chunk_size
+            .map(|size| 2 * size)
+            .or_else(|| Some(AVG_METADATA_CHUNK_SIZE))
+    };
+
+    let mut chunk_stream = ChunkStream::new(pxar_stream, avg_chunk_size, None);
     let (tx, rx) = mpsc::channel(10); // allow to buffer 10 chunks
 
     let stream = ReceiverStream::new(rx).map_err(Error::from);
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 50/58] client: backup writer: add injected chunk count to stats
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (48 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata Christian Ebner
@ 2024-03-28 12:36 ` Christian Ebner
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output Christian Ebner
                   ` (9 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:36 UTC (permalink / raw)
  To: pbs-devel

Track the number of injected chunks and show it in the debug output.
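
The counter follows the same pattern as the existing ones: an atomic
counter shared between the upload stream closure and the final stats
future. A minimal self-contained illustration of that pattern (names
illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    let injected_chunk_count = Arc::new(AtomicUsize::new(0));
    let injected_chunk_count2 = injected_chunk_count.clone();

    // The stream side bumps the counter whenever known chunks get
    // injected...
    injected_chunk_count.fetch_add(3, Ordering::SeqCst);

    // ...and the stats future later reads it out through the clone.
    assert_eq!(injected_chunk_count2.load(Ordering::SeqCst), 3);
}
```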

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/backup_writer.rs | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
index 032d93da7..828bd7a3e 100644
--- a/pbs-client/src/backup_writer.rs
+++ b/pbs-client/src/backup_writer.rs
@@ -57,6 +57,7 @@ pub struct UploadOptions {
 struct UploadStats {
     chunk_count: usize,
     chunk_reused: usize,
+    chunk_injected: usize,
     size: usize,
     size_reused: usize,
     size_compressed: usize,
@@ -400,6 +401,11 @@ impl BackupWriter {
                 archive,
                 (upload_stats.duration.as_micros()) / (upload_stats.chunk_count as u128)
             );
+            log::debug!(
+                "{}: Injected {} chunks from previous snapshot.",
+                archive,
+                upload_stats.chunk_injected,
+            );
         }
 
         let param = json!({
@@ -646,6 +652,8 @@ impl BackupWriter {
         let total_chunks2 = total_chunks.clone();
         let known_chunk_count = Arc::new(AtomicUsize::new(0));
         let known_chunk_count2 = known_chunk_count.clone();
+        let injected_chunk_count = Arc::new(AtomicUsize::new(0));
+        let injected_chunk_count2 = injected_chunk_count.clone();
 
         let stream_len = Arc::new(AtomicUsize::new(0));
         let stream_len2 = stream_len.clone();
@@ -675,7 +683,9 @@ impl BackupWriter {
             )
             .and_then(move |chunk_info| match chunk_info {
                 InjectedChunksInfo::Known(chunks) => {
-                    total_chunks.fetch_add(chunks.len(), Ordering::SeqCst);
+                    let count = chunks.len();
+                    total_chunks.fetch_add(count, Ordering::SeqCst);
+                    injected_chunk_count.fetch_add(count, Ordering::SeqCst);
                     future::ok(MergedChunkInfo::Known(chunks))
                 }
                 InjectedChunksInfo::Raw((offset, data)) => {
@@ -787,6 +797,7 @@ impl BackupWriter {
                 let duration = start_time.elapsed();
                 let chunk_count = total_chunks2.load(Ordering::SeqCst);
                 let chunk_reused = known_chunk_count2.load(Ordering::SeqCst);
+                let chunk_injected = injected_chunk_count2.load(Ordering::SeqCst);
                 let size = stream_len2.load(Ordering::SeqCst);
                 let size_reused = reused_len2.load(Ordering::SeqCst);
                 let size_compressed = compressed_stream_len2.load(Ordering::SeqCst) as usize;
@@ -797,6 +808,7 @@ impl BackupWriter {
                 futures::future::ok(UploadStats {
                     chunk_count,
                     chunk_reused,
+                    chunk_injected,
                     size,
                     size_reused,
                     size_compressed,
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (49 preceding siblings ...)
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 50/58] client: backup writer: add injected chunk count to stats Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05  9:47   ` Fabian Grünbichler
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 52/58] client: pxar: add entry kind format version Christian Ebner
                   ` (8 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/pxar/create.rs | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index f103127c4..461509c39 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -407,6 +407,14 @@ where
     encoder.finish().await?;
     encoder.close().await?;
 
+    log::info!(
+        "Total injected: {} ({} chunks), total reused payload: {}, padding: {} ({} partial chunks)",
+        HumanByte::from(archiver.total_injected_size),
+        archiver.total_injected_count,
+        HumanByte::from(archiver.total_reused_payload_size),
+        HumanByte::from(archiver.total_injected_size - archiver.total_reused_payload_size),
+        archiver.partial_chunks_count,
+    );
     Ok(())
 }
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 52/58] client: pxar: add entry kind format version
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (50 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams Christian Ebner
                   ` (7 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Cover the match case for the format version entry, introduced to
distinguish between the different file formats used for encoding.

For now, simply ignore it.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/pxar/extract.rs | 1 +
 pbs-client/src/pxar/tools.rs   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/pbs-client/src/pxar/extract.rs b/pbs-client/src/pxar/extract.rs
index 5f5ac6188..56f8d7adc 100644
--- a/pbs-client/src/pxar/extract.rs
+++ b/pbs-client/src/pxar/extract.rs
@@ -267,6 +267,7 @@ where
         };
 
         let extract_res = match (did_match, entry.kind()) {
+            (_, EntryKind::Version(_)) => Ok(()),
             (_, EntryKind::Directory) => {
                 self.callback(entry.path());
 
diff --git a/pbs-client/src/pxar/tools.rs b/pbs-client/src/pxar/tools.rs
index 459951d50..4e9bd5b60 100644
--- a/pbs-client/src/pxar/tools.rs
+++ b/pbs-client/src/pxar/tools.rs
@@ -172,6 +172,7 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
     let meta = entry.metadata();
 
     let (size, link, type_name, payload_offset) = match entry.kind() {
+        EntryKind::Version(version) => (format!("{version:?}"), String::new(), "version", None),
         EntryKind::File {
             size,
             payload_offset,
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (51 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 52/58] client: pxar: add entry kind format version Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05  9:49   ` Fabian Grünbichler
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection Christian Ebner
                   ` (6 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Instead of encoding the pxar cli exclude patterns as a regular file
within the root directory of an archive, store this information
directly after the pxar format version entry in a new pxar cli params
entry.

This behaviour is however currently exclusive to archives written with
format version 2, i.e. the split metadata and payload case.

This is a breaking change for the encoding of new cli exclude
parameters: any new exclude parameter will not be added to an already
present .pxarexclude-cli file, and that file will not be created if
not present.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 pbs-client/src/pxar/create.rs             | 25 +++++++++++++++--------
 pbs-client/src/pxar/extract.rs            |  3 ++-
 pbs-client/src/pxar/tools.rs              |  6 ++++++
 src/tape/file_formats/snapshot_archive.rs |  8 ++++++--
 4 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
index 461509c39..5f2270fe8 100644
--- a/pbs-client/src/pxar/create.rs
+++ b/pbs-client/src/pxar/create.rs
@@ -343,13 +343,6 @@ where
         set.insert(stat.st_dev);
     }
 
-    let mut encoder = Encoder::new(
-        &mut writers.writer,
-        &metadata,
-        writers.payload_writer.as_mut(),
-    )
-    .await?;
-
     let mut patterns = options.patterns;
 
     if options.skip_lost_and_found {
@@ -359,6 +352,14 @@ where
             MatchType::Exclude,
         )?);
     }
+
+    let cli_params_content = generate_pxar_excludes_cli(&patterns[..]);
+    let cli_params = if options.previous_ref.is_some() {
+        Some(cli_params_content.as_slice())
+    } else {
+        None
+    };
+
     let (previous_payload_index, previous_metadata_accessor) =
         if let Some(refs) = options.previous_ref {
             (
@@ -369,6 +370,14 @@ where
             (None, None)
         };
 
+    let mut encoder = Encoder::new(
+        &mut writers.writer,
+        &metadata,
+        writers.payload_writer.as_mut(),
+        cli_params,
+    )
+    .await?;
+
     let mut archiver = Archiver {
         feature_flags,
         fs_feature_flags,
@@ -454,7 +463,7 @@ impl Archiver {
 
             let mut file_list = self.generate_directory_file_list(&mut dir, is_root)?;
 
-            if is_root && old_patterns_count > 0 {
+            if is_root && old_patterns_count > 0 && previous_metadata_accessor.is_none() {
                 file_list.push(FileListEntry {
                     name: CString::new(".pxarexclude-cli").unwrap(),
                     path: PathBuf::new(),
diff --git a/pbs-client/src/pxar/extract.rs b/pbs-client/src/pxar/extract.rs
index 56f8d7adc..46ff8fc80 100644
--- a/pbs-client/src/pxar/extract.rs
+++ b/pbs-client/src/pxar/extract.rs
@@ -267,7 +267,8 @@ where
         };
 
         let extract_res = match (did_match, entry.kind()) {
-            (_, EntryKind::Version(_)) => Ok(()),
+            (_, EntryKind::Version(_version)) => Ok(()),
+            (_, EntryKind::CliParams(_data)) => Ok(()),
             (_, EntryKind::Directory) => {
                 self.callback(entry.path());
 
diff --git a/pbs-client/src/pxar/tools.rs b/pbs-client/src/pxar/tools.rs
index 4e9bd5b60..478acdc0f 100644
--- a/pbs-client/src/pxar/tools.rs
+++ b/pbs-client/src/pxar/tools.rs
@@ -173,6 +173,12 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
 
     let (size, link, type_name, payload_offset) = match entry.kind() {
         EntryKind::Version(version) => (format!("{version:?}"), String::new(), "version", None),
+        EntryKind::CliParams(params) => (
+            "0".to_string(),
+            format!(" -> {:?}", params.as_os_str()),
+            "symlink",
+            None,
+        ),
         EntryKind::File {
             size,
             payload_offset,
diff --git a/src/tape/file_formats/snapshot_archive.rs b/src/tape/file_formats/snapshot_archive.rs
index 43d1cf9c3..7e052919b 100644
--- a/src/tape/file_formats/snapshot_archive.rs
+++ b/src/tape/file_formats/snapshot_archive.rs
@@ -58,8 +58,12 @@ pub fn tape_write_snapshot_archive<'a>(
             ));
         }
 
-        let mut encoder =
-            pxar::encoder::sync::Encoder::new(PxarTapeWriter::new(writer), &root_metadata, None)?;
+        let mut encoder = pxar::encoder::sync::Encoder::new(
+            PxarTapeWriter::new(writer),
+            &root_metadata,
+            None,
+            None,
+        )?;
 
         for filename in file_list.iter() {
             let mut file = snapshot_reader.open_file(filename).map_err(|err| {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (52 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05 10:16   ` Fabian Grünbichler
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files Christian Ebner
                   ` (5 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

A high-level flow chart describing the logic used for the
metadata-based file change detection mode.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 ...ow-chart-metadata-based-file-change-detection.svg |  1 +
 ...ow-chart-metadata-based-file-change-detection.txt | 12 ++++++++++++
 2 files changed, 13 insertions(+)
 create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
 create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt

diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
new file mode 100644
index 000000000..5e6df4815
--- /dev/null
+++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
@@ -0,0 +1 @@
+<svg aria-roledescription="flowchart-v2" role="graphics-document document" viewBox="-8 -8 1339.7958984375 620" style="max-width: 100%;" xmlns="http://www.w3.org/2000/svg" width="100%" id="graph-div" height="100%" xmlns:xlink="http://www.w3.org/1999/xlink"><style>@import url("https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.1/css/all.min.css");'</style><style>#graph-div{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}#graph-div .error-icon{fill:#552222;}#graph-div .error-text{fill:#552222;stroke:#552222;}#graph-div .edge-thickness-normal{stroke-width:2px;}#graph-div .edge-thickness-thick{stroke-width:3.5px;}#graph-div .edge-pattern-solid{stroke-dasharray:0;}#graph-div .edge-pattern-dashed{stroke-dasharray:3;}#graph-div .edge-pattern-dotted{stroke-dasharray:2;}#graph-div .marker{fill:#333333;stroke:#333333;}#graph-div .marker.cross{stroke:#333333;}#graph-div svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#graph-div .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#graph-div .cluster-label text{fill:#333;}#graph-div .cluster-label span,#graph-div p{color:#333;}#graph-div .label text,#graph-div span,#graph-div p{fill:#333;color:#333;}#graph-div .node rect,#graph-div .node circle,#graph-div .node ellipse,#graph-div .node polygon,#graph-div .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#graph-div .flowchart-label text{text-anchor:middle;}#graph-div .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#graph-div .node .label{text-align:center;}#graph-div .node.clickable{cursor:pointer;}#graph-div .arrowheadPath{fill:#333333;}#graph-div .edgePath .path{stroke:#333333;stroke-width:2.0px;}#graph-div .flowchart-link{stroke:#333333;fill:none;}#graph-div .edgeLabel{background-color:#e8e8e8;text-align:center;}#graph-div .edgeLabel rect{opacity:0.5;background-color:#e8e8e8;fill:#e8e8e8;}#graph-div .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#graph-div .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#graph-div .cluster text{fill:#333;}#graph-div .cluster span,#graph-div p{color:#333;}#graph-div div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#graph-div .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#graph-div :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;}</style><g><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="5" refX="6" viewBox="0 0 10 10" class="marker flowchart" id="graph-div_flowchart-pointEnd"><path style="stroke-width: 1px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" d="M 0 0 L 10 5 L 0 10 z"></path></marker><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="5" refX="4.5" viewBox="0 0 10 10" class="marker flowchart" id="graph-div_flowchart-pointStart"><path style="stroke-width: 1px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" d="M 0 5 L 10 10 L 10 0 z"></path></marker><marker orient="auto" markerHeight="11" markerWidth="11" markerUnits="userSpaceOnUse" refY="5" refX="11" viewBox="0 0 10 10" class="marker flowchart" id="graph-div_flowchart-circleEnd"><circle style="stroke-width: 1px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" r="5" cy="5" cx="5"></circle></marker><marker orient="auto" markerHeight="11" 
markerWidth="11" markerUnits="userSpaceOnUse" refY="5" refX="-1" viewBox="0 0 10 10" class="marker flowchart" id="graph-div_flowchart-circleStart"><circle style="stroke-width: 1px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" r="5" cy="5" cx="5"></circle></marker><marker orient="auto" markerHeight="11" markerWidth="11" markerUnits="userSpaceOnUse" refY="5.2" refX="12" viewBox="0 0 11 11" class="marker cross flowchart" id="graph-div_flowchart-crossEnd"><path style="stroke-width: 2px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" d="M 1,1 l 9,9 M 10,1 l -9,9"></path></marker><marker orient="auto" markerHeight="11" markerWidth="11" markerUnits="userSpaceOnUse" refY="5.2" refX="-1" viewBox="0 0 11 11" class="marker cross flowchart" id="graph-div_flowchart-crossStart"><path style="stroke-width: 2px; stroke-dasharray: 1px, 0px;" class="arrowMarkerPath" d="M 1,1 l 9,9 M 10,1 l -9,9"></path></marker><g class="root"><g class="clusters"></g><g class="edgePaths"><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-A LE-B" id="L-A-B-0" d="M612.438,23.106L522.374,31.922C432.311,40.737,252.185,58.369,162.122,72.468C72.058,86.567,72.058,97.133,72.058,102.417L72.058,107.7"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-B LE-C" id="L-B-C-0" d="M112.458,141.679L147.172,149.566C181.885,157.453,251.311,173.226,286.024,186.396C320.738,199.567,320.738,210.133,320.738,215.417L320.738,220.7"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-C LE-D" id="L-C-D-0" d="M320.738,265L320.738,271.167C320.738,277.333,320.738,289.667,320.738,301.117C320.738,312.567,320.738,323.133,320.738,328.417L320.738,333.7"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-D LE-E" id="L-D-E-0" d="M320.738,378L320.738,384.167C320.738,390.333,320.738,402.667,324.098,414.249C327.458,425.832,334.179,436.664,337.539,442.08L340.899,447.496"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-B LE-F" id="L-B-F-0" d="M72.058,152L72.058,158.167C72.058,164.333,72.058,176.667,72.058,192.25C72.058,207.833,72.058,226.667,72.058,245.5C72.058,264.333,72.058,283.167,72.058,302C72.058,320.833,72.058,339.667,72.058,358.5C72.058,377.333,72.058,396.167,72.058,415C72.058,433.833,72.058,452.667,72.058,471.5C72.058,490.333,72.058,509.167,116.254,524.63C160.45,540.094,248.843,552.188,293.039,558.235L337.235,564.282"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-F LE-A" id="L-F-A-0" d="M627.905,565L673.094,558.833C718.284,552.667,808.663,540.333,853.852,524.75C899.042,509.167,899.042,490.333,899.042,471.5C899.042,452.667,899.042,433.833,899.042,415C899.042,396.167,899.042,377.333,899.042,358.5C899.042,339.667,899.042,320.833,899.042,302C899.042,283.167,899.042,264.333,899.042,245.5C899.042,226.667,899.042,207.833,899.042,189C899.042,170.167,899.042,151.333,899.042,132.5C899.042,113.667,899.042,94.833,864.416,77.584C829.791,60.335,760.541,44.669,725.915,36.836L691.29,29.004"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal 
edge-pattern-solid flowchart-link LS-E LE-F" id="L-E-F-0" d="M355.792,491L355.792,497.167C355.792,503.333,355.792,515.667,369.086,527.646C382.38,539.626,408.967,551.251,422.261,557.064L435.555,562.877"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-E LE-A" id="L-E-A-0" d="M420.217,459.097L458.394,451.748C496.571,444.398,572.925,429.699,611.102,412.933C649.279,396.167,649.279,377.333,649.279,358.5C649.279,339.667,649.279,320.833,649.279,302C649.279,283.167,649.279,264.333,649.279,245.5C649.279,226.667,649.279,207.833,649.279,189C649.279,170.167,649.279,151.333,649.279,132.5C649.279,113.667,649.279,94.833,649.279,80.133C649.279,65.433,649.279,54.867,649.279,49.583L649.279,44.3"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-E LE-G" id="L-E-G-0" d="M420.217,477.1L517.811,485.583C615.406,494.067,810.594,511.033,913.519,525.047C1016.444,539.061,1027.105,550.123,1032.435,555.653L1037.766,561.184"></path><path marker-end="url(#graph-div_flowchart-pointEnd)" style="fill:none;" class="edge-thickness-normal edge-pattern-solid flowchart-link LS-G LE-A" id="L-G-A-0" d="M1078.658,565L1084.483,558.833C1090.308,552.667,1101.958,540.333,1107.783,524.75C1113.608,509.167,1113.608,490.333,1113.608,471.5C1113.608,452.667,1113.608,433.833,1113.608,415C1113.608,396.167,1113.608,377.333,1113.608,358.5C1113.608,339.667,1113.608,320.833,1113.608,302C1113.608,283.167,1113.608,264.333,1113.608,245.5C1113.608,226.667,1113.608,207.833,1113.608,189C1113.608,170.167,1113.608,151.333,1113.608,132.5C1113.608,113.667,1113.608,94.833,1043.237,76.854C972.866,58.874,832.124,41.749,761.753,33.186L691.382,24.623"></path></g><g class="edgeLabels"><g transform="translate(72.05833435058594, 76)" class="edgeLabel"><g transform="translate(-59.166664123535156, -12)" class="label"><foreignObject height="24" width="118.33332824707031"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">lookup metadata</span></div></foreignObject></g></g><g transform="translate(320.7375030517578, 189)" class="edgeLabel"><g transform="translate(-58.708335876464844, -12)" class="label"><foreignObject height="24" width="117.41667175292969"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">is reusable entry</span></div></foreignObject></g></g><g transform="translate(320.7375030517578, 302)" class="edgeLabel"><g transform="translate(-84.06666564941406, -12)" class="label"><foreignObject height="24" width="168.13333129882812"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">lookup reusable chunks</span></div></foreignObject></g></g><g transform="translate(320.7375030517578, 415)" class="edgeLabel"><g transform="translate(-136.5500030517578, -12)" class="label"><foreignObject height="24" width="273.1000061035156"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">insert and deduplicate dynamic entries</span></div></foreignObject></g></g><g transform="translate(72.05833435058594, 358.5)" class="edgeLabel"><g transform="translate(-72.05833435058594, -12)" class="label"><foreignObject height="24" width="144.11666870117188"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; 
white-space: nowrap;"><span class="edgeLabel">is not reusable entry</span></div></foreignObject></g></g><g transform="translate(899.0416793823242, 302)" class="edgeLabel"><g transform="translate(-59.599998474121094, -12)" class="label"><foreignObject height="24" width="119.19999694824219"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">caching disabled</span></div></foreignObject></g></g><g transform="translate(355.7916717529297, 528)" class="edgeLabel"><g transform="translate(-238.43333435058594, -12)" class="label"><foreignObject height="24" width="476.8666687011719"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">padding above threshold, non-continuous chunks, caching disabled</span></div></foreignObject></g></g><g transform="translate(649.2791748046875, 245.5)" class="edgeLabel"><g transform="translate(-221.0916748046875, -12)" class="label"><foreignObject height="24" width="442.183349609375"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">padding above threshold, chunks continuous, caching enabled</span></div></foreignObject></g></g><g transform="translate(1005.7833480834961, 528)" class="edgeLabel"><g transform="translate(-86.74166870117188, -12)" class="label"><foreignObject height="24" width="173.48333740234375"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">padding below threshold</span></div></foreignObject></g></g><g transform="translate(1113.6083526611328, 302)" class="edgeLabel"><g transform="translate(-58.275001525878906, -12)" class="label"><foreignObject height="24" width="116.55000305175781"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="edgeLabel">caching enabled</span></div></foreignObject></g></g></g><g class="nodes"><g transform="translate(649.2791748046875, 19.5)" data-id="A" data-node="true" id="flowchart-A-12299" class="node default default flowchart-label"><rect height="39" width="73.68333435058594" y="-19.5" x="-36.84166717529297" ry="0" rx="0" style="" class="basic label-container"></rect><g transform="translate(-29.34166717529297, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="58.68333435058594"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">Archiver</span></div></foreignObject></g></g><g transform="translate(72.05833435058594, 132.5)" data-id="B" data-node="true" id="flowchart-B-12300" class="node default default flowchart-label"><rect height="39" width="80.80000305175781" y="-19.5" x="-40.400001525878906" ry="0" rx="0" style="" class="basic label-container"></rect><g transform="translate(-32.900001525878906, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="65.80000305175781"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">Accessor</span></div></foreignObject></g></g><g transform="translate(320.7375030517578, 245.5)" data-id="C" data-node="true" id="flowchart-C-12302" class="node default default flowchart-label"><rect height="39" width="144.89999389648438" y="-19.5" x="-72.44999694824219" ry="0" rx="0" style="" class="basic label-container"></rect><g transform="translate(-64.94999694824219, -12)" style="" 
class="label"><rect></rect><foreignObject height="24" width="129.89999389648438"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">Lookahead Cache</span></div></foreignObject></g></g><g transform="translate(320.7375030517578, 358.5)" data-id="D" data-node="true" id="flowchart-D-12304" class="node default default flowchart-label"><rect height="39" width="120.83332824707031" y="-19.5" x="-60.416664123535156" ry="0" rx="0" style="" class="basic label-container"></rect><g transform="translate(-52.916664123535156, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="105.83332824707031"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">Dynamic Index</span></div></foreignObject></g></g><g transform="translate(355.7916717529297, 471.5)" data-id="E" data-node="true" id="flowchart-E-12306" class="node default default flowchart-label"><rect height="39" width="128.85000610351562" y="-19.5" x="-64.42500305175781" ry="0" rx="0" style="" class="basic label-container"></rect><g transform="translate(-56.92500305175781, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="113.85000610351562"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">Reused Chunks</span></div></foreignObject></g></g><g transform="translate(485.00833892822266, 584.5)" data-id="F" data-node="true" id="flowchart-F-12308" class="node default default flowchart-label"><rect height="39" width="321.04998779296875" y="-19.5" x="-160.52499389648438" ry="5" rx="5" style="" class="basic label-container"></rect><g transform="translate(-153.02499389648438, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="306.04998779296875"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">re-encode cached entries and current entry</span></div></foreignObject></g></g><g transform="translate(1060.2375183105469, 584.5)" data-id="G" data-node="true" id="flowchart-G-12316" class="node default default flowchart-label"><rect height="39" width="527.11669921875" y="-19.5" x="-263.558349609375" ry="5" rx="5" style="" class="basic label-container"></rect><g transform="translate(-256.058349609375, -12)" style="" class="label"><rect></rect><foreignObject height="24" width="512.11669921875"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; white-space: nowrap;"><span class="nodeLabel">force boundary, inject chunks, keepback last chunk for potential followup</span></div></foreignObject></g></g></g></g></g></svg>
\ No newline at end of file
diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
new file mode 100644
index 000000000..5eace70be
--- /dev/null
+++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
@@ -0,0 +1,12 @@
+flowchart TD
+    A[Archiver] -->|lookup metadata| B[Accessor]
+    B -->|is reusable entry| C[Lookahead Cache]
+    C -->|lookup reusable chunks| D[Dynamic Index]
+    D -->|insert and deduplicate dynamic entries| E[Reused Chunks]
+    B -->|is not reusable entry| F(re-encode cached entries and current entry)
+    F -->|caching disabled| A
+    E -->|padding above threshold, non-continuous chunks, caching disabled| F
+    E -->|padding above threshold, chunks continuous, caching enabled| A
+    E -->|padding below threshold| G(force boundary, inject chunks, keepback last chunk for potential followup)
+    G -->|caching enabled| A
+
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (53 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05 10:26   ` Fabian Grünbichler
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode Christian Ebner
                   ` (4 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 docs/file-formats.rst         | 32 ++++++++++++++++++++++
 docs/meta-format-overview.dot | 50 +++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)
 create mode 100644 docs/meta-format-overview.dot

diff --git a/docs/file-formats.rst b/docs/file-formats.rst
index 43ecfefce..292660579 100644
--- a/docs/file-formats.rst
+++ b/docs/file-formats.rst
@@ -8,6 +8,38 @@ Proxmox File Archive Format (``.pxar``)
 
 .. graphviz:: pxar-format-overview.dot
 
+.. _pxar-meta-format:
+
+Proxmox File Archive Format - Meta (``.mpxar``)
+-----------------------------------------------
+
+.. graphviz:: meta-format-overview.dot
+
+.. _ppxar-format:
+
+Proxmox File Archive Format - Payload (``.ppxar``)
+--------------------------------------------------
+
+The pxar payload file contains a concatenation of regular file payloads,
+each prefixed by a `PAYLOAD` header. Further, an entry can be followed by
+padding after the actual payload, if a referenced chunk was not fully
+reused:
+
+.. list-table::
+   :widths: auto
+
+   * - ``PAYLOAD_START_MARKER``
+     - ``[u8; 16]``
+   * - ``PAYLOAD``
+     - ``header with [u8; 16]``
+   * - ``Payload``
+     - ``raw regular file payload``
+   * - ``Padding``
+     - ``none if chunk fully reused``
+   * - ``...``
+     - ``Further list of header, payload and padding``
+   * - ``PAYLOAD_TAIL_MARKER``
+     - ``[u8; 16]``
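+
+As an illustrative sketch (an editorial example, not part of the on-disk
+format specification), a reader which already obtained the ``offset`` and
+``size`` of a ``PAYLOAD_REF`` entry from the metadata archive could fetch the
+raw file contents as follows, assuming the offset points at the 16 byte
+``PAYLOAD`` header and the size covers the raw payload only:
+
+.. code-block:: rust
+
+    use std::io::{Read, Seek, SeekFrom};
+
+    // header type and content size, 8 bytes each
+    const HEADER_SIZE: u64 = 16;
+
+    fn read_payload<R: Read + Seek>(
+        ppxar: &mut R,
+        offset: u64,
+        size: u64,
+    ) -> std::io::Result<Vec<u8>> {
+        // skip the PAYLOAD header prefixing the raw file contents
+        ppxar.seek(SeekFrom::Start(offset + HEADER_SIZE))?;
+        let mut buf = vec![0u8; size as usize];
+        ppxar.read_exact(&mut buf)?;
+        // possible padding after the payload is simply never read
+        Ok(buf)
+    }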
 
 .. _data-blob-format:
 
diff --git a/docs/meta-format-overview.dot b/docs/meta-format-overview.dot
new file mode 100644
index 000000000..c3e4869b3
--- /dev/null
+++ b/docs/meta-format-overview.dot
@@ -0,0 +1,50 @@
+digraph g {
+graph [
+rankdir = "LR"
+fontname="Helvetica"
+];
+node [
+fontsize = "16"
+shape = "record"
+];
+edge [
+];
+
+"archive" [
+label = "archive.mpxar"
+shape = "record"
+];
+
+"rootdir" [
+label = "<f0> FORMAT_VERSION | CLI_PARAMS | ENTRY| \{XATTR\}\* extended attribute list\l | \{ACL_USER\}\* USER ACL entries\l | \{ACL_GROUP\}\* GROUP ACL entries\l| \[ACL_GROUP_OBJ\] the ACL_GROUP_OBJ \l| \[ACL_DEFAULT\] the various default ACL fields\l|\{ACL_DEFAULT_USER\}\* USER ACL entries\l|\{ACL_DEFAULT_GROUP\}\* GROUP ACL entries\l|\[FCAPS\] file capability in Linux disk format\l|\[QUOTA_PROJECT_ID\] the ext4/xfs quota project ID\l| { <pl> PAYLOAD_REF  | SYMLINK | DEVICE | { <de> \{DirectoryEntries\}\* | GOODBYE}}"
+shape = "record"
+];
+
+
+"entry" [
+label = "<f0> size: u64 = 64\l|type: u64 = ENTRY\l|feature_flags: u64\l|mode: u64\l|flags: u64\l|uid: u64\l|gid: u64\l|mtime: u64\l"
+labeljust = "l"
+shape = "record"
+];
+
+
+
+"direntry" [
+label = "<f0> FILENAME |{ENTRY | HARDLINK}"
+shape = "record"
+];
+
+"payloadrefentry" [
+label = "<f0> offset: u64 \l| size: u64\l"
+shape = "record"
+];
+
+"archive" -> "rootdir":f0
+
+"rootdir":f0 -> "entry":f0
+
+"rootdir":de -> "direntry":f0
+
+"rootdir":pl -> "payloadrefentry":f0
+
+}
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (54 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05 11:22   ` Fabian Grünbichler
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 57/58] test-suite: add detection mode change benchmark Christian Ebner
                   ` (3 subsequent siblings)
  59 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Describe the motivation and basic principle of the client's change
detection mode and show an example invocation.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- not present in previous version

 docs/backup-client.rst | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/docs/backup-client.rst b/docs/backup-client.rst
index 00a1abbb3..e54a52bf4 100644
--- a/docs/backup-client.rst
+++ b/docs/backup-client.rst
@@ -280,6 +280,39 @@ Multiple paths can be excluded like this:
 
     # proxmox-backup-client backup.pxar:./linux --exclude=/usr --exclude=/rust
 
+.. _client_change_detection_mode:
+
+Change detection mode
+~~~~~~~~~~~~~~~~~~~~~
+
+Backing up filesystems with large contents can take a long time, as the default
+behaviour of the Proxmox Backup Client is to read all data and re-encode it
+into chunks. For some use cases where files do not change frequently, this is
+neither feasible nor desirable.
+
+In order to instruct the client to not re-encode files with unchanged metadata,
+the `change-detection-mode` can be set from the default `data` to `metadata`.
+With this mode, regular file payloads for files with unchanged metadata are
+looked up and re-used from the previous backup run's snapshot when possible.
+For this to be feasible, the pxar archives for backup runs using this mode are
+split into two separate files, the `mpxar` containing the archive's metadata
+and the `ppxar` containing a concatenation of the file payloads.
+
+During backup, the current file metadata is compared to the one looked up in
+the previous `mpxar` archive, and if unchanged, the payload of the file is
+included in the current backup by referencing the indices of the previous
+snapshot. The increase in backup speed comes at the cost of a possible increase
+in used space, as chunks might only be partially reused and then contain
+unneeded padding. This is however minimized by selectively re-encoding files
+where the padding overhead does not justify reuse.
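+
+For example (numbers purely illustrative): if an unchanged file of 1 MiB is the
+only re-usable content within an otherwise unused 4 MiB chunk, referencing that
+chunk would introduce 3 MiB of padding, so the file is re-encoded instead.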
+
+The following shows an example of the client invocation with the `metadata`
+mode:
+
+.. code-block:: console
+
+    # proxmox-backup-client backup.pxar:./linux --change-detection-mode=metadata
+
 .. _client_encryption:
 
 Encryption
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 57/58] test-suite: add detection mode change benchmark
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (55 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 58/58] test-suite: add bin to deb, add shell completions Christian Ebner
                   ` (2 subsequent siblings)
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Introduces the proxmox-backup-test-suite crate, intended for
benchmarking and high-level, user-facing testing.
The initial code includes a benchmark intended for regression testing of
the proxmox-backup-client when using different file detection modes
during backup.
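
A rough usage sketch (subcommand names taken from the CLI definitions
below, paths purely illustrative):

    # proxmox-backup-test-suite detection-mode-bench prepare --target /path/to/testdata
    # proxmox-backup-test-suite detection-mode-bench run testdata.pxar:/path/to/testdata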

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes
- might be dropped in a future version and picked up later

 Cargo.toml                                    |   1 +
 proxmox-backup-test-suite/Cargo.toml          |  18 ++
 .../src/detection_mode_bench.rs               | 294 ++++++++++++++++++
 proxmox-backup-test-suite/src/main.rs         |  17 +
 4 files changed, 330 insertions(+)
 create mode 100644 proxmox-backup-test-suite/Cargo.toml
 create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
 create mode 100644 proxmox-backup-test-suite/src/main.rs

diff --git a/Cargo.toml b/Cargo.toml
index 4616e4768..42086118b 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -45,6 +45,7 @@ members = [
     "proxmox-restore-daemon",
 
     "pxar-bin",
+    "proxmox-backup-test-suite",
 ]
 
 [lib]
diff --git a/proxmox-backup-test-suite/Cargo.toml b/proxmox-backup-test-suite/Cargo.toml
new file mode 100644
index 000000000..3f899e9bc
--- /dev/null
+++ b/proxmox-backup-test-suite/Cargo.toml
@@ -0,0 +1,18 @@
+[package]
+name = "proxmox-backup-test-suite"
+version = "0.1.0"
+authors.workspace = true
+edition.workspace = true
+
+[dependencies]
+anyhow.workspace = true
+futures.workspace = true
+serde.workspace = true
+serde_json.workspace = true
+
+pbs-client.workspace = true
+pbs-key-config.workspace = true
+pbs-tools.workspace = true
+proxmox-async.workspace = true
+proxmox-router = { workspace = true, features = ["cli"] }
+proxmox-schema = { workspace = true, features = [ "api-macro" ] }
diff --git a/proxmox-backup-test-suite/src/detection_mode_bench.rs b/proxmox-backup-test-suite/src/detection_mode_bench.rs
new file mode 100644
index 000000000..9a3c76802
--- /dev/null
+++ b/proxmox-backup-test-suite/src/detection_mode_bench.rs
@@ -0,0 +1,294 @@
+use std::path::Path;
+use std::process::Command;
+use std::{thread, time};
+
+use anyhow::{bail, format_err, Error};
+use serde_json::Value;
+
+use pbs_client::{
+    tools::{complete_repository, key_source::KEYFILE_SCHEMA, REPO_URL_SCHEMA},
+    BACKUP_SOURCE_SCHEMA,
+};
+use pbs_tools::json;
+use proxmox_router::cli::*;
+use proxmox_schema::api;
+
+const DEFAULT_NUMBER_OF_RUNS: u64 = 5;
+// Homepage https://cocodataset.org/
+const COCO_DATASET_SRC_URL: &str = "http://images.cocodataset.org/zips/unlabeled2017.zip";
+// Homepage https://kernel.org/
+const LINUX_GIT_REPOSITORY: &str =
+    "git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git";
+const LINUX_GIT_TAG: &str = "v6.5.5";
+
+pub(crate) fn detection_mode_bench_mgtm_cli() -> CliCommandMap {
+    let run_cmd_def = CliCommand::new(&API_METHOD_DETECTION_MODE_BENCH_RUN)
+        .arg_param(&["backupspec"])
+        .completion_cb("repository", complete_repository)
+        .completion_cb("keyfile", complete_file_name);
+
+    let prepare_cmd_def = CliCommand::new(&API_METHOD_DETECTION_MODE_BENCH_PREPARE);
+    CliCommandMap::new()
+        .insert("prepare", prepare_cmd_def)
+        .insert("run", run_cmd_def)
+}
+
+#[api(
+   input: {
+       properties: {
+           backupspec: {
+               type: Array,
+               description: "List of backup source specifications ([<label.ext>:<path>] ...)",
+               items: {
+                   schema: BACKUP_SOURCE_SCHEMA,
+               }
+           },
+           repository: {
+               schema: REPO_URL_SCHEMA,
+               optional: true,
+           },
+           keyfile: {
+               schema: KEYFILE_SCHEMA,
+               optional: true,
+           },
+           "number-of-runs": {
+               description: "Number of times to repeat the run",
+               type: Integer,
+               optional: true,
+           },
+       }
+   }
+)]
+/// Run benchmark to compare performance for backups using different change detection modes.
+fn detection_mode_bench_run(param: Value) -> Result<(), Error> {
+    let mut pbc = Command::new("proxmox-backup-client");
+    pbc.arg("backup");
+
+    let backupspec_list = json::required_array_param(&param, "backupspec")?;
+    for backupspec in backupspec_list {
+        let arg = backupspec
+            .as_str()
+            .ok_or_else(|| format_err!("failed to parse backupspec"))?;
+        pbc.arg(arg);
+    }
+
+    if let Some(repo) = param["repository"].as_str() {
+        pbc.arg("--repository");
+        pbc.arg::<&str>(repo);
+    }
+
+    if let Some(keyfile) = param["keyfile"].as_str() {
+        pbc.arg("--keyfile");
+        pbc.arg::<&str>(keyfile);
+    }
+
+    let number_of_runs = match param["number_of_runs"].as_u64() {
+        Some(n) => n,
+        None => DEFAULT_NUMBER_OF_RUNS,
+    };
+    if number_of_runs < 1 {
+        bail!("Number of runs must be greater than 1, aborting.");
+    }
+
+    // The first run is a warm-up run to make sure all chunks are present already and to reduce
+    // side effects from filesystem caches etc.
+    let _stats_initial = do_run(&mut pbc, 1)?;
+
+    println!("\nStarting benchmarking backups with regular detection mode...\n");
+    let stats_reg = do_run(&mut pbc, number_of_runs)?;
+
+    // Make sure to have a valid reference with catalog format version 2
+    pbc.arg("--change-detection-mode=metadata");
+    let _stats_initial = do_run(&mut pbc, 1)?;
+
+    println!("\nStarting benchmarking backups with metadata detection mode...\n");
+    let stats_meta = do_run(&mut pbc, number_of_runs)?;
+
+    println!("\nCompleted benchmark with {number_of_runs} runs for each tested mode.");
+    println!("\nCompleted regular backup with:");
+    println!("Total runtime: {:.2} s", stats_reg.total);
+    println!("Average: {:.2} ± {:.2} s", stats_reg.avg, stats_reg.stddev);
+    println!("Min: {:.2} s", stats_reg.min);
+    println!("Max: {:.2} s", stats_reg.max);
+
+    println!("\nCompleted metadata detection mode backup with:");
+    println!("Total runtime: {:.2} s", stats_meta.total);
+    println!(
+        "Average: {:.2} ± {:.2} s",
+        stats_meta.avg, stats_meta.stddev
+    );
+    println!("Min: {:.2} s", stats_meta.min);
+    println!("Max: {:.2} s", stats_meta.max);
+
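+    // standard deviation of the difference of two independent means: sqrt(s_meta^2 + s_reg^2)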
+    let diff_stddev =
+        ((stats_meta.stddev * stats_meta.stddev) + (stats_reg.stddev * stats_reg.stddev)).sqrt();
+    println!("\nDifferences (metadata based - regular):");
+    println!(
+        "Delta total runtime: {:.2} s ({:.2} %)",
+        stats_meta.total - stats_reg.total,
+        100.0 * (stats_meta.total / stats_reg.total - 1.0),
+    );
+    println!(
+        "Delta average: {:.2} ± {:.2} s ({:.2} %)",
+        stats_meta.avg - stats_reg.avg,
+        diff_stddev,
+        100.0 * (stats_meta.avg / stats_reg.avg - 1.0),
+    );
+    println!(
+        "Delta min: {:.2} s ({:.2} %)",
+        stats_meta.min - stats_reg.min,
+        100.0 * (stats_meta.min / stats_reg.min - 1.0),
+    );
+    println!(
+        "Delta max: {:.2} s ({:.2} %)",
+        stats_meta.max - stats_reg.max,
+        100.0 * (stats_meta.max / stats_reg.max - 1.0),
+    );
+
+    Ok(())
+}
+
+fn do_run(cmd: &mut Command, n_runs: u64) -> Result<Statistics, Error> {
+    // Avoid collisions of consecutive snapshot timestamps
+    thread::sleep(time::Duration::from_millis(1000));
+    let mut timings = Vec::with_capacity(n_runs as usize);
+    for iteration in 1..=n_runs {
+        let start = std::time::SystemTime::now();
+        let mut child = cmd.spawn()?;
+        let exit_code = child.wait()?;
+        let elapsed = start.elapsed()?;
+        timings.push(elapsed);
+        if !exit_code.success() {
+            bail!("Run number {iteration} of {n_runs} failed, aborting.");
+        }
+    }
+
+    Ok(statistics(timings))
+}
+
+struct Statistics {
+    total: f64,
+    avg: f64,
+    stddev: f64,
+    min: f64,
+    max: f64,
+}
+
+fn statistics(timings: Vec<std::time::Duration>) -> Statistics {
+    let total = timings
+        .iter()
+        .fold(0f64, |sum, time| sum + time.as_secs_f64());
+    let avg = total / timings.len() as f64;
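+    // sample variance with Bessel's correction (n - 1), hence at least two timed runs are needed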
+    let var = 1f64 / (timings.len() - 1) as f64
+        * timings.iter().fold(0f64, |sq_sum, time| {
+            let diff = time.as_secs_f64() - avg;
+            sq_sum + diff * diff
+        });
+    let stddev = var.sqrt();
+    let min = timings.iter().min().unwrap().as_secs_f64();
+    let max = timings.iter().max().unwrap().as_secs_f64();
+
+    Statistics {
+        total,
+        avg,
+        stddev,
+        min,
+        max,
+    }
+}
+
+#[api(
+    input: {
+        properties: {
+            target: {
+                description: "target path to prepare test data.",
+            },
+        },
+    },
+)]
+/// Prepare files required for detection mode backup benchmarks.
+fn detection_mode_bench_prepare(target: String) -> Result<(), Error> {
+    let linux_repo_target = format!("{target}/linux");
+    let coco_dataset_target = format!("{target}/coco");
+    git_clone(LINUX_GIT_REPOSITORY, linux_repo_target.as_str())?;
+    git_checkout(LINUX_GIT_TAG, linux_repo_target.as_str())?;
+    wget_download(COCO_DATASET_SRC_URL, coco_dataset_target.as_str())?;
+
+    Ok(())
+}
+
+fn git_clone(repo: &str, target: &str) -> Result<(), Error> {
+    println!("Calling git clone for '{repo}'.");
+    let target_git = format!("{target}/.git");
+    let path = Path::new(&target_git);
+    if let Ok(true) = path.try_exists() {
+        println!("Target '{target}' already contains a git repository, skip.");
+        return Ok(());
+    }
+
+    let mut git = Command::new("git");
+    git.args(["clone", repo, target]);
+
+    let mut child = git.spawn()?;
+    let exit_code = child.wait()?;
+    if exit_code.success() {
+        println!("git clone finished with success.");
+    } else {
+        bail!("git clone failed for '{target}'.");
+    }
+
+    Ok(())
+}
+
+fn git_checkout(checkout_target: &str, target: &str) -> Result<(), Error> {
+    println!("Calling git checkout '{checkout_target}'.");
+    let mut git = Command::new("git");
+    git.args(["-C", target, "checkout", checkout_target]);
+
+    let mut child = git.spawn()?;
+    let exit_code = child.wait()?;
+    if exit_code.success() {
+        println!("git checkout finished with success.");
+    } else {
+        bail!("git checkout '{checkout_target}' failed for '{target}'.");
+    }
+    Ok(())
+}
+
+fn wget_download(source_url: &str, target: &str) -> Result<(), Error> {
+    let path = Path::new(&target);
+    if let Ok(true) = path.try_exists() {
+        println!("Target '{target}' already exists, skip.");
+        return Ok(());
+    }
+    let zip = format!("{}/unlabeled2017.zip", target);
+    let path = Path::new(&zip);
+    if !path.try_exists()? {
+        println!("Download archive using wget from '{source_url}' to '{target}'.");
+        let mut wget = Command::new("wget");
+        wget.args(["-P", target, source_url]);
+
+        let mut child = wget.spawn()?;
+        let exit_code = child.wait()?;
+        if exit_code.success() {
+            println!("Download finished with success.");
+        } else {
+            bail!("Failed to download '{source_url}' to '{target}'.");
+        }
+        // fall through to extract the freshly downloaded archive
+    } else {
+        println!("Target '{target}' already contains download, skip download.");
+    }
+
+    let mut unzip = Command::new("unzip");
+    unzip.args([&zip, "-d", target]);
+
+    let mut child = unzip.spawn()?;
+    let exit_code = child.wait()?;
+    if exit_code.success() {
+        println!("Extracting zip archive finished with success.");
+    } else {
+        bail!("Failed to extract zip archive '{zip}' to '{target}'.");
+    }
+    Ok(())
+}
diff --git a/proxmox-backup-test-suite/src/main.rs b/proxmox-backup-test-suite/src/main.rs
new file mode 100644
index 000000000..0a5b436a8
--- /dev/null
+++ b/proxmox-backup-test-suite/src/main.rs
@@ -0,0 +1,17 @@
+use proxmox_router::cli::*;
+
+mod detection_mode_bench;
+
+fn main() {
+    let cmd_def = CliCommandMap::new().insert(
+        "detection-mode-bench",
+        detection_mode_bench::detection_mode_bench_mgtm_cli(),
+    );
+
+    let rpcenv = CliEnvironment::new();
+    run_cli_command(
+        cmd_def,
+        rpcenv,
+        Some(|future| proxmox_async::runtime::main(future)),
+    );
+}
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] [PATCH v3 proxmox-backup 58/58] test-suite: add bin to deb, add shell completions
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (56 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 57/58] test-suite: add detection mode change benchmark Christian Ebner
@ 2024-03-28 12:37 ` Christian Ebner
  2024-04-05 11:39 ` [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Fabian Grünbichler
  2024-04-29 12:13 ` Christian Ebner
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-03-28 12:37 UTC (permalink / raw)
  To: pbs-devel

Adds the required files for bash and zsh completion and packages the
binary to be included in the proxmox-backup-client debian package.

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
---
changes since version 2:
- no changes
- might be dropped in a future version and picked up later

 Makefile                                     | 13 ++++++++-----
 debian/proxmox-backup-client.bash-completion |  1 +
 debian/proxmox-backup-client.install         |  2 ++
 debian/proxmox-backup-test-suite.bc          |  8 ++++++++
 zsh-completions/_proxmox-backup-test-suite   | 13 +++++++++++++
 5 files changed, 32 insertions(+), 5 deletions(-)
 create mode 100644 debian/proxmox-backup-test-suite.bc
 create mode 100644 zsh-completions/_proxmox-backup-test-suite

diff --git a/Makefile b/Makefile
index 0317dd5e8..acaac3f78 100644
--- a/Makefile
+++ b/Makefile
@@ -8,11 +8,12 @@ SUBDIRS := etc www docs
 
 # Binaries usable by users
 USR_BIN := \
-	proxmox-backup-client 	\
-	proxmox-file-restore	\
-	pxar			\
-	proxmox-tape		\
-	pmtx			\
+	proxmox-backup-client 		\
+	proxmox-backup-test-suite 	\
+	proxmox-file-restore		\
+	pxar				\
+	proxmox-tape			\
+	pmtx				\
 	pmt
 
 # Binaries usable by admins
@@ -165,6 +166,8 @@ $(COMPILED_BINS) $(COMPILEDIR)/dump-catalog-shell-cli $(COMPILEDIR)/docgen: .do-
 	    --bin proxmox-backup-client \
 	    --bin dump-catalog-shell-cli \
 	    --bin proxmox-backup-debug \
+	    --package proxmox-backup-test-suite \
+	    --bin proxmox-backup-test-suite \
 	    --package proxmox-file-restore \
 	    --bin proxmox-file-restore \
 	    --package pxar-bin \
diff --git a/debian/proxmox-backup-client.bash-completion b/debian/proxmox-backup-client.bash-completion
index 437360175..c4ff02ae6 100644
--- a/debian/proxmox-backup-client.bash-completion
+++ b/debian/proxmox-backup-client.bash-completion
@@ -1,2 +1,3 @@
 debian/proxmox-backup-client.bc proxmox-backup-client
+debian/proxmox-backup-test-suite.bc proxmox-backup-test-suite
 debian/pxar.bc pxar
diff --git a/debian/proxmox-backup-client.install b/debian/proxmox-backup-client.install
index 74b568f17..0eb859757 100644
--- a/debian/proxmox-backup-client.install
+++ b/debian/proxmox-backup-client.install
@@ -1,6 +1,8 @@
 usr/bin/proxmox-backup-client
+usr/bin/proxmox-backup-test-suite
 usr/bin/pxar
 usr/share/man/man1/proxmox-backup-client.1
 usr/share/man/man1/pxar.1
 usr/share/zsh/vendor-completions/_proxmox-backup-client
+usr/share/zsh/vendor-completions/_proxmox-backup-test-suite
 usr/share/zsh/vendor-completions/_pxar
diff --git a/debian/proxmox-backup-test-suite.bc b/debian/proxmox-backup-test-suite.bc
new file mode 100644
index 000000000..2686d7eaa
--- /dev/null
+++ b/debian/proxmox-backup-test-suite.bc
@@ -0,0 +1,8 @@
+# proxmox-backup-test-suite bash completion
+
+# see http://tiswww.case.edu/php/chet/bash/FAQ
+# and __ltrim_colon_completions() in /usr/share/bash-completion/bash_completion
+# this modifies global var, but I found no better way
+COMP_WORDBREAKS=${COMP_WORDBREAKS//:}
+
+complete -C 'proxmox-backup-test-suite bashcomplete' proxmox-backup-test-suite
diff --git a/zsh-completions/_proxmox-backup-test-suite b/zsh-completions/_proxmox-backup-test-suite
new file mode 100644
index 000000000..72ebcea5f
--- /dev/null
+++ b/zsh-completions/_proxmox-backup-test-suite
@@ -0,0 +1,13 @@
+#compdef _proxmox-backup-test-suite() proxmox-backup-test-suite
+
+function _proxmox-backup-test-suite() {
+    local cwords line point cmd curr prev
+    cwords=${#words[@]}
+    line=$words
+    point=${#line}
+    cmd=${words[1]}
+    curr=${words[cwords]}
+    prev=${words[cwords-1]}
+    compadd -- $(COMP_CWORD="$cwords" COMP_LINE="$line" COMP_POINT="$point" \
+        proxmox-backup-test-suite bashcomplete "$cmd" "$curr" "$prev")
+}
-- 
2.39.2





^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] applied: [PATCH v3 pxar 01/58] encoder: fix two typos in comments
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments Christian Ebner
@ 2024-04-03  9:12   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03  9:12 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

applied this one already, together with a follow-up that fixes the doubled
"without" before it and the wrong "lengths", as well as some phrasing ;)

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  src/encoder/mod.rs | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
> index 0d342ec..c93f13b 100644
> --- a/src/encoder/mod.rs
> +++ b/src/encoder/mod.rs
> @@ -166,7 +166,7 @@ where
>      seq_write_all(output, &[0u8], position).await
>  }
>  
> -/// Write a pxar entry consiting of an endian-swappable struct.
> +/// Write a pxar entry consisting of an endian-swappable struct.
>  async fn seq_write_pxar_struct_entry<E, T>(
>      output: &mut T,
>      htype: u64,
> @@ -188,7 +188,7 @@ where
>  pub enum EncodeError {
>      /// The user dropped a `File` without without finishing writing all of its contents.
>      ///
> -    /// This is required because the payload lengths is written out at the begining and decoding
> +    /// This is required because the payload lengths is written out at the beginning and decoding
>      /// requires there to follow the right amount of data.
>      IncompleteFile,
>  
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry Christian Ebner
@ 2024-04-03  9:18   ` Fabian Grünbichler
  2024-04-03 11:02     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03  9:18 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Make the skip part reusable for a different input.
> 
> In preparation for skipping payload paddings in a separated input.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  src/decoder/mod.rs | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
> index cc50e4f..f439327 100644
> --- a/src/decoder/mod.rs
> +++ b/src/decoder/mod.rs
> @@ -563,15 +563,19 @@ impl<I: SeqRead> DecoderImpl<I> {
>      //
>  
>      async fn skip_entry(&mut self, offset: u64) -> io::Result<()> {
> -        let mut len = self.current_header.content_size() - offset;
> +        let len = (self.current_header.content_size() - offset) as usize;
> +        Self::skip(&mut self.input, len).await
> +    }
> +
> +    async fn skip(input: &mut I, len: usize) -> io::Result<()> {
> +        let mut len = len;

nit: this re-binding could just be part of the fn signature ;)
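
i.e.:

    async fn skip(input: &mut I, mut len: usize) -> io::Result<()> {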

>          let scratch = scratch_buffer();
> -        while len >= (scratch.len() as u64) {
> -            seq_read_exact(&mut self.input, scratch).await?;
> -            len -= scratch.len() as u64;
> +        while len >= (scratch.len()) {
> +            seq_read_exact(input, scratch).await?;
> +            len -= scratch.len();
>          }
> -        let len = len as usize;
>          if len > 0 {
> -            seq_read_exact(&mut self.input, &mut scratch[..len]).await?;
> +            seq_read_exact(input, &mut scratch[..len]).await?;
>          }
>          Ok(())
>      }
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking Christian Ebner
@ 2024-04-03  9:54   ` Fabian Grünbichler
  2024-04-03 11:01     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03  9:54 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> In preparation for the proxmox-backup-client look-ahead caching,
> where a passing around of different encoder instances with internal
> references is not feasible.
> 
> Instead of creating a new encoder instance for each directory level
> and keeping references to the parent state, use an internal stack.
> 
> This is a breaking change in the pxar library API.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - consume encoder with new `close` method to finalize
> - new output_state helper usable when for both a mut borrow is required
> - double checked use of state/state_mut usage
> - major refactoring
> 
>  examples/pxarcmd.rs  |   7 +-
>  src/encoder/aio.rs   |  26 ++--
>  src/encoder/mod.rs   | 285 +++++++++++++++++++++++--------------------
>  src/encoder/sync.rs  |  16 ++-
>  tests/simple/main.rs |   3 +
>  5 files changed, 187 insertions(+), 150 deletions(-)
> 
> diff --git a/examples/pxarcmd.rs b/examples/pxarcmd.rs
> index e0c779d..0294eba 100644
> --- a/examples/pxarcmd.rs
> +++ b/examples/pxarcmd.rs
> @@ -106,6 +106,7 @@ fn cmd_create(mut args: std::env::ArgsOs) -> Result<(), Error> {
>      let mut encoder = Encoder::create(file, &meta)?;
>      add_directory(&mut encoder, dir, &dir_path, &mut HashMap::new())?;
>      encoder.finish()?;
> +    encoder.close()?;
>  
>      Ok(())
>  }
> @@ -138,14 +139,14 @@ fn add_directory<'a, T: SeqWrite + 'a>(
>  
>          let meta = Metadata::from(&file_meta);
>          if file_type.is_dir() {
> -            let mut dir = encoder.create_directory(file_name, &meta)?;
> +            encoder.create_directory(file_name, &meta)?;
>              add_directory(
> -                &mut dir,
> +                encoder,
>                  std::fs::read_dir(file_path)?,
>                  root_path,
>                  &mut *hardlinks,
>              )?;
> -            dir.finish()?;
> +            encoder.finish()?;
>          } else if file_type.is_symlink() {
>              todo!("symlink handling");
>          } else if file_type.is_file() {
> diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
> index 31a1a2f..635e550 100644
> --- a/src/encoder/aio.rs
> +++ b/src/encoder/aio.rs
> @@ -109,20 +109,23 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
>          &mut self,
>          file_name: P,
>          metadata: &Metadata,
> -    ) -> io::Result<Encoder<'_, T>> {
> -        Ok(Encoder {
> -            inner: self
> -                .inner
> -                .create_directory(file_name.as_ref(), metadata)
> -                .await?,
> -        })
> +    ) -> io::Result<()> {
> +        self.inner
> +            .create_directory(file_name.as_ref(), metadata)
> +            .await
>      }
>  
> -    /// Finish this directory. This is mandatory, otherwise the `Drop` handler will `panic!`.
> -    pub async fn finish(self) -> io::Result<()> {
> +    /// Finish this directory. This is mandatory, encodes the end for the current directory.
> +    pub async fn finish(&mut self) -> io::Result<()> {
>          self.inner.finish().await
>      }
>  
> +    /// Close the encoder instance. This is mandatory, encodes the end for the optional payload
> +    /// output stream, if some is given
> +    pub async fn close(self) -> io::Result<()> {
> +        self.inner.close().await
> +    }
> +
>      /// Add a symbolic link to the archive.
>      pub async fn add_symlink<PF: AsRef<Path>, PT: AsRef<Path>>(
>          &mut self,
> @@ -307,11 +310,12 @@ mod test {
>                      .await
>                      .unwrap();
>              {
> -                let mut dir = encoder
> +                encoder
>                      .create_directory("baba", &Metadata::dir_builder(0o700).build())
>                      .await
>                      .unwrap();
> -                dir.create_file(&Metadata::file_builder(0o755).build(), "abab", 1024)
> +                encoder
> +                    .create_file(&Metadata::file_builder(0o755).build(), "abab", 1024)
>                      .await
>                      .unwrap();
>              }
> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
> index bff6acf..31bb0fa 100644
> --- a/src/encoder/mod.rs
> +++ b/src/encoder/mod.rs
> @@ -227,6 +227,16 @@ struct EncoderState {
>  }
>  
>  impl EncoderState {
> +    #[inline]
> +    fn position(&self) -> u64 {
> +        self.write_position
> +    }
> +
> +    #[inline]
> +    fn payload_position(&self) -> u64 {
> +        self.payload_write_position
> +    }
> +
>      fn merge_error(&mut self, error: Option<EncodeError>) {
>          // one error is enough:
>          if self.encode_error.is_none() {
> @@ -244,16 +254,6 @@ pub(crate) enum EncoderOutput<'a, T> {
>      Borrowed(&'a mut T),
>  }
>  
> -impl<'a, T> EncoderOutput<'a, T> {
> -    #[inline]
> -    fn to_borrowed_mut<'s>(&'s mut self) -> EncoderOutput<'s, T>
> -    where
> -        'a: 's,
> -    {
> -        EncoderOutput::Borrowed(self.as_mut())
> -    }
> -}
> -
>  impl<'a, T> std::convert::AsMut<T> for EncoderOutput<'a, T> {
>      fn as_mut(&mut self) -> &mut T {
>          match self {
> @@ -282,8 +282,8 @@ impl<'a, T> std::convert::From<&'a mut T> for EncoderOutput<'a, T> {
>  pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
>      output: EncoderOutput<'a, T>,
>      payload_output: EncoderOutput<'a, Option<T>>,
> -    state: EncoderState,
> -    parent: Option<&'a mut EncoderState>,
> +    /// EncoderState stack storing the state for each directory level
> +    state: Vec<EncoderState>,
>      finished: bool,
>  
>      /// Since only the "current" entry can be actively writing files, we share the file copy
> @@ -291,21 +291,6 @@ pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
>      file_copy_buffer: Arc<Mutex<Vec<u8>>>,
>  }
>  
> -impl<'a, T: SeqWrite + 'a> Drop for EncoderImpl<'a, T> {
> -    fn drop(&mut self) {
> -        if let Some(ref mut parent) = self.parent {
> -            // propagate errors:
> -            parent.merge_error(self.state.encode_error);
> -            if !self.finished {
> -                parent.add_error(EncodeError::IncompleteDirectory);
> -            }
> -        } else if !self.finished {
> -            // FIXME: how do we deal with this?
> -            // eprintln!("Encoder dropped without finishing!");
> -        }
> -    }
> -}

should we still have some sort of checks here? e.g., when dropping an
encoder, what should self.finished and self.state look like? IIUC, then a
dropped encoder should have an empty state and be finished (i.e.,
`close()` has been called on it).

or is this simply not relevant anymore because we only create one and
then drop it at the end (but should we then have a similar mechanism for
EncoderState?)
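
something along these lines (just a sketch, assuming `close()` drains the
state stack and sets `finished`) might keep a debug safety net:

    impl<'a, T: SeqWrite + 'a> Drop for EncoderImpl<'a, T> {
        fn drop(&mut self) {
            // a properly closed encoder has no leftover directory state
            if !self.finished || !self.state.is_empty() {
                // FIXME: panicking in Drop is problematic, so probably just
                // log here
                // eprintln!("EncoderImpl dropped without being closed!");
            }
        }
    }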

> -
>  impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>      pub async fn new(
>          output: EncoderOutput<'a, T>,
> @@ -318,8 +303,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          let mut this = Self {
>              output,
>              payload_output: EncoderOutput::Owned(None),
> -            state: EncoderState::default(),
> -            parent: None,
> +            state: vec![EncoderState::default()],
>              finished: false,
>              file_copy_buffer: Arc::new(Mutex::new(unsafe {
>                  crate::util::vec_new_uninitialized(1024 * 1024)
> @@ -327,7 +311,8 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          };
>  
>          this.encode_metadata(metadata).await?;
> -        this.state.files_offset = this.position();
> +        let state = this.state_mut()?;
> +        state.files_offset = state.position();
>  
>          if let Some(payload_output) = payload_output {
>              this.payload_output = EncoderOutput::Owned(Some(payload_output));
> @@ -337,13 +322,38 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>      }
>  
>      fn check(&self) -> io::Result<()> {
> -        match self.state.encode_error {
> +        if self.finished {
> +            io_bail!("unexpected encoder finished state");
> +        }
> +        let state = self.state()?;
> +        match state.encode_error {
>              Some(EncodeError::IncompleteFile) => io_bail!("incomplete file"),
>              Some(EncodeError::IncompleteDirectory) => io_bail!("directory not finalized"),
>              None => Ok(()),
>          }
>      }
>  
> +    fn state(&self) -> io::Result<&EncoderState> {
> +        self.state
> +            .last()
> +            .ok_or_else(|| io_format_err!("encoder state stack underflow"))
> +    }
> +
> +    fn state_mut(&mut self) -> io::Result<&mut EncoderState> {
> +        self.state
> +            .last_mut()
> +            .ok_or_else(|| io_format_err!("encoder state stack underflow"))
> +    }
> +
> +    fn output_state(&mut self) -> io::Result<(&mut T, &mut EncoderState)> {
> +        Ok((
> +            self.output.as_mut(),
> +            self.state
> +                .last_mut()
> +                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?,
> +        ))
> +    }
> +

we could have another helper here that also returns the Option<&mut T>
for payload_output (while not used as often, it might still be a good
idea for readability):

diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
index b0ec877..e8c5faa 100644
--- a/src/encoder/mod.rs
+++ b/src/encoder/mod.rs
@@ -387,6 +387,16 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         ))
     }
 
+    fn payload_output_state(&mut self) -> io::Result<(&mut T, Option<&mut T>, &mut EncoderState)> {
+        Ok((
+            self.output.as_mut(),
+            self.payload_output.as_mut().as_mut(),
+            self.state
+                .last_mut()
+                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?,
+        ))
+    }
+
     pub async fn create_file<'b>(
         &'b mut self,
         metadata: &Metadata,
@@ -414,12 +424,9 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         let file_offset = self.state()?.position();
         self.start_file_do(Some(metadata), file_name).await?;
 
-        if let Some(payload_output) = self.payload_output.as_mut() {
-            let state = self
-                .state
-                .last_mut()
-                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?;
+        let (output, payload_output, state) = self.payload_output_state()?;
 
+        if let Some(payload_output) = payload_output {
             // Position prior to the payload header
             let payload_position = state.payload_position();
 
@@ -435,7 +442,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
 
             // Write ref to metadata archive
             seq_write_pxar_entry(
-                self.output.as_mut(),
+                output,
                 format::PXAR_PAYLOAD_REF,
                 &payload_ref.data(),
                 &mut state.write_position,
@@ -444,21 +451,18 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
         } else {
             let header = format::Header::with_content_size(format::PXAR_PAYLOAD, file_size);
             header.check_header_size()?;
-            let (output, state) = self.output_state()?;
             seq_write_struct(output, header, &mut state.write_position).await?;
         }
 
-        let state = self
-            .state
-            .last_mut()
-            .ok_or_else(|| io_format_err!("encoder state stack underflow"))?;
+        let (output, payload_output, state) = self.payload_output_state()?;
+
         let payload_data_offset = state.position();
 
         let meta_size = payload_data_offset - file_offset;
 
         Ok(FileImpl {
-            output: self.output.as_mut(),
-            payload_output: self.payload_output.as_mut().as_mut(),
+            output,
+            payload_output,
             goodbye_item: GoodbyeItem {
                 hash: format::hash_filename(file_name),
                 offset: file_offset,

> [..]




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream Christian Ebner
@ 2024-04-03 10:38   ` Fabian Grünbichler
  2024-04-03 11:47     ` Christian Ebner
  2024-04-03 12:18     ` Christian Ebner
  0 siblings, 2 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03 10:38 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Implement an optional redirection to read the payload for regular files
> from a different input stream.
> 
> This allows to decode split stream archives.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - pass the payload input on decoder/accessor instantiation in order to
>   avoid possible adding/removing during decoding/accessing.
> - major refactoring

style nit: for those fns that take input and payload_input, it might
make sense to order them next to each other? IMHO it makes the call sites
more readable, especially in those cases where it's mostly passed along
;) for the constructors, I am a bit torn on which variant is nicer.

> 
>  examples/apxar.rs    |   2 +-
>  src/accessor/aio.rs  |  10 ++--
>  src/accessor/mod.rs  |  61 ++++++++++++++++++++++---
>  src/accessor/sync.rs |   8 ++--
>  src/decoder/aio.rs   |  14 ++++--
>  src/decoder/mod.rs   | 106 +++++++++++++++++++++++++++++++++++++++----
>  src/decoder/sync.rs  |  15 ++++--
>  src/lib.rs           |   3 ++
>  8 files changed, 184 insertions(+), 35 deletions(-)
> 
> diff --git a/examples/apxar.rs b/examples/apxar.rs
> index 0c62242..d5eb04e 100644
> --- a/examples/apxar.rs
> +++ b/examples/apxar.rs
> @@ -9,7 +9,7 @@ async fn main() {
>          .await
>          .expect("failed to open file");
>  
> -    let mut reader = Decoder::from_tokio(file)
> +    let mut reader = Decoder::from_tokio(file, None)
>          .await
>          .expect("failed to open pxar archive contents");
>  
> diff --git a/src/accessor/aio.rs b/src/accessor/aio.rs
> index 98d7755..0ebb921 100644
> --- a/src/accessor/aio.rs
> +++ b/src/accessor/aio.rs
> @@ -39,7 +39,7 @@ impl<T: FileExt> Accessor<FileReader<T>> {
>      /// by a blocking file.
>      #[inline]
>      pub async fn from_file_and_size(input: T, size: u64) -> io::Result<Self> {
> -        Accessor::new(FileReader::new(input), size).await
> +        Accessor::new(FileReader::new(input), size, None).await
>      }
>  }
>  
> @@ -75,7 +75,7 @@ where
>          input: T,
>          size: u64,
>      ) -> io::Result<Accessor<FileRefReader<T>>> {
> -        Accessor::new(FileRefReader::new(input), size).await
> +        Accessor::new(FileRefReader::new(input), size, None).await
>      }
>  }
>  
> @@ -85,9 +85,11 @@ impl<T: ReadAt> Accessor<T> {
>      ///
>      /// Note that the `input`'s `SeqRead` implementation must always return `Poll::Ready` and is
>      /// not allowed to use the `Waker`, as this will cause a `panic!`.
> -    pub async fn new(input: T, size: u64) -> io::Result<Self> {
> +    /// Optionally take the file payloads from the provided input stream rather than the regular
> +    /// pxar stream.
> +    pub async fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
>          Ok(Self {
> -            inner: accessor::AccessorImpl::new(input, size).await?,
> +            inner: accessor::AccessorImpl::new(input, size, payload_input).await?,
>          })
>      }
>  
> diff --git a/src/accessor/mod.rs b/src/accessor/mod.rs
> index 6a2de73..4789595 100644
> --- a/src/accessor/mod.rs
> +++ b/src/accessor/mod.rs
> @@ -182,10 +182,11 @@ pub(crate) struct AccessorImpl<T> {
>      input: T,
>      size: u64,
>      caches: Arc<Caches>,
> +    payload_input: Option<T>,
>  }
>  
>  impl<T: ReadAt> AccessorImpl<T> {
> -    pub async fn new(input: T, size: u64) -> io::Result<Self> {
> +    pub async fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
>          if size < (size_of::<GoodbyeItem>() as u64) {
>              io_bail!("too small to contain a pxar archive");
>          }
> @@ -194,6 +195,7 @@ impl<T: ReadAt> AccessorImpl<T> {
>              input,
>              size,
>              caches: Arc::new(Caches::default()),
> +            payload_input,
>          })
>      }
>  
> @@ -207,6 +209,9 @@ impl<T: ReadAt> AccessorImpl<T> {
>              self.size,
>              "/".into(),
>              Arc::clone(&self.caches),
> +            self.payload_input
> +                .as_ref()
> +                .map(|input| input as &dyn ReadAt),
>          )
>          .await
>      }
> @@ -228,7 +233,13 @@ async fn get_decoder<T: ReadAt>(
>      entry_range: Range<u64>,
>      path: PathBuf,
>  ) -> io::Result<DecoderImpl<SeqReadAtAdapter<T>>> {
> -    DecoderImpl::new_full(SeqReadAtAdapter::new(input, entry_range), path, true).await
> +    DecoderImpl::new_full(
> +        SeqReadAtAdapter::new(input, entry_range.clone()),
> +        path,
> +        true,
> +        None,
> +    )
> +    .await
>  }
>  
>  // NOTE: This performs the Decoder::read_next_item() behavior! Keep in mind when changing!
> @@ -263,6 +274,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>              self.size,
>              "/".into(),
>              Arc::clone(&self.caches),
> +            self.payload_input.clone(),
>          )
>          .await
>      }
> @@ -274,6 +286,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>              offset,
>              "/".into(),
>              Arc::clone(&self.caches),
> +            self.payload_input.clone(),
>          )
>          .await
>      }
> @@ -293,17 +306,23 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>              .next()
>              .await
>              .ok_or_else(|| io_format_err!("unexpected EOF while decoding file entry"))??;
> +
>          Ok(FileEntryImpl {
>              input: self.input.clone(),
>              entry,
>              entry_range_info: entry_range_info.clone(),
>              caches: Arc::clone(&self.caches),
> +            payload_input: self.payload_input.clone(),
>          })
>      }
>  
>      /// Allow opening arbitrary contents from a specific range.
>      pub unsafe fn open_contents_at_range(&self, range: Range<u64>) -> FileContentsImpl<T> {
> -        FileContentsImpl::new(self.input.clone(), range)
> +        if let Some(payload_input) = &self.payload_input {
> +            FileContentsImpl::new(payload_input.clone(), range)
> +        } else {
> +            FileContentsImpl::new(self.input.clone(), range)
> +        }
>      }
>  
>      /// Following a hardlink breaks a couple of conventions we otherwise have, particularly we will
> @@ -326,9 +345,12 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>  
>          let link_offset = entry_file_offset - link_offset;
>  
> -        let (mut decoder, entry_offset) =
> -            get_decoder_at_filename(self.input.clone(), link_offset..self.size, PathBuf::new())
> -                .await?;
> +        let (mut decoder, entry_offset) = get_decoder_at_filename(
> +            self.input.clone(),
> +            link_offset..self.size,
> +            PathBuf::new(),
> +        )
> +        .await?;
>  
>          let entry = decoder
>              .next()
> @@ -342,6 +364,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>              EntryKind::File {
>                  offset: Some(offset),
>                  size,
> +                ..
>              } => {
>                  let meta_size = offset - link_offset;
>                  let entry_end = link_offset + meta_size + size;
> @@ -353,6 +376,7 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>                          entry_range: entry_offset..entry_end,
>                      },
>                      caches: Arc::clone(&self.caches),
> +                    payload_input: self.payload_input.clone(),
>                  })
>              }
>              _ => io_bail!("hardlink does not point to a regular file"),
> @@ -369,6 +393,7 @@ pub(crate) struct DirectoryImpl<T> {
>      table: Arc<[GoodbyeItem]>,
>      path: PathBuf,
>      caches: Arc<Caches>,
> +    payload_input: Option<T>,
>  }
>  
>  impl<T: Clone + ReadAt> DirectoryImpl<T> {
> @@ -378,6 +403,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
>          end_offset: u64,
>          path: PathBuf,
>          caches: Arc<Caches>,
> +        payload_input: Option<T>,
>      ) -> io::Result<DirectoryImpl<T>> {
>          let tail = Self::read_tail_entry(&input, end_offset).await?;
>  
> @@ -407,6 +433,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
>              table: table.as_ref().map_or_else(|| Arc::new([]), Arc::clone),
>              path,
>              caches,
> +            payload_input,
>          };
>  
>          // sanity check:
> @@ -533,6 +560,7 @@ impl<T: Clone + ReadAt> DirectoryImpl<T> {
>                  entry_range: self.entry_range(),
>              },
>              caches: Arc::clone(&self.caches),
> +            payload_input: self.payload_input.clone(),
>          })
>      }
>  
> @@ -685,6 +713,7 @@ pub(crate) struct FileEntryImpl<T: Clone + ReadAt> {
>      entry: Entry,
>      entry_range_info: EntryRangeInfo,
>      caches: Arc<Caches>,
> +    payload_input: Option<T>,
>  }
>  
>  impl<T: Clone + ReadAt> FileEntryImpl<T> {
> @@ -698,6 +727,7 @@ impl<T: Clone + ReadAt> FileEntryImpl<T> {
>              self.entry_range_info.entry_range.end,
>              self.entry.path.clone(),
>              Arc::clone(&self.caches),
> +            self.payload_input.clone(),
>          )
>          .await
>      }
> @@ -711,14 +741,30 @@ impl<T: Clone + ReadAt> FileEntryImpl<T> {
>              EntryKind::File {
>                  size,
>                  offset: Some(offset),
> +                payload_offset: None,
>              } => Ok(Some(offset..(offset + size))),
> +            // Payload offset beats regular offset if some
> +            EntryKind::File {
> +                size,
> +                offset: Some(_offset),
> +                payload_offset: Some(payload_offset),
> +            } => {
> +                let start_offset = payload_offset + size_of::<format::Header>() as u64;
> +                Ok(Some(start_offset..start_offset + size))
> +            }
>              _ => Ok(None),
>          }
>      }
>  
>      pub async fn contents(&self) -> io::Result<FileContentsImpl<T>> {
>          match self.content_range()? {
> -            Some(range) => Ok(FileContentsImpl::new(self.input.clone(), range)),
> +            Some(range) => {
> +                if let Some(ref payload_input) = self.payload_input {
> +                    Ok(FileContentsImpl::new(payload_input.clone(), range))
> +                } else {
> +                    Ok(FileContentsImpl::new(self.input.clone(), range))
> +                }
> +            }
>              None => io_bail!("not a file"),

nit: would be easier to parse if it were

let range = ..
if let Some(ref payload_input) = .. {
..
} else {
..
}

and it would also mesh better with `open_contents_at_range` above.
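i.e., fully spelled out it might look like this (just a sketch of the
suggested shape, using the types from the patch):

```rust
pub async fn contents(&self) -> io::Result<FileContentsImpl<T>> {
    let range = match self.content_range()? {
        Some(range) => range,
        None => io_bail!("not a file"),
    };
    // mirror open_contents_at_range: prefer the payload input if attached
    if let Some(ref payload_input) = self.payload_input {
        Ok(FileContentsImpl::new(payload_input.clone(), range))
    } else {
        Ok(FileContentsImpl::new(self.input.clone(), range))
    }
}
```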

>          }
>      }
> @@ -808,6 +854,7 @@ impl<'a, T: Clone + ReadAt> DirEntryImpl<'a, T> {
>              entry,
>              entry_range_info: self.entry_range_info.clone(),
>              caches: Arc::clone(&self.caches),
> +            payload_input: self.dir.payload_input.clone(),
>          })
>      }
>  
> diff --git a/src/accessor/sync.rs b/src/accessor/sync.rs
> index a777152..6150a18 100644
> --- a/src/accessor/sync.rs
> +++ b/src/accessor/sync.rs
> @@ -31,7 +31,7 @@ impl<T: FileExt> Accessor<FileReader<T>> {
>      /// Decode a `pxar` archive from a standard file implementing `FileExt`.
>      #[inline]
>      pub fn from_file_and_size(input: T, size: u64) -> io::Result<Self> {
> -        Accessor::new(FileReader::new(input), size)
> +        Accessor::new(FileReader::new(input), size, None)
>      }
>  }
>  
> @@ -64,7 +64,7 @@ where
>  {
>      /// Open an `Arc` or `Rc` of `File`.
>      pub fn from_file_ref_and_size(input: T, size: u64) -> io::Result<Accessor<FileRefReader<T>>> {
> -        Accessor::new(FileRefReader::new(input), size)
> +        Accessor::new(FileRefReader::new(input), size, None)
>      }
>  }
>  
> @@ -74,9 +74,9 @@ impl<T: ReadAt> Accessor<T> {
>      ///
>      /// Note that the `input`'s `SeqRead` implementation must always return `Poll::Ready` and is
>      /// not allowed to use the `Waker`, as this will cause a `panic!`.
> -    pub fn new(input: T, size: u64) -> io::Result<Self> {
> +    pub fn new(input: T, size: u64, payload_input: Option<T>) -> io::Result<Self> {
>          Ok(Self {
> -            inner: poll_result_once(accessor::AccessorImpl::new(input, size))?,
> +            inner: poll_result_once(accessor::AccessorImpl::new(input, size, payload_input))?,
>          })
>      }
>  
> diff --git a/src/decoder/aio.rs b/src/decoder/aio.rs
> index 4de8c6f..bb032cf 100644
> --- a/src/decoder/aio.rs
> +++ b/src/decoder/aio.rs
> @@ -20,8 +20,12 @@ pub struct Decoder<T> {
>  impl<T: tokio::io::AsyncRead> Decoder<TokioReader<T>> {
>      /// Decode a `pxar` archive from a `tokio::io::AsyncRead` input.
>      #[inline]
> -    pub async fn from_tokio(input: T) -> io::Result<Self> {
> -        Decoder::new(TokioReader::new(input)).await
> +    pub async fn from_tokio(input: T, payload_input: Option<T>) -> io::Result<Self> {
> +        Decoder::new(
> +            TokioReader::new(input),
> +            payload_input.map(|payload_input| TokioReader::new(payload_input)),
> +        )
> +        .await
>      }
>  }
>  
> @@ -30,15 +34,15 @@ impl Decoder<TokioReader<tokio::fs::File>> {
>      /// Decode a `pxar` archive from a `tokio::io::AsyncRead` input.
>      #[inline]
>      pub async fn open<P: AsRef<Path>>(path: P) -> io::Result<Self> {
> -        Decoder::from_tokio(tokio::fs::File::open(path.as_ref()).await?).await
> +        Decoder::from_tokio(tokio::fs::File::open(path.as_ref()).await?, None).await
>      }
>  }
>  
>  impl<T: SeqRead> Decoder<T> {
>      /// Create an async decoder from an input implementing our internal read interface.
> -    pub async fn new(input: T) -> io::Result<Self> {
> +    pub async fn new(input: T, payload_input: Option<T>) -> io::Result<Self> {
>          Ok(Self {
> -            inner: decoder::DecoderImpl::new(input).await?,
> +            inner: decoder::DecoderImpl::new(input, payload_input).await?,
>          })
>      }
>  
> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
> index f439327..8cc4877 100644
> --- a/src/decoder/mod.rs
> +++ b/src/decoder/mod.rs
> @@ -157,6 +157,10 @@ pub(crate) struct DecoderImpl<T> {
>      state: State,
>      with_goodbye_tables: bool,
>  
> +    // Payload of regular files might be provided by a different reader
> +    payload_input: Option<T>,
> +    payload_consumed: u64,
> +
>      /// The random access code uses decoders for sub-ranges which may not end in a `PAYLOAD` for
>      /// entries like FIFOs or sockets, so there we explicitly allow an item to terminate with EOF.
>      eof_after_entry: bool,
> @@ -167,6 +171,8 @@ enum State {
>      Default,
>      InPayload {
>          offset: u64,
> +        size: u64,
> +        payload_ref: bool,
>      },
>  
>      /// file entries with no data (fifo, socket)
> @@ -195,8 +201,8 @@ pub(crate) enum ItemResult {
>  }
>  
>  impl<I: SeqRead> DecoderImpl<I> {
> -    pub async fn new(input: I) -> io::Result<Self> {
> -        Self::new_full(input, "/".into(), false).await
> +    pub async fn new(input: I, payload_input: Option<I>) -> io::Result<Self> {
> +        Self::new_full(input, "/".into(), false, payload_input).await
>      }
>  
>      pub(crate) fn input(&self) -> &I {
> @@ -207,6 +213,7 @@ impl<I: SeqRead> DecoderImpl<I> {
>          input: I,
>          path: PathBuf,
>          eof_after_entry: bool,
> +        payload_input: Option<I>,
>      ) -> io::Result<Self> {
>          let this = DecoderImpl {
>              input,
> @@ -219,6 +226,8 @@ impl<I: SeqRead> DecoderImpl<I> {
>              path_lengths: Vec::new(),
>              state: State::Begin,
>              with_goodbye_tables: false,
> +            payload_input,
> +            payload_consumed: 0,
>              eof_after_entry,
>          };
>  
> @@ -242,9 +251,18 @@ impl<I: SeqRead> DecoderImpl<I> {
>                      // hierarchy and parse the next PXAR_FILENAME or the PXAR_GOODBYE:
>                      self.read_next_item().await?;
>                  }
> -                State::InPayload { offset } => {
> -                    // We need to skip the current payload first.
> -                    self.skip_entry(offset).await?;
> +                State::InPayload {
> +                    offset,
> +                    payload_ref,
> +                    ..
> +                } => {
> +                    if payload_ref {
> +                        // Update consumed payload as given by the offset referenced by the content reader
> +                        self.payload_consumed += offset;
> +                    } else if self.payload_input.is_none() {
> +                        // Skip remaining payload of current entry in regular stream
> +                        self.skip_entry(offset).await?;
> +                    }

I am a bit confused by this here - shouldn't all payloads be encoded via
refs now if we have a split archive? and vice versa? why the second
condition? and what if I pass a bogus payload input for an archive that
doesn't contain any references?

>                      self.read_next_item().await?;
>                  }
>                  State::InGoodbyeTable => {
> @@ -308,11 +326,19 @@ impl<I: SeqRead> DecoderImpl<I> {
>      }
>  
>      pub fn content_reader(&mut self) -> Option<Contents<I>> {
> -        if let State::InPayload { offset } = &mut self.state {
> +        if let State::InPayload {
> +            offset,
> +            size,
> +            payload_ref,
> +        } = &mut self.state
> +        {
> +            if *payload_ref && self.payload_input.is_none() {
> +                return None;
> +            }
>              Some(Contents::new(
> -                &mut self.input,
> +                self.payload_input.as_mut().unwrap_or(&mut self.input),
>                  offset,
> -                self.current_header.content_size(),
> +                *size,

similar here..

e.g., something like this:

            let input = if *payload_ref {
                if let Some(payload_input) = self.payload_input.as_mut() {
                    payload_input
                } else {
                    return None;
                }
            } else {
                &mut self.input
            };
            Some(Contents::new(input, offset, *size))

although technically we do have an invariant there that we could check -
we shouldn't encounter a non-ref payload when we have a payload_input..
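e.g., folding that invariant in (sketch; whether returning `None` is the
right reaction to a corrupt archive is debatable):

```rust
let input = if *payload_ref {
    // references cannot be resolved without a payload input
    self.payload_input.as_mut()?
} else if self.payload_input.is_some() {
    // invariant violated: a split archive should only contain
    // payload references
    return None;
} else {
    &mut self.input
};
Some(Contents::new(input, offset, *size))
```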

>              ))
>          } else {
>              None
> @@ -531,8 +557,60 @@ impl<I: SeqRead> DecoderImpl<I> {
>                  self.entry.kind = EntryKind::File {
>                      size: self.current_header.content_size(),
>                      offset,
> +                    payload_offset: None,
> +                };
> +                self.state = State::InPayload {
> +                    offset: 0,
> +                    size: self.current_header.content_size(),
> +                    payload_ref: false,
> +                };
> +                return Ok(ItemResult::Entry);
> +            }
> +            format::PXAR_PAYLOAD_REF => {
> +                let offset = seq_read_position(&mut self.input).await.transpose()?;
> +                let payload_ref = self.read_payload_ref().await?;
> +
> +                if let Some(payload_input) = self.payload_input.as_mut() {

this condition (cted below)

> +                    if seq_read_position(payload_input)
> +                        .await
> +                        .transpose()?
> +                        .is_none()
> +                    {
> +                        // Skip payload padding for injected chunks in sequential decoder
> +                        let to_skip = payload_ref.offset - self.payload_consumed;

should we add a check here for the invariant that offsets should only
ever be increasing? (and avoid an underflow for corrupt/invalid archives
;))
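e.g., a `checked_sub` would cover both concerns (sketch):

```rust
if seq_read_position(payload_input).await.transpose()?.is_none() {
    // offsets must only ever increase; checked_sub turns a backwards
    // jump in a corrupt/invalid archive into an error instead of an
    // underflow
    let to_skip = payload_ref
        .offset
        .checked_sub(self.payload_consumed)
        .ok_or_else(|| {
            io_format_err!(
                "payload reference offset {} before consumed offset {}",
                payload_ref.offset,
                self.payload_consumed,
            )
        })?;
    self.skip_payload(to_skip).await?;
}
```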

> +                        self.skip_payload(to_skip).await?;
> +                    }
> +                }
> +
> +                if let Some(payload_input) = self.payload_input.as_mut() {

and this condition here are the same?

> +                    let header: u64 = seq_read_entry(payload_input).await?;

why not read a Header here?

> +                    if header != format::PXAR_PAYLOAD {
> +                        io_bail!(
> +                            "unexpected header in payload input: expected {} , got {header}",
> +                            format::PXAR_PAYLOAD,
> +                        );
> +                    }
> +                    let size: u64 = seq_read_entry(payload_input).await?;
> +                    self.payload_consumed += size_of::<Header>() as u64;
> +
> +                    if size != payload_ref.size + size_of::<Header>() as u64 {
> +                        io_bail!(
> +                            "encountered payload size mismatch: got {}, expected {size}",
> +                            payload_ref.size
> +                        );
> +                    }

then these could use the size helpers of Header ;)

> +                }
> +
> +                self.entry.kind = EntryKind::File {
> +                    size: payload_ref.size,
> +                    offset,
> +                    payload_offset: Some(payload_ref.offset),
> +                };
> +                self.state = State::InPayload {
> +                    offset: 0,
> +                    size: payload_ref.size,
> +                    payload_ref: true,
>                  };
> -                self.state = State::InPayload { offset: 0 };
>                  return Ok(ItemResult::Entry);
>              }
>              format::PXAR_FILENAME | format::PXAR_GOODBYE => {

> [..]




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking
  2024-04-03  9:54   ` Fabian Grünbichler
@ 2024-04-03 11:01     ` Christian Ebner
  2024-04-04  8:48       ` Fabian Grünbichler
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 11:01 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 11:54, Fabian Grünbichler wrote:
> 
> should we still have some sort of checks here? e.g., when dropping an
> encoder, how should self.finished and self.state look like? IIUC, then a
> dropped encoder should have an empty state and be finished (i.e.,
> `close()` has been called on it).
> 
> or is this simply not relevant anymore because we only create one and
> then drop it at the end (but should we then have a similar mechanism for
> EncoderState?)

The encoder should now be consumed with the `close` call, which takes 
ownership of the encoder and drops it afterwards, so all the state 
checks should happen there.

Previously, the encoder finish consumed the per-directory level encoder 
object, passing possible errors up to the parent implementation, which 
is not possible now since there is only one encoder instance. I did not 
want to panic here, as the checks should now be done in `close`, so the 
Drop implementation was removed.

Not sure what to check in a Drop implementation for the EncoderState. What 
did you have in mind for that? Note that errors get propagated to the 
parent state in the encoder finish calls now.
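If we keep something, I could only think of a debug assertion along these
lines (a sketch, assuming the `finished` flag from the previous version of
the struct):

```rust
impl<'a, T: SeqWrite + 'a> Drop for EncoderImpl<'a, T> {
    fn drop(&mut self) {
        // a properly closed encoder has run `close()`: no leftover
        // per-directory state and the finished flag set
        debug_assert!(
            self.state.is_empty() && self.finished,
            "EncoderImpl dropped without being closed",
        );
    }
}
```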

>> +    fn output_state(&mut self) -> io::Result<(&mut T, &mut EncoderState)> {
>> +        Ok((
>> +            self.output.as_mut(),
>> +            self.state
>> +                .last_mut()
>> +                .ok_or_else(|| io_format_err!("encoder state stack underflow"))?,
>> +        ))
>> +    }
>> +
> 
> we could have another helper here that also returns the Option<&mut T>
> for payload_output (while not used as often, it might still be a good
> idea for readability):

Okay, yes I can add that.
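Something along these lines, I assume (a sketch, with payload_output kept
as an `Option<T>` on the encoder):

```rust
fn payload_output_state(&mut self) -> io::Result<(&mut T, Option<&mut T>, &mut EncoderState)> {
    Ok((
        self.output.as_mut(),
        // also hand out the optional payload output alongside the state
        self.payload_output.as_mut(),
        self.state
            .last_mut()
            .ok_or_else(|| io_format_err!("encoder state stack underflow"))?,
    ))
}
```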





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry
  2024-04-03  9:18   ` Fabian Grünbichler
@ 2024-04-03 11:02     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 11:02 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 11:18, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Make the skip part reusable for a different input.
>>
>> In preparation for skipping payload paddings in a separated input.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - not present in previous version
>>
>>   src/decoder/mod.rs | 16 ++++++++++------
>>   1 file changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
>> index cc50e4f..f439327 100644
>> --- a/src/decoder/mod.rs
>> +++ b/src/decoder/mod.rs
>> @@ -563,15 +563,19 @@ impl<I: SeqRead> DecoderImpl<I> {
>>       //
>>   
>>       async fn skip_entry(&mut self, offset: u64) -> io::Result<()> {
>> -        let mut len = self.current_header.content_size() - offset;
>> +        let len = (self.current_header.content_size() - offset) as usize;
>> +        Self::skip(&mut self.input, len).await
>> +    }
>> +
>> +    async fn skip(input: &mut I, len: usize) -> io::Result<()> {
>> +        let mut len = len;
> 
> nit: this re-binding could just be part of the fn signature ;)

Okay, moved the mut binding to the function signature.
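I.e. (a sketch of the new signature; the body is just a stand-in for the
actual skip loop, assuming the `seq_read_exact` helper):

```rust
async fn skip(input: &mut I, mut len: usize) -> io::Result<()> {
    // `mut len` is bound in the signature, no re-binding in the body
    let mut buf = [0u8; 4096];
    while len > 0 {
        let take = len.min(buf.len());
        seq_read_exact(input, &mut buf[..take]).await?;
        len -= take;
    }
    Ok(())
}
```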





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry Christian Ebner
@ 2024-04-03 11:41   ` Fabian Grünbichler
  2024-04-03 13:31     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03 11:41 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Adds an additional entry type at the start of each pxar archive
> signaling the encoding format version. If not present, the default
> version 1 is assumed.
> 
> This allows to early on detect the pxar encoding version, allowing tools
> to switch mode or bail on non compatible encoder/decoder functionality.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  examples/mk-format-hashes.rs |  5 +++++
>  src/decoder/mod.rs           | 29 ++++++++++++++++++++++++++--
>  src/encoder/mod.rs           | 37 +++++++++++++++++++++++++++++++++---
>  src/format/mod.rs            | 11 +++++++++++
>  src/lib.rs                   |  3 +++
>  5 files changed, 80 insertions(+), 5 deletions(-)
> 
> diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
> index 35cff99..e5d69b1 100644
> --- a/examples/mk-format-hashes.rs
> +++ b/examples/mk-format-hashes.rs
> @@ -1,6 +1,11 @@
>  use pxar::format::hash_filename;
>  
>  const CONSTANTS: &[(&str, &str, &str)] = &[
> +    (
> +        "Pxar format version entry, fallback to version 1 if not present",
> +        "PXAR_FORMAT_VERSION",
> +        "__PROXMOX_FORMAT_VERSION__",
> +    ),
>      (
>          "Beginning of an entry (current version).",
>          "PXAR_ENTRY",
> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
> index 00d9abf..5b2fafb 100644
> --- a/src/decoder/mod.rs
> +++ b/src/decoder/mod.rs
> @@ -17,7 +17,7 @@ use std::task::{Context, Poll};
>  
>  use endian_trait::Endian;
>  
> -use crate::format::{self, Header};
> +use crate::format::{self, FormatVersion, Header};
>  use crate::util::{self, io_err_other};
>  use crate::{Entry, EntryKind, Metadata};
>  
> @@ -164,6 +164,8 @@ pub(crate) struct DecoderImpl<T> {
>      /// The random access code uses decoders for sub-ranges which may not end in a `PAYLOAD` for
>      /// entries like FIFOs or sockets, so there we explicitly allow an item to terminate with EOF.
>      eof_after_entry: bool,
> +    /// The format version as determined by the format version header
> +    version: format::FormatVersion,
>  }
>  
>  enum State {
> @@ -242,6 +244,7 @@ impl<I: SeqRead> DecoderImpl<I> {
>              payload_input,
>              payload_consumed,
>              eof_after_entry,
> +            version: FormatVersion::default(),
>          };
>  
>          // this.read_next_entry().await?;
> @@ -258,7 +261,16 @@ impl<I: SeqRead> DecoderImpl<I> {
>          loop {
>              match self.state {
>                  State::Eof => return Ok(None),
> -                State::Begin => return self.read_next_entry().await.map(Some),
> +                State::Begin => {
> +                    let entry = self.read_next_entry().await.map(Some);
> +                    if let Ok(Some(ref entry)) = entry {
> +                        if let EntryKind::Version(version) = entry.kind() {
> +                            self.version = version.clone();
> +                            return self.read_next_entry().await.map(Some);
> +                        }
> +                    }
> +                    return entry;

a bit unsure here, if we want to enforce the order, wouldn't it be
cleaner to transition to a new state here rather than adding more nested
ifs over time? ;)

> +                }
>                  State::Default => {
>                      // we completely finished an entry, so now we're going "up" in the directory
>                      // hierarchy and parse the next PXAR_FILENAME or the PXAR_GOODBYE:
> @@ -412,6 +424,11 @@ impl<I: SeqRead> DecoderImpl<I> {
>              self.entry.metadata = Metadata::default();
>              self.entry.kind = EntryKind::Hardlink(self.read_hardlink().await?);
>  
> +            Ok(Some(self.entry.take()))
> +        } else if header.htype == format::PXAR_FORMAT_VERSION {
> +            self.current_header = header;
> +            self.entry.kind = EntryKind::Version(self.read_format_version().await?);
> +
>              Ok(Some(self.entry.take()))
>          } else if header.htype == format::PXAR_ENTRY || header.htype == format::PXAR_ENTRY_V1 {
>              if header.htype == format::PXAR_ENTRY {
> @@ -777,6 +794,14 @@ impl<I: SeqRead> DecoderImpl<I> {
>  
>          seq_read_entry(&mut self.input).await
>      }
> +
> +    async fn read_format_version(&mut self) -> io::Result<format::FormatVersion> {
> +        match seq_read_entry(&mut self.input).await? {
> +            1u64 => Ok(format::FormatVersion::Version1),

this should never happen though, right?

> +            2u64 => Ok(format::FormatVersion::Version2),

also this (cted below)

> +            _ => io_bail!("unexpected pxar format version"),

this should maybe include the value? ;)

> +        }
> +    }
>  }
>  
>  /// Reader for file contents inside a pxar archive.
> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
> index 88c0ed5..9270153 100644
> --- a/src/encoder/mod.rs
> +++ b/src/encoder/mod.rs
> @@ -17,7 +17,7 @@ use endian_trait::Endian;
>  
>  use crate::binary_tree_array;
>  use crate::decoder::{self, SeqRead};
> -use crate::format::{self, GoodbyeItem, PayloadRef};
> +use crate::format::{self, FormatVersion, GoodbyeItem, PayloadRef};
>  use crate::Metadata;
>  
>  pub mod aio;
> @@ -307,6 +307,8 @@ pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
>      /// Since only the "current" entry can be actively writing files, we share the file copy
>      /// buffer.
>      file_copy_buffer: Arc<Mutex<Vec<u8>>>,
> +    /// Pxar format version to encode
> +    version: format::FormatVersion,
>  }
>  
>  impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
> @@ -320,11 +322,14 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          }
>  
>          let mut state = EncoderState::default();
> -        if let Some(payload_output) = payload_output.as_mut() {
> +        let version = if let Some(payload_output) = payload_output.as_mut() {
>              let header = format::Header::with_content_size(format::PXAR_PAYLOAD_START_MARKER, 0);
>              header.check_header_size()?;
>              seq_write_struct(payload_output, header, &mut state.payload_write_position).await?;
> -        }
> +            format::FormatVersion::Version2
> +        } else {
> +            format::FormatVersion::default()

shouldn't this be Version1 instead of default()? they are the same
*now*, but that might not be the case forever?

> +        };
>  
>          let mut this = Self {
>              output,
> @@ -334,8 +339,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>              file_copy_buffer: Arc::new(Mutex::new(unsafe {
>                  crate::util::vec_new_uninitialized(1024 * 1024)
>              })),
> +            version,
>          };
>  
> +        this.encode_format_version().await?;
>          this.encode_metadata(metadata).await?;
>          let state = this.state_mut()?;
>          state.files_offset = state.position();
> @@ -522,6 +529,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          file_size: u64,
>          payload_offset: PayloadOffset,
>      ) -> io::Result<()> {
> +        if self.version == FormatVersion::Version1 {
> +            io_bail!("payload references not supported pxar format version 1");
> +        }
> +
>          if self.payload_output.as_mut().is_none() {
>              io_bail!("unable to add payload reference");
>          }
> @@ -729,6 +740,26 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          Ok(())
>      }
>  
> +    async fn encode_format_version(&mut self) -> io::Result<()> {
> +		let version_bytes = match self.version {
> +			format::FormatVersion::Version1 => return Ok(()),
> +			format::FormatVersion::Version2 => 2u64.to_le_bytes(),

(cted from above) and this here should maybe go together?

> +		};
> +
> +        let (output, state) = self.output_state()?;
> +		if state.write_position != 0 {
> +			io_bail!("pxar format version must be encoded at the beginning of an archive");

should this also be enforced while decoding?

should we also encode a/the version of the payload archive?

> +		}
> +
> +        seq_write_pxar_entry(
> +            output,
> +            format::PXAR_FORMAT_VERSION,
> +            &version_bytes,
> +            &mut state.write_position,
> +        )
> +        .await
> +    }
> +
>      async fn encode_metadata(&mut self, metadata: &Metadata) -> io::Result<()> {
>          let (output, state) = self.output_state()?;
>          seq_write_pxar_struct_entry(
> diff --git a/src/format/mod.rs b/src/format/mod.rs
> index a672d19..2bf33c9 100644
> --- a/src/format/mod.rs
> +++ b/src/format/mod.rs
> @@ -6,6 +6,7 @@
>  //! item data.
>  //!
>  //! An archive contains items in the following order:
> +//!  * `FORMAT_VERSION`     -- (optional for v1), version of encoding format
>  //!  * `ENTRY`              -- containing general stat() data and related bits
>  //!   * `XATTR`             -- one extended attribute
>  //!   * ...                 -- more of these when there are multiple defined
> @@ -80,6 +81,8 @@ pub mod mode {
>  }
>  
>  // Generated by `cargo run --example mk-format-hashes`
> +/// Pxar format version entry, fallback to version 1 if not present
> +pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
>  /// Beginning of an entry (current version).
>  pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
>  /// Previous version of the entry struct
> @@ -186,6 +189,7 @@ impl Header {
>  impl Display for Header {
>      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
>          let readable = match self.htype {
> +            PXAR_FORMAT_VERSION => "FORMAT_VERSION",
>              PXAR_FILENAME => "FILENAME",
>              PXAR_SYMLINK => "SYMLINK",
>              PXAR_HARDLINK => "HARDLINK",
> @@ -551,6 +555,13 @@ impl From<&std::fs::Metadata> for Stat {
>      }
>  }
>  
> +#[derive(Clone, Debug, Default, PartialEq)]
> +pub enum FormatVersion {
> +    #[default]
> +    Version1,
> +    Version2,
> +}
> +
>  #[derive(Clone, Debug)]
>  pub struct Filename {
>      pub name: Vec<u8>,
> diff --git a/src/lib.rs b/src/lib.rs
> index ef81a85..a87b5ac 100644
> --- a/src/lib.rs
> +++ b/src/lib.rs
> @@ -342,6 +342,9 @@ impl Acl {
>  /// Identifies whether the entry is a file, symlink, directory, etc.
>  #[derive(Clone, Debug)]
>  pub enum EntryKind {
> +    /// Pxar file format version
> +    Version(format::FormatVersion),
> +
>      /// Symbolic links.
>      Symlink(format::Symlink),
>  
> -- 
> 2.39.2
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-04-03 10:38   ` Fabian Grünbichler
@ 2024-04-03 11:47     ` Christian Ebner
  2024-04-03 12:18     ` Christian Ebner
  1 sibling, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 11:47 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 12:38, Fabian Grünbichler wrote:

> style nit: for those fns that take input and payload_input, it might
> make sense to order them next to each other? IMHO it makes the call sites
> more readable, especially in those cases where it's mostly passed along
> ;) for the constructors, I am a bit torn on which variant is nicer.

Maybe I can do that as a followup, as this already touches so much stuff 
it feels wrong to also move implementations around?

> 
> nit: would be easier to parse if it were
> 
> let range = ..
> if let Some(ref payload_input) = .. {
> ..
> } else {
> ..
> }
> 
> and it would also mesh better with `open_contents_at_range` above.

Ah yes, will change that!

> 
> I am a bit confused by this here - shouldn't all payloads be encoded via
> refs now if we have a split archive? and vice versa? why the second
> condition? and what if I pass a bogus payload input for an archive that
> doesn't contain any references?

Got me on this one, sorry for confusing you here: this was added with 
the intention to not encode files with zero payload size in the payload 
stream, but rather in the metadata stream for space optimization. 
However, this caused issues, so it was removed again on the proxmox-backup 
side, but I forgot to remove it here as well. Will remove this for the 
next version.

> 
> similar here..
> 
> e.g., something like this:
> 
>              let input = if *payload_ref {
>                  if let Some(payload_input) = self.payload_input.as_mut() {
>                      payload_input
>                  } else {
>                      return None;
>                  }
>              } else {
>                  &mut self.input
>              };
>              Some(Contents::new(input, offset, *size))
> 
> although technically we do have an invariant there that we could check -
> we shouldn't encounter a non-ref payload when we have a payload_input..

Yes, same as above, this was added for the zero payload files.





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params Christian Ebner
@ 2024-04-03 12:01   ` Fabian Grünbichler
  2024-04-03 14:41     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-03 12:01 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Add an additional entry type PXAR_CLI_PARAMS which is used to store
> additional metadata passed by the cli arguments such as the pxar cli
> exclude patterns.
> 
> The content is encoded as an arbitrary byte slice. The entry must be
> encoded right after the pxar format version entry; it is not possible to
> encode this with the previous format version 1.

since (from pxar's perspective) this is just an opaque blob of data,
isn't PXAR_CLI_PARAMS a bit of a misnomer? do we want a single blob, or
multiple delineated ones (might be more handy for the client using it)?

> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  examples/mk-format-hashes.rs |  1 +
>  src/accessor/mod.rs          |  9 +++-----
>  src/decoder/mod.rs           | 18 +++++++++++++++-
>  src/encoder/aio.rs           | 19 ++++++++++++-----
>  src/encoder/mod.rs           | 40 +++++++++++++++++++++++++++++-------
>  src/encoder/sync.rs          | 11 ++++++++--
>  src/format/mod.rs            | 26 +++++++++++++++++++++++
>  src/lib.rs                   |  3 +++
>  8 files changed, 106 insertions(+), 21 deletions(-)
> 
> diff --git a/examples/mk-format-hashes.rs b/examples/mk-format-hashes.rs
> index e5d69b1..12394f3 100644
> --- a/examples/mk-format-hashes.rs
> +++ b/examples/mk-format-hashes.rs
> @@ -16,6 +16,7 @@ const CONSTANTS: &[(&str, &str, &str)] = &[
>          "PXAR_ENTRY_V1",
>          "__PROXMOX_FORMAT_ENTRY__",
>      ),
> +    ("", "PXAR_CLI_PARAMS", "__PROXMOX_FORMAT_CLI_PARAMS__"),
>      ("", "PXAR_FILENAME", "__PROXMOX_FORMAT_FILENAME__"),
>      ("", "PXAR_SYMLINK", "__PROXMOX_FORMAT_SYMLINK__"),
>      ("", "PXAR_DEVICE", "__PROXMOX_FORMAT_DEVICE__"),
> diff --git a/src/accessor/mod.rs b/src/accessor/mod.rs
> index 4789595..3b6ae44 100644
> --- a/src/accessor/mod.rs
> +++ b/src/accessor/mod.rs
> @@ -345,12 +345,9 @@ impl<T: Clone + ReadAt> AccessorImpl<T> {
>  
>          let link_offset = entry_file_offset - link_offset;
>  
> -        let (mut decoder, entry_offset) = get_decoder_at_filename(
> -            self.input.clone(),
> -            link_offset..self.size,
> -            PathBuf::new(),
> -        )
> -        .await?;
> +        let (mut decoder, entry_offset) =
> +            get_decoder_at_filename(self.input.clone(), link_offset..self.size, PathBuf::new())
> +                .await?;
>  
>          let entry = decoder
>              .next()

this whole hunk just reverts a change done earlier in the same series
(forgotten `cargo fmt` for the first patch maybe? ;))

> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
> index 5b2fafb..4170b2f 100644
> --- a/src/decoder/mod.rs
> +++ b/src/decoder/mod.rs
> @@ -266,7 +266,13 @@ impl<I: SeqRead> DecoderImpl<I> {
>                      if let Ok(Some(ref entry)) = entry {
>                          if let EntryKind::Version(version) = entry.kind() {
>                              self.version = version.clone();
> -                            return self.read_next_entry().await.map(Some);
> +                            let entry = self.read_next_entry().await.map(Some);
> +                            if let Ok(Some(ref entry)) = entry {
> +                                if let EntryKind::CliParams(_) = entry.kind() {
> +                                    return self.read_next_entry().await.map(Some);
> +                                }
> +                            }
> +                            return entry;

so maybe we want a new State::Prelude or something that we transition to
from Begin if we encounter a FormatVersion, then we can match this and
future "special" entries before proceeding with the regular archive?

>                          }
>                      }
>                      return entry;
> @@ -429,6 +435,11 @@ impl<I: SeqRead> DecoderImpl<I> {
>              self.current_header = header;
>              self.entry.kind = EntryKind::Version(self.read_format_version().await?);
>  
> +            Ok(Some(self.entry.take()))
> +        } else if header.htype == format::PXAR_CLI_PARAMS {
> +            self.current_header = header;
> +            self.entry.kind = EntryKind::CliParams(self.read_cli_params().await?);
> +

and here (well, not here, at the start of read_next_entry_or_eof ;)) we
should maybe save the previous state before setting it to Default, so
that we can then check it for some header types like FormatVersion or
CliParams to ensure a misconstructed input cannot confuse our state
machine/decoder?

>              Ok(Some(self.entry.take()))
>          } else if header.htype == format::PXAR_ENTRY || header.htype == format::PXAR_ENTRY_V1 {
>              if header.htype == format::PXAR_ENTRY {
> @@ -802,6 +813,11 @@ impl<I: SeqRead> DecoderImpl<I> {
>              _ => io_bail!("unexpected pxar format version"),
>          }
>      }
> +
> +    async fn read_cli_params(&mut self) -> io::Result<format::CliParams> {
> +        let data = self.read_entry_as_bytes().await?;
> +        Ok(format::CliParams { data })
> +    }
>  }
>  
>  /// Reader for file contents inside a pxar archive.
> diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
> index 6da32bd..956b2a3 100644
> --- a/src/encoder/aio.rs
> +++ b/src/encoder/aio.rs
> @@ -25,11 +25,13 @@ impl<'a, T: tokio::io::AsyncWrite + 'a> Encoder<'a, TokioWriter<T>> {
>          output: T,
>          metadata: &Metadata,
>          payload_output: Option<T>,
> +        cli_params: Option<&[u8]>,
>      ) -> io::Result<Encoder<'a, TokioWriter<T>>> {
>          Encoder::new(
>              TokioWriter::new(output),
>              metadata,
>              payload_output.map(|payload_output| TokioWriter::new(payload_output)),
> +            cli_params,
>          )
>          .await
>      }
> @@ -46,6 +48,7 @@ impl<'a> Encoder<'a, TokioWriter<tokio::fs::File>> {
>              TokioWriter::new(tokio::fs::File::create(path.as_ref()).await?),
>              metadata,
>              None,
> +            None,
>          )
>          .await
>      }
> @@ -57,9 +60,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
>          output: T,
>          metadata: &Metadata,
>          payload_output: Option<T>,
> +        cli_params: Option<&[u8]>,
>      ) -> io::Result<Encoder<'a, T>> {
>          Ok(Self {
> -            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output).await?,
> +            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output, cli_params)
> +                .await?,
>          })
>      }
>  
> @@ -331,10 +336,14 @@ mod test {
>      /// Assert that `Encoder` is `Send`
>      fn send_test() {
>          let test = async {
> -            let mut encoder =
> -                Encoder::new(DummyOutput, &Metadata::dir_builder(0o700).build(), None)
> -                    .await
> -                    .unwrap();
> +            let mut encoder = Encoder::new(
> +                DummyOutput,
> +                &Metadata::dir_builder(0o700).build(),
> +                None,
> +                None,
> +            )
> +            .await
> +            .unwrap();
>              {
>                  encoder
>                      .create_directory("baba", &Metadata::dir_builder(0o700).build())
> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
> index 9270153..b0ec877 100644
> --- a/src/encoder/mod.rs
> +++ b/src/encoder/mod.rs
> @@ -316,6 +316,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          output: EncoderOutput<'a, T>,
>          metadata: &Metadata,
>          mut payload_output: Option<T>,
> +        cli_params: Option<&[u8]>,
>      ) -> io::Result<EncoderImpl<'a, T>> {
>          if !metadata.is_dir() {
>              io_bail!("directory metadata must contain the directory mode flag");
> @@ -343,6 +344,9 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          };
>  
>          this.encode_format_version().await?;
> +        if let Some(params) = cli_params {
> +            this.encode_cli_params(params).await?;
> +        }
>          this.encode_metadata(metadata).await?;
>          let state = this.state_mut()?;
>          state.files_offset = state.position();
> @@ -740,16 +744,38 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>          Ok(())
>      }
>  
> +    async fn encode_cli_params(&mut self, params: &[u8]) -> io::Result<()> {
> +        if self.version == FormatVersion::Version1 {
> +            io_bail!("encoding cli params not supported pxar format version 1");

nit: missing "in" or "for" or "with"

> +        }
> +
> +        let (output, state) = self.output_state()?;
> +        if state.write_position != (size_of::<u64>() + size_of::<format::Header>()) as u64 {

this seems brittle, shouldn't it explicitly use the size of a
FormatVersion entry?

this and the similar check for the version introduced in the previous
patch smell a bit like "we actually have a state machine here but
pretend not to" :) for the payload archive, we also have a very simple
one: start_marker (1) -> payload entry (0..N) -> tail_marker (1) that is
not enforced atm (as in, nothing stops a bug from writing other entry
types, or additional start/tail markers, or .. to the payload output).
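for the first part, e.g. (sketch, the constant name is made up):

```rust
/// Size of an encoded format version entry: one Header plus a u64 payload.
const FORMAT_VERSION_ENTRY_LEN: u64 =
    (size_of::<format::Header>() + size_of::<u64>()) as u64;

if state.write_position != FORMAT_VERSION_ENTRY_LEN {
    // same bail as before, just with the expectation spelled out
    io_bail!(
        "cli params must be encoded following the version header, current position {}",
        state.write_position,
    );
}
```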

> +            io_bail!(
> +                "cli params must be encoded following the version header, current position {}",
> +                state.write_position,
> +            );
> +        }
> +
> +        seq_write_pxar_entry(
> +            output,
> +            format::PXAR_CLI_PARAMS,
> +            params,
> +            &mut state.write_position,
> +        )
> +        .await
> +    }
> +
>      async fn encode_format_version(&mut self) -> io::Result<()> {
> -		let version_bytes = match self.version {
> -			format::FormatVersion::Version1 => return Ok(()),
> -			format::FormatVersion::Version2 => 2u64.to_le_bytes(),
> -		};
> +        let version_bytes = match self.version {
> +            format::FormatVersion::Version1 => return Ok(()),
> +            format::FormatVersion::Version2 => 2u64.to_le_bytes(),
> +        };

cargo fmt?

>  
>          let (output, state) = self.output_state()?;
> -		if state.write_position != 0 {
> -			io_bail!("pxar format version must be encoded at the beginning of an archive");
> -		}
> +        if state.write_position != 0 {
> +            io_bail!("pxar format version must be encoded at the beginning of an archive");
> +        }

cargo fmt?

>  
>          seq_write_pxar_entry(
>              output,
> diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
> index a6e16f4..3f706c1 100644
> --- a/src/encoder/sync.rs
> +++ b/src/encoder/sync.rs
> @@ -28,7 +28,7 @@ impl<'a, T: io::Write + 'a> Encoder<'a, StandardWriter<T>> {
>      /// Encode a `pxar` archive into a regular `std::io::Write` output.
>      #[inline]
>      pub fn from_std(output: T, metadata: &Metadata) -> io::Result<Encoder<'a, StandardWriter<T>>> {
> -        Encoder::new(StandardWriter::new(output), metadata, None)
> +        Encoder::new(StandardWriter::new(output), metadata, None, None)
>      }
>  }
>  
> @@ -42,6 +42,7 @@ impl<'a> Encoder<'a, StandardWriter<std::fs::File>> {
>              StandardWriter::new(std::fs::File::create(path.as_ref())?),
>              metadata,
>              None,
> +            None,
>          )
>      }
>  }
> @@ -53,12 +54,18 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
>      /// not allowed to use the `Waker`, as this will cause a `panic!`.
>      // Optionally attach a dedicated writer to redirect the payloads of regular files to a separate
>      // output.
> -    pub fn new(output: T, metadata: &Metadata, payload_output: Option<T>) -> io::Result<Self> {
> +    pub fn new(
> +        output: T,
> +        metadata: &Metadata,
> +        payload_output: Option<T>,
> +        cli_params: Option<&[u8]>,
> +    ) -> io::Result<Self> {
>          Ok(Self {
>              inner: poll_result_once(encoder::EncoderImpl::new(
>                  output.into(),
>                  metadata,
>                  payload_output,
> +                cli_params,
>              ))?,
>          })
>      }
> diff --git a/src/format/mod.rs b/src/format/mod.rs
> index 2bf33c9..82ef196 100644
> --- a/src/format/mod.rs
> +++ b/src/format/mod.rs
> @@ -87,6 +87,7 @@ pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
>  pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
>  /// Previous version of the entry struct
>  pub const PXAR_ENTRY_V1: u64 = 0x11da850a1c1cceff;
> +pub const PXAR_CLI_PARAMS: u64 = 0xcf58b7dd627f604a;
>  pub const PXAR_FILENAME: u64 = 0x16701121063917b3;
>  pub const PXAR_SYMLINK: u64 = 0x27f971e7dbf5dc5f;
>  pub const PXAR_DEVICE: u64 = 0x9fc9e906586d5ce9;
> @@ -147,6 +148,7 @@ impl Header {
>      #[inline]
>      pub fn max_content_size(&self) -> u64 {
>          match self.htype {
> +            PXAR_CLI_PARAMS => u64::MAX - (size_of::<Self>() as u64),
>              // + null-termination
>              PXAR_FILENAME => crate::util::MAX_FILENAME_LEN + 1,
>              // + null-termination
> @@ -190,6 +192,7 @@ impl Display for Header {
>      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
>          let readable = match self.htype {
>              PXAR_FORMAT_VERSION => "FORMAT_VERSION",
> +            PXAR_CLI_PARAMS => "CLI_PARAMS",
>              PXAR_FILENAME => "FILENAME",
>              PXAR_SYMLINK => "SYMLINK",
>              PXAR_HARDLINK => "HARDLINK",
> @@ -694,6 +697,29 @@ impl Device {
>      }
>  }
>  
> +#[derive(Clone, Debug)]
> +pub struct CliParams {
> +    pub data: Vec<u8>,
> +}
> +
> +impl CliParams {
> +    pub fn as_os_str(&self) -> &OsStr {
> +        self.as_ref()
> +    }
> +}
> +
> +impl AsRef<[u8]> for CliParams {
> +    fn as_ref(&self) -> &[u8] {
> +        &self.data
> +    }
> +}
> +
> +impl AsRef<OsStr> for CliParams {
> +    fn as_ref(&self) -> &OsStr {
> +        OsStr::from_bytes(&self.data[..self.data.len().max(1) - 1])
> +    }
> +}
> +
>  #[cfg(all(test, target_os = "linux"))]
>  #[test]
>  fn test_linux_devices() {
> diff --git a/src/lib.rs b/src/lib.rs
> index a87b5ac..cc85759 100644
> --- a/src/lib.rs
> +++ b/src/lib.rs
> @@ -345,6 +345,9 @@ pub enum EntryKind {
>      /// Pxar file format version
>      Version(format::FormatVersion),
>  
> +    /// Cli parameter.
> +    CliParams(format::CliParams),
> +
>      /// Symbolic links.
>      Symlink(format::Symlink),
>  
> -- 
> 2.39.2
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-04-03 10:38   ` Fabian Grünbichler
  2024-04-03 11:47     ` Christian Ebner
@ 2024-04-03 12:18     ` Christian Ebner
  2024-04-04  8:46       ` Fabian Grünbichler
  1 sibling, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 12:18 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 12:38, Fabian Grünbichler wrote:
>> +
>> +                if let Some(payload_input) = self.payload_input.as_mut() {
> 
> this condition (cted below)
> 
>> +                    if seq_read_position(payload_input)
>> +                        .await
>> +                        .transpose()?
>> +                        .is_none()
>> +                    {
>> +                        // Skip payload padding for injected chunks in sequential decoder
>> +                        let to_skip = payload_ref.offset - self.payload_consumed;
> 
> should we add a check here for the invariant that offsets should only
> ever be increasing? (and avoid an underflow for corrupt/invalid archives
> ;))

This is called by both the sequential and the random access decoder 
instances, so that will not be possible, I guess.

> 
>> +                        self.skip_payload(to_skip).await?;
>> +                    }
>> +                }
>> +
>> +                if let Some(payload_input) = self.payload_input.as_mut() {
> 
> and this condition here are the same?

While this seems like just a duplicate, it makes the borrow checker 
happy, as otherwise it complains that the &mut self borrow of the 
skip_payload call and the following seq_read_entry call taking the 
payload_input are in conflict.
I am happy for any hint on how to make the borrow checker happy without 
having to perform the if check twice.
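The only direction I see so far would be to route the skip through the
field borrow itself, e.g. via the `Self::skip` helper from patch 04
(untested sketch):

```rust
if let Some(payload_input) = self.payload_input.as_mut() {
    if seq_read_position(payload_input)
        .await
        .transpose()?
        .is_none()
    {
        // Self::skip only borrows the payload input, not all of
        // &mut self, so the borrow can stay alive for the reads below
        let to_skip = payload_ref.offset - self.payload_consumed;
        Self::skip(payload_input, to_skip as usize).await?;
        self.payload_consumed += to_skip;
    }

    let header: Header = seq_read_entry(payload_input).await?;
    // ... header and size checks continue with the same borrow
}
```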
> 
>> +                    let header: u64 = seq_read_entry(payload_input).await?;
> 
> why not read a Header here?

Yeah, definitely better to read the full header, and then check against 
the htype and content_size(). Will change this for the next version.

> 
> 
> then these could use the size helpers of Header ;)
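So roughly like this, I suppose (a sketch, relying on Header's
`content_size()` helper):

```rust
let header: Header = seq_read_entry(payload_input).await?;
if header.htype != format::PXAR_PAYLOAD {
    io_bail!(
        "unexpected header in payload input: expected {:x}, got {:x}",
        format::PXAR_PAYLOAD,
        header.htype,
    );
}
self.payload_consumed += size_of::<Header>() as u64;

// content_size() already subtracts the header size for us
if header.content_size() != payload_ref.size {
    io_bail!(
        "encountered payload size mismatch: expected {}, got {}",
        payload_ref.size,
        header.content_size(),
    );
}
```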




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry
  2024-04-03 11:41   ` Fabian Grünbichler
@ 2024-04-03 13:31     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 13:31 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 13:41, Fabian Grünbichler wrote:
>> @@ -258,7 +261,16 @@ impl<I: SeqRead> DecoderImpl<I> {
>>           loop {
>>               match self.state {
>>                   State::Eof => return Ok(None),
>> -                State::Begin => return self.read_next_entry().await.map(Some),
>> +                State::Begin => {
>> +                    let entry = self.read_next_entry().await.map(Some);
>> +                    if let Ok(Some(ref entry)) = entry {
>> +                        if let EntryKind::Version(version) = entry.kind() {
>> +                            self.version = version.clone();
>> +                            return self.read_next_entry().await.map(Some);
>> +                        }
>> +                    }
>> +                    return entry;
> 
> a bit unsure here, if we want to enforce the order, wouldn't it be more
> clean to transition to a new state here rather than adding more nested
> ifs over time? ;)

Okay, agreed, will try to de-clutter this a bit by adding a Prelude 
state as suggested in your other reply.
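Roughly what I have in mind (a sketch, glossing over the fact that
read_next_entry() also updates self.state internally):

```rust
enum State {
    Begin,
    /// Directly after the format version entry: only prelude entries
    /// (e.g. PXAR_CLI_PARAMS) may show up before the first regular ENTRY.
    Prelude,
    Default,
    // ... remaining variants unchanged
}

// in the next_do() loop:
State::Begin => {
    let entry = self.read_next_entry().await?;
    if let EntryKind::Version(version) = entry.kind() {
        self.version = version.clone();
        self.state = State::Prelude;
        continue;
    }
    return Ok(Some(entry));
}
State::Prelude => {
    let entry = self.read_next_entry().await?;
    if matches!(entry.kind(), EntryKind::CliParams(_)) {
        // consume (or collect) prelude-only entries; duplicates or
        // ordering violations could be rejected right here
        self.state = State::Prelude;
        continue;
    }
    return Ok(Some(entry));
}
```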

>> +            1u64 => Ok(format::FormatVersion::Version1),
> 
> this should never happen though, right?

No, right... Will remove it.

> 
>> +            2u64 => Ok(format::FormatVersion::Version2),
> 
> also this (cted below)
> 
>> +            _ => io_bail!("unexpected pxar format version"),
> 
> this should maybe include the value? ;)

Okay, will add that.
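Combined with dropping the version 1 arm, probably something like (sketch):

```rust
async fn read_format_version(&mut self) -> io::Result<format::FormatVersion> {
    let version: u64 = seq_read_entry(&mut self.input).await?;
    match version {
        // version 1 archives carry no version entry at all, so only
        // version 2 (and future versions) are valid here
        2 => Ok(format::FormatVersion::Version2),
        version => io_bail!("unexpected pxar format version {version}"),
    }
}
```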

> 
>> +        }
>> +    }
>>   }
>>   
>>   /// Reader for file contents inside a pxar archive.
>> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
>> index 88c0ed5..9270153 100644
>> --- a/src/encoder/mod.rs
>> +++ b/src/encoder/mod.rs
>> @@ -17,7 +17,7 @@ use endian_trait::Endian;
>>   
>>   use crate::binary_tree_array;
>>   use crate::decoder::{self, SeqRead};
>> -use crate::format::{self, GoodbyeItem, PayloadRef};
>> +use crate::format::{self, FormatVersion, GoodbyeItem, PayloadRef};
>>   use crate::Metadata;
>>   
>>   pub mod aio;
>> @@ -307,6 +307,8 @@ pub(crate) struct EncoderImpl<'a, T: SeqWrite + 'a> {
>>       /// Since only the "current" entry can be actively writing files, we share the file copy
>>       /// buffer.
>>       file_copy_buffer: Arc<Mutex<Vec<u8>>>,
>> +    /// Pxar format version to encode
>> +    version: format::FormatVersion,
>>   }
>>   
>>   impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>> @@ -320,11 +322,14 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           }
>>   
>>           let mut state = EncoderState::default();
>> -        if let Some(payload_output) = payload_output.as_mut() {
>> +        let version = if let Some(payload_output) = payload_output.as_mut() {
>>               let header = format::Header::with_content_size(format::PXAR_PAYLOAD_START_MARKER, 0);
>>               header.check_header_size()?;
>>               seq_write_struct(payload_output, header, &mut state.payload_write_position).await?;
>> -        }
>> +            format::FormatVersion::Version2
>> +        } else {
>> +            format::FormatVersion::default()
> 
> shouldn't this be Version1 instead of default()? they are the same
> *now*, but that might not be the case forever?

Okay, will set this to the new version.

> 
>> +        };
>>   
>>           let mut this = Self {
>>               output,
>> @@ -334,8 +339,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>               file_copy_buffer: Arc::new(Mutex::new(unsafe {
>>                   crate::util::vec_new_uninitialized(1024 * 1024)
>>               })),
>> +            version,
>>           };
>>   
>> +        this.encode_format_version().await?;
>>           this.encode_metadata(metadata).await?;
>>           let state = this.state_mut()?;
>>           state.files_offset = state.position();
>> @@ -522,6 +529,10 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           file_size: u64,
>>           payload_offset: PayloadOffset,
>>       ) -> io::Result<()> {
>> +        if self.version == FormatVersion::Version1 {
>> +            io_bail!("payload references not supported pxar format version 1");
>> +        }
>> +
>>           if self.payload_output.as_mut().is_none() {
>>               io_bail!("unable to add payload reference");
>>           }
>> @@ -729,6 +740,26 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           Ok(())
>>       }
>>   
>> +    async fn encode_format_version(&mut self) -> io::Result<()> {
>> +		let version_bytes = match self.version {
>> +			format::FormatVersion::Version1 => return Ok(()),
>> +			format::FormatVersion::Version2 => 2u64.to_le_bytes(),
> 
> (cted from above) and this here should maybe go together?

Not sure I understand: the above is in the decoder, this is in the 
encoder... What was your intention? Move the encoding/decoding to the 
FormatVersion type?
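
If that is the direction you meant, something along these lines could
keep encoder and decoder in sync (just a sketch):

    impl TryFrom<u64> for FormatVersion {
        type Error = std::io::Error;

        fn try_from(version: u64) -> Result<Self, Self::Error> {
            match version {
                // version 1 is implicit, there is no version entry to decode
                2u64 => Ok(FormatVersion::Version2),
                version => io_bail!("unexpected pxar format version: {version}"),
            }
        }
    }

    impl From<&FormatVersion> for u64 {
        fn from(version: &FormatVersion) -> u64 {
            match version {
                FormatVersion::Version1 => 1,
                FormatVersion::Version2 => 2,
            }
        }
    }

The encoder would then write `u64::from(&self.version).to_le_bytes()`.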

> 
>> +		};
>> +
>> +        let (output, state) = self.output_state()?;
>> +		if state.write_position != 0 {
>> +			io_bail!("pxar format version must be encoded at the beginning of an archive");
> 
> should this also be enforced while decoding?

Okay, will see about adding a check for that as well.

> 
> should we also encode a/the version of the payload archive?

The payload archive has the PAYLOAD_START_MARKER, which can be changed 
just like the pxar entry type was for v2?

> 
>> +		}
>> +
>> +        seq_write_pxar_entry(
>> +            output,
>> +            format::PXAR_FORMAT_VERSION,
>> +            &version_bytes,
>> +            &mut state.write_position,
>> +        )
>> +        .await
>> +    }
>> +
>>       async fn encode_metadata(&mut self, metadata: &Metadata) -> io::Result<()> {
>>           let (output, state) = self.output_state()?;
>>           seq_write_pxar_struct_entry(
>> diff --git a/src/format/mod.rs b/src/format/mod.rs
>> index a672d19..2bf33c9 100644
>> --- a/src/format/mod.rs
>> +++ b/src/format/mod.rs
>> @@ -6,6 +6,7 @@
>>   //! item data.
>>   //!
>>   //! An archive contains items in the following order:
>> +//!  * `FORMAT_VERSION`     -- (optional for v1), version of encoding format
>>   //!  * `ENTRY`              -- containing general stat() data and related bits
>>   //!   * `XATTR`             -- one extended attribute
>>   //!   * ...                 -- more of these when there are multiple defined
>> @@ -80,6 +81,8 @@ pub mod mode {
>>   }
>>   
>>   // Generated by `cargo run --example mk-format-hashes`
>> +/// Pxar format version entry, fallback to version 1 if not present
>> +pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
>>   /// Beginning of an entry (current version).
>>   pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
>>   /// Previous version of the entry struct
>> @@ -186,6 +189,7 @@ impl Header {
>>   impl Display for Header {
>>       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
>>           let readable = match self.htype {
>> +            PXAR_FORMAT_VERSION => "FORMAT_VERSION",
>>               PXAR_FILENAME => "FILENAME",
>>               PXAR_SYMLINK => "SYMLINK",
>>               PXAR_HARDLINK => "HARDLINK",
>> @@ -551,6 +555,13 @@ impl From<&std::fs::Metadata> for Stat {
>>       }
>>   }
>>   
>> +#[derive(Clone, Debug, Default, PartialEq)]
>> +pub enum FormatVersion {
>> +    #[default]
>> +    Version1,
>> +    Version2,
>> +}
>> +
>>   #[derive(Clone, Debug)]
>>   pub struct Filename {
>>       pub name: Vec<u8>,
>> diff --git a/src/lib.rs b/src/lib.rs
>> index ef81a85..a87b5ac 100644
>> --- a/src/lib.rs
>> +++ b/src/lib.rs
>> @@ -342,6 +342,9 @@ impl Acl {
>>   /// Identifies whether the entry is a file, symlink, directory, etc.
>>   #[derive(Clone, Debug)]
>>   pub enum EntryKind {
>> +    /// Pxar file format version
>> +    Version(format::FormatVersion),
>> +
>>       /// Symbolic links.
>>       Symlink(format::Symlink),
>>   
>> -- 
>> 2.39.2
>>
>>
>>
>> _______________________________________________
>> pbs-devel mailing list
>> pbs-devel@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params
  2024-04-03 12:01   ` Fabian Grünbichler
@ 2024-04-03 14:41     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-03 14:41 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/3/24 14:01, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Add an additional entry type PXAR_CLI_PARAMS which is used to store
>> additional metadata passed by the cli arguments such as the pxar cli
>> exclude patterns.
>>
>> The content is encoded as an arbitrary byte slice. The entry must be
>> encoded right after the pxar format version entry, it is not possible to
>> encode this with the previous format version 1.
> 
> since (from pxar's perspective) this is just an opaque blob of data,
> isn't PXAR_CLI_PARAMS a bit of a misnomer? do we want a single blob, or
> multiple delineated ones (might be more handy for the client using it)?

This is in general still rather open for discussion and not decided 
upon. I added it as an opaque blob for now, not wanting to force any 
format on this from the pxar perspective.
As is, this is only used to store the pxar exclude cli parameters during 
backup creation via the proxmox-backup-client, as encoding them as a 
regular file broke the reusability logic there.

The idea was to possibly add more stuff in here, e.g. further cli 
parameters and other metadata, without pxar having to know about the 
content format.

So should I just rename this to e.g. PXAR_PRELUDE?

> 
> this whole hunk just reverts a change done earlier in the same series
> (forgotten `cargo fmt` for the first patch maybe? ;))

Yeah, seems I squashed this into the wrong patch, will be fixed in the 
next version.

> 
>> diff --git a/src/decoder/mod.rs b/src/decoder/mod.rs
>> index 5b2fafb..4170b2f 100644
>> --- a/src/decoder/mod.rs
>> +++ b/src/decoder/mod.rs
>> @@ -266,7 +266,13 @@ impl<I: SeqRead> DecoderImpl<I> {
>>                       if let Ok(Some(ref entry)) = entry {
>>                           if let EntryKind::Version(version) = entry.kind() {
>>                               self.version = version.clone();
>> -                            return self.read_next_entry().await.map(Some);
>> +                            let entry = self.read_next_entry().await.map(Some);
>> +                            if let Ok(Some(ref entry)) = entry {
>> +                                if let EntryKind::CliParams(_) = entry.kind() {
>> +                                    return self.read_next_entry().await.map(Some);
>> +                                }
>> +                            }
>> +                            return entry;
> 
> so maybe we want a new State::Prelude or something that we transition to
> from Begin if we encounter a FormatVersion, then we can match this and
> future "special" entries before proceeding with the regular archive?

Ack, I already added this based on the comment in the reply to the 
other patch.

> 
>>                           }
>>                       }
>>                       return entry;
>> @@ -429,6 +435,11 @@ impl<I: SeqRead> DecoderImpl<I> {
>>               self.current_header = header;
>>               self.entry.kind = EntryKind::Version(self.read_format_version().await?);
>>   
>> +            Ok(Some(self.entry.take()))
>> +        } else if header.htype == format::PXAR_CLI_PARAMS {
>> +            self.current_header = header;
>> +            self.entry.kind = EntryKind::CliParams(self.read_cli_params().await?);
>> +
> 
> and here (well, not here, at the start of read_next_entry_or_eof ;)) we
> should maybe save the previous state before setting it to Default, so
> that we can then check it for some header types like FormatVersion or
> CliParams to ensure a misconstructed input cannot confuse our state
> machine/decoder?

Hmm, yes, that will do. Will add such a check in the next version of the 
patches.
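
Roughly like this (sketch; `State` already implements `Default`):

    // at the start of read_next_entry_or_eof
    let previous_state = std::mem::take(&mut self.state);

    // ... and when matching the header type:
    if header.htype == format::PXAR_FORMAT_VERSION
        && !matches!(previous_state, State::Begin)
    {
        io_bail!("misplaced format version entry in archive");
    }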

> 
>>               Ok(Some(self.entry.take()))
>>           } else if header.htype == format::PXAR_ENTRY || header.htype == format::PXAR_ENTRY_V1 {
>>               if header.htype == format::PXAR_ENTRY {
>> @@ -802,6 +813,11 @@ impl<I: SeqRead> DecoderImpl<I> {
>>               _ => io_bail!("unexpected pxar format version"),
>>           }
>>       }
>> +
>> +    async fn read_cli_params(&mut self) -> io::Result<format::CliParams> {
>> +        let data = self.read_entry_as_bytes().await?;
>> +        Ok(format::CliParams { data })
>> +    }
>>   }
>>   
>>   /// Reader for file contents inside a pxar archive.
>> diff --git a/src/encoder/aio.rs b/src/encoder/aio.rs
>> index 6da32bd..956b2a3 100644
>> --- a/src/encoder/aio.rs
>> +++ b/src/encoder/aio.rs
>> @@ -25,11 +25,13 @@ impl<'a, T: tokio::io::AsyncWrite + 'a> Encoder<'a, TokioWriter<T>> {
>>           output: T,
>>           metadata: &Metadata,
>>           payload_output: Option<T>,
>> +        cli_params: Option<&[u8]>,
>>       ) -> io::Result<Encoder<'a, TokioWriter<T>>> {
>>           Encoder::new(
>>               TokioWriter::new(output),
>>               metadata,
>>               payload_output.map(|payload_output| TokioWriter::new(payload_output)),
>> +            cli_params,
>>           )
>>           .await
>>       }
>> @@ -46,6 +48,7 @@ impl<'a> Encoder<'a, TokioWriter<tokio::fs::File>> {
>>               TokioWriter::new(tokio::fs::File::create(path.as_ref()).await?),
>>               metadata,
>>               None,
>> +            None,
>>           )
>>           .await
>>       }
>> @@ -57,9 +60,11 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
>>           output: T,
>>           metadata: &Metadata,
>>           payload_output: Option<T>,
>> +        cli_params: Option<&[u8]>,
>>       ) -> io::Result<Encoder<'a, T>> {
>>           Ok(Self {
>> -            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output).await?,
>> +            inner: encoder::EncoderImpl::new(output.into(), metadata, payload_output, cli_params)
>> +                .await?,
>>           })
>>       }
>>   
>> @@ -331,10 +336,14 @@ mod test {
>>       /// Assert that `Encoder` is `Send`
>>       fn send_test() {
>>           let test = async {
>> -            let mut encoder =
>> -                Encoder::new(DummyOutput, &Metadata::dir_builder(0o700).build(), None)
>> -                    .await
>> -                    .unwrap();
>> +            let mut encoder = Encoder::new(
>> +                DummyOutput,
>> +                &Metadata::dir_builder(0o700).build(),
>> +                None,
>> +                None,
>> +            )
>> +            .await
>> +            .unwrap();
>>               {
>>                   encoder
>>                       .create_directory("baba", &Metadata::dir_builder(0o700).build())
>> diff --git a/src/encoder/mod.rs b/src/encoder/mod.rs
>> index 9270153..b0ec877 100644
>> --- a/src/encoder/mod.rs
>> +++ b/src/encoder/mod.rs
>> @@ -316,6 +316,7 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           output: EncoderOutput<'a, T>,
>>           metadata: &Metadata,
>>           mut payload_output: Option<T>,
>> +        cli_params: Option<&[u8]>,
>>       ) -> io::Result<EncoderImpl<'a, T>> {
>>           if !metadata.is_dir() {
>>               io_bail!("directory metadata must contain the directory mode flag");
>> @@ -343,6 +344,9 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           };
>>   
>>           this.encode_format_version().await?;
>> +        if let Some(params) = cli_params {
>> +            this.encode_cli_params(params).await?;
>> +        }
>>           this.encode_metadata(metadata).await?;
>>           let state = this.state_mut()?;
>>           state.files_offset = state.position();
>> @@ -740,16 +744,38 @@ impl<'a, T: SeqWrite + 'a> EncoderImpl<'a, T> {
>>           Ok(())
>>       }
>>   
>> +    async fn encode_cli_params(&mut self, params: &[u8]) -> io::Result<()> {
>> +        if self.version == FormatVersion::Version1 {
>> +            io_bail!("encoding cli params not supported pxar format version 1");
> 
> nit: missing "in" or "for" or "with"

ack

> 
>> +        }
>> +
>> +        let (output, state) = self.output_state()?;
>> +        if state.write_position != (size_of::<u64>() + size_of::<format::Header>()) as u64 {
> 
> this seems brittle, shouldn't it explicitly use the size of a
> FormatVersion entry?

Will double check, but afaik the size_of for format::FormatVersion was 
not correct since that is an enum, while the version number is encoded 
as a u64.

> 
> this and the similar check for the version introduced in the previous
> patch smell a bit like "we actually have a state machine here but
> pretend not to" :) for the payload archive, we also have a very simple
> one: start_marker (1) -> payload entry (0..N) -> tail_marker (1) that is
> not enforced atm (as in, nothing stops a bug from writing other entry
> types, or additional start/tail markers, or .. to the payload output).

I agree, however for the encoder we do not have a state machine. I am 
open for suggestions on how to improve this!
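
One option that comes to mind would be a small phase enum on the 
encoder itself (very rough sketch, names made up):

    #[derive(PartialEq)]
    enum EncoderPhase {
        Start,
        Version,
        Prelude,
        Entries,
        Finished,
    }

    // each encode_* helper would then assert its expected predecessor
    // and advance the phase, e.g. via a method like:
    fn transition(&mut self, from: EncoderPhase, to: EncoderPhase) -> io::Result<()> {
        if self.phase != from {
            io_bail!("invalid encoder state transition");
        }
        self.phase = to;
        Ok(())
    }

That could also cover the start/tail marker ordering for the payload 
output you mention.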

>> +            io_bail!(
>> +                "cli params must be encoded following the version header, current position {}",
>> +                state.write_position,
>> +            );
>> +        }
>> +
>> +        seq_write_pxar_entry(
>> +            output,
>> +            format::PXAR_CLI_PARAMS,
>> +            params,
>> +            &mut state.write_position,
>> +        )
>> +        .await
>> +    }
>> +
>>       async fn encode_format_version(&mut self) -> io::Result<()> {
>> -		let version_bytes = match self.version {
>> -			format::FormatVersion::Version1 => return Ok(()),
>> -			format::FormatVersion::Version2 => 2u64.to_le_bytes(),
>> -		};
>> +        let version_bytes = match self.version {
>> +            format::FormatVersion::Version1 => return Ok(()),
>> +            format::FormatVersion::Version2 => 2u64.to_le_bytes(),
>> +        };
> 
> cargo fmt?

ack, incorrectly squashed, fixed in new version

> 
>>   
>>           let (output, state) = self.output_state()?;
>> -		if state.write_position != 0 {
>> -			io_bail!("pxar format version must be encoded at the beginning of an archive");
>> -		}
>> +        if state.write_position != 0 {
>> +            io_bail!("pxar format version must be encoded at the beginning of an archive");
>> +        }
> 
> cargo fmt?

ack, incorrectly squashed, fixed in new version

> 
>>   
>>           seq_write_pxar_entry(
>>               output,
>> diff --git a/src/encoder/sync.rs b/src/encoder/sync.rs
>> index a6e16f4..3f706c1 100644
>> --- a/src/encoder/sync.rs
>> +++ b/src/encoder/sync.rs
>> @@ -28,7 +28,7 @@ impl<'a, T: io::Write + 'a> Encoder<'a, StandardWriter<T>> {
>>       /// Encode a `pxar` archive into a regular `std::io::Write` output.
>>       #[inline]
>>       pub fn from_std(output: T, metadata: &Metadata) -> io::Result<Encoder<'a, StandardWriter<T>>> {
>> -        Encoder::new(StandardWriter::new(output), metadata, None)
>> +        Encoder::new(StandardWriter::new(output), metadata, None, None)
>>       }
>>   }
>>   
>> @@ -42,6 +42,7 @@ impl<'a> Encoder<'a, StandardWriter<std::fs::File>> {
>>               StandardWriter::new(std::fs::File::create(path.as_ref())?),
>>               metadata,
>>               None,
>> +            None,
>>           )
>>       }
>>   }
>> @@ -53,12 +54,18 @@ impl<'a, T: SeqWrite + 'a> Encoder<'a, T> {
>>       /// not allowed to use the `Waker`, as this will cause a `panic!`.
>>       // Optionally attach a dedicated writer to redirect the payloads of regular files to a separate
>>       // output.
>> -    pub fn new(output: T, metadata: &Metadata, payload_output: Option<T>) -> io::Result<Self> {
>> +    pub fn new(
>> +        output: T,
>> +        metadata: &Metadata,
>> +        payload_output: Option<T>,
>> +        cli_params: Option<&[u8]>,
>> +    ) -> io::Result<Self> {
>>           Ok(Self {
>>               inner: poll_result_once(encoder::EncoderImpl::new(
>>                   output.into(),
>>                   metadata,
>>                   payload_output,
>> +                cli_params,
>>               ))?,
>>           })
>>       }
>> diff --git a/src/format/mod.rs b/src/format/mod.rs
>> index 2bf33c9..82ef196 100644
>> --- a/src/format/mod.rs
>> +++ b/src/format/mod.rs
>> @@ -87,6 +87,7 @@ pub const PXAR_FORMAT_VERSION: u64 = 0x730f6c75df16a40d;
>>   pub const PXAR_ENTRY: u64 = 0xd5956474e588acef;
>>   /// Previous version of the entry struct
>>   pub const PXAR_ENTRY_V1: u64 = 0x11da850a1c1cceff;
>> +pub const PXAR_CLI_PARAMS: u64 = 0xcf58b7dd627f604a;
>>   pub const PXAR_FILENAME: u64 = 0x16701121063917b3;
>>   pub const PXAR_SYMLINK: u64 = 0x27f971e7dbf5dc5f;
>>   pub const PXAR_DEVICE: u64 = 0x9fc9e906586d5ce9;
>> @@ -147,6 +148,7 @@ impl Header {
>>       #[inline]
>>       pub fn max_content_size(&self) -> u64 {
>>           match self.htype {
>> +            PXAR_CLI_PARAMS => u64::MAX - (size_of::<Self>() as u64),
>>               // + null-termination
>>               PXAR_FILENAME => crate::util::MAX_FILENAME_LEN + 1,
>>               // + null-termination
>> @@ -190,6 +192,7 @@ impl Display for Header {
>>       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
>>           let readable = match self.htype {
>>               PXAR_FORMAT_VERSION => "FORMAT_VERSION",
>> +            PXAR_CLI_PARAMS => "CLI_PARAMS",
>>               PXAR_FILENAME => "FILENAME",
>>               PXAR_SYMLINK => "SYMLINK",
>>               PXAR_HARDLINK => "HARDLINK",
>> @@ -694,6 +697,29 @@ impl Device {
>>       }
>>   }
>>   
>> +#[derive(Clone, Debug)]
>> +pub struct CliParams {
>> +    pub data: Vec<u8>,
>> +}
>> +
>> +impl CliParams {
>> +    pub fn as_os_str(&self) -> &OsStr {
>> +        self.as_ref()
>> +    }
>> +}
>> +
>> +impl AsRef<[u8]> for CliParams {
>> +    fn as_ref(&self) -> &[u8] {
>> +        &self.data
>> +    }
>> +}
>> +
>> +impl AsRef<OsStr> for CliParams {
>> +    fn as_ref(&self) -> &OsStr {
>> +        OsStr::from_bytes(&self.data[..self.data.len().max(1) - 1])
>> +    }
>> +}
>> +
>>   #[cfg(all(test, target_os = "linux"))]
>>   #[test]
>>   fn test_linux_devices() {
>> diff --git a/src/lib.rs b/src/lib.rs
>> index a87b5ac..cc85759 100644
>> --- a/src/lib.rs
>> +++ b/src/lib.rs
>> @@ -345,6 +345,9 @@ pub enum EntryKind {
>>       /// Pxar file format version
>>       Version(format::FormatVersion),
>>   
>> +    /// Cli parameter.
>> +    CliParams(format::CliParams),
>> +
>>       /// Symbolic links.
>>       Symlink(format::Symlink),
>>   
>> -- 
>> 2.39.2
>>
>>
>>
>> _______________________________________________
>> pbs-devel mailing list
>> pbs-devel@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-04-03 12:18     ` Christian Ebner
@ 2024-04-04  8:46       ` Fabian Grünbichler
  2024-04-04  9:49         ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  8:46 UTC (permalink / raw)
  To: Christian Ebner, Proxmox Backup Server development discussion

On April 3, 2024 2:18 pm, Christian Ebner wrote:
> On 4/3/24 12:38, Fabian Grünbichler wrote:
>>> +
>>> +                if let Some(payload_input) = self.payload_input.as_mut() {
>> 
>> this condition (cted below)
>> 
>>> +                    if seq_read_position(payload_input)
>>> +                        .await
>>> +                        .transpose()?
>>> +                        .is_none()
>>> +                    {
>>> +                        // Skip payload padding for injected chunks in sequential decoder
>>> +                        let to_skip = payload_ref.offset - self.payload_consumed;
>> 
>> should we add a check here for the invariant that offsets should only
>> ever be increasing? (and avoid an underflow for corrupt/invalid archives
>> ;))
> 
> This is called by both the seq and random access decoder instances, so 
> that will not be possible, I guess.

but.. payload_consumed only ever goes up? if the offset then jumps back
to a position before the payload_consumed counter, this will underflow
(to_skip is unsigned)?

>>> +                        self.skip_payload(to_skip).await?;
>>> +                    }
>>> +                }
>>> +
>>> +                if let Some(payload_input) = self.payload_input.as_mut() {
>> 
>> and this condition here are the same?
> 
> While this looks like a duplicate, it makes the borrow checker happy, 
> as otherwise it complains that the &mut self borrow of the skip_payload 
> call and the following seq_read_entry call taking the payload_input are 
> in conflict.
> I am happy for any hint on how to make the borrow checker happy without 
> having to perform the if check twice.

ah, yeah, that makes sense.. I think the only way to avoid it is to
inline skip_payload here (it's the only call site anyway).
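
i.e. roughly (sketch, with the comment standing in for the current body
of skip_payload):

    if let Some(payload_input) = self.payload_input.as_mut() {
        if seq_read_position(payload_input).await.transpose()?.is_none() {
            let to_skip = payload_ref.offset - self.payload_consumed;
            // body of skip_payload inlined here, consuming to_skip
            // bytes from the already borrowed payload_input instead
            // of going through &mut self again
        }
        // payload_input is still borrowed here, so the follow-up read
        // can use it without repeating the `if let`
    }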




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking
  2024-04-03 11:01     ` Christian Ebner
@ 2024-04-04  8:48       ` Fabian Grünbichler
  2024-04-04  9:04         ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  8:48 UTC (permalink / raw)
  To: Christian Ebner, Proxmox Backup Server development discussion

On April 3, 2024 1:01 pm, Christian Ebner wrote:
> On 4/3/24 11:54, Fabian Grünbichler wrote:
>> 
>> should we still have some sort of checks here? e.g., when dropping an
>> encoder, what should self.finished and self.state look like? IIUC, then a
>> dropped encoder should have an empty state and be finished (i.e.,
>> `close()` has been called on it).
>> 
>> or is this simply not relevant anymore because we only create one and
>> then drop it at the end (but should we then have a similar mechanism for
>> EncoderState?)
> 
> The encoder should now be consumed with the `close` call, which takes 
> ownership of the encoder and drops it afterwards, so all the state 
> checks should happen there.
> 
> Previously, the encoder finish consumed the per-directory level encoder 
> object, passing possible errors up to the parent implementation, which 
> is not possible now since there is only one encoder instance. I did not 
> want to panic here as the checks should be done in the close now, so the 
> Drop implementation was removed.

but now the equivalent is the EncoderState (which is per directory).

> Not sure what to check in a Drop implementation for the EncoderState. What 
> did you have in mind for that? Note that errors get propagated to the 
> parent state in the encoder finish calls now.

well, basically that it is finished itself? i.e., in `finish` set a
flag, and in the Drop handler check that it is set. right now this is
the only place we `pop` the state from the state stack anyway, so it
should be okay, but who knows what future refactors bring.




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives Christian Ebner
@ 2024-04-04  9:01   ` Fabian Grünbichler
  2024-04-04  9:06     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  9:01 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - use mpxar and ppxar file extensions
> 
>  pbs-client/src/tools/mod.rs | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
> index 1b0123a39..08986fc5e 100644
> --- a/pbs-client/src/tools/mod.rs
> +++ b/pbs-client/src/tools/mod.rs
> @@ -337,7 +337,7 @@ pub fn complete_pxar_archive_name(arg: &str, param: &HashMap<String, String>) ->
>      complete_server_file_name(arg, param)
>          .iter()
>          .filter_map(|name| {
> -            if name.ends_with(".pxar.didx") {
> +            if name.ends_with(".pxar.didx") || name.ends_with(".pxar.meta.didx") {

this still has the old extension/naming scheme instead of the new one

>                  Some(pbs_tools::format::strip_server_file_extension(name).to_owned())
>              } else {
>                  None
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: cover meta extension for pxar archives
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: " Christian Ebner
@ 2024-04-04  9:02   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  9:02 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - use mpxar and ppxar file extensions
> - merged with file restore patch into one
> 
>  pbs-client/src/tools/mod.rs      |  5 ++++-
>  proxmox-file-restore/src/main.rs | 16 +++++++++++++---
>  2 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
> index 08986fc5e..f8d3102d1 100644
> --- a/pbs-client/src/tools/mod.rs
> +++ b/pbs-client/src/tools/mod.rs
> @@ -337,7 +337,10 @@ pub fn complete_pxar_archive_name(arg: &str, param: &HashMap<String, String>) ->
>      complete_server_file_name(arg, param)
>          .iter()
>          .filter_map(|name| {
> -            if name.ends_with(".pxar.didx") || name.ends_with(".pxar.meta.didx") {
> +            if name.ends_with(".pxar.didx")
> +                || name.ends_with(".mpxar.didx")
> +                || name.ends_with(".ppxar.didx")
> +            {

hit reply too soon on the previous patch - this hunk here should be
folded into that one ;)

>                  Some(pbs_tools::format::strip_server_file_extension(name).to_owned())
>              } else {
>                  None
> diff --git a/proxmox-file-restore/src/main.rs b/proxmox-file-restore/src/main.rs
> index dbab69942..685ce34d9 100644
> --- a/proxmox-file-restore/src/main.rs
> +++ b/proxmox-file-restore/src/main.rs
> @@ -75,7 +75,10 @@ fn parse_path(path: String, base64: bool) -> Result<ExtractPath, Error> {
>          (file, path)
>      };
>  
> -    if file.ends_with(".pxar.didx") {
> +    if file.ends_with(".pxar.didx")
> +        || file.ends_with(".mpxar.didx")
> +        || file.ends_with(".ppxar.didx")
> +    {
>          Ok(ExtractPath::Pxar(file, path))
>      } else if file.ends_with(".img.fidx") {
>          Ok(ExtractPath::VM(file, path))
> @@ -123,11 +126,18 @@ async fn list_files(
>          ExtractPath::ListArchives => {
>              let mut entries = vec![];
>              for file in manifest.files() {
> -                if !file.filename.ends_with(".pxar.didx") && !file.filename.ends_with(".img.fidx") {
> +                if !file.filename.ends_with(".pxar.didx")
> +                    && !file.filename.ends_with(".img.fidx")
> +                    && !file.filename.ends_with(".mpxar.didx")
> +                    && !file.filename.ends_with(".ppxar.didx")
> +                {
>                      continue;
>                  }
>                  let path = format!("/{}", file.filename);
> -                let attr = if file.filename.ends_with(".pxar.didx") {
> +                let attr = if file.filename.ends_with(".pxar.didx")
> +                    || file.filename.ends_with(".mpxar.didx")
> +                    || file.filename.ends_with(".ppxar.didx")
> +                {
>                      // a pxar file is a file archive, so it's root is also a directory root
>                      Some(&DirEntryAttribute::Directory { start: 0 })
>                  } else {
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking
  2024-04-04  8:48       ` Fabian Grünbichler
@ 2024-04-04  9:04         ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04  9:04 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox Backup Server development discussion

On 4/4/24 10:48, Fabian Grünbichler wrote:
> On April 3, 2024 1:01 pm, Christian Ebner wrote:
>> On 4/3/24 11:54, Fabian Grünbichler wrote:
>>>
>>> should we still have some sort of checks here? e.g., when dropping an
>>> encoder, what should self.finished and self.state look like? IIUC, then a
>>> dropped encoder should have an empty state and be finished (i.e.,
>>> `close()` has been called on it).
>>>
>>> or is this simply not relevant anymore because we only create one and
>>> then drop it at the end (but should we then have a similar mechanism for
>>> EncoderState?)
>>
>> The encoder should now be consumed with the `close` call, which takes
>> ownership of the encoder and drops it afterwards, so all the state
>> checks should happen there.
>>
>> Previously, the encoder finish consumed the per-directory level encoder
>> object, passing possible errors up to the parent implementation, which
>> is not possible now since there is only one encoder instance. I did not
>> want to panic here as the checks should be done in the close now, so the
>> Drop implementation was removed.
> 
> but now the equivalent is the EncoderState (which is per directory).
> 
>> Not sure what to check in a Drop implementation for the EncoderState. What
>> did you have in mind for that? Note that errors get propagated to the
>> parent state in the encoder finish calls now.
> 
> well, basically that it is finished itself? i.e., in `finish` set a
> flag, and in the Drop handler check that it is set. right now this is
> the only place we `pop` the state from the state stack anyway, so it
> should be okay, but who knows what future refactors bring.


Okay, so I will take any encoding errors stored in the state when 
finishing the directory in the `finish` calls and set a finished flag, 
and add a check in the `EncoderState`'s `Drop` implementation that all 
errors have been correctly passed along to the parent state (none 
present anymore) and that the finished flag is set. This requires a 
panic, however, if that is not the case.
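
Roughly like this (sketch, field names still tentative):

    impl Drop for EncoderState {
        fn drop(&mut self) {
            // a state must have been finished and its errors moved to
            // the parent before it may be dropped
            if !self.finished || self.encode_error.is_some() {
                panic!("dropped unfinished EncoderState");
            }
        }
    }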





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives
  2024-04-04  9:01   ` Fabian Grünbichler
@ 2024-04-04  9:06     ` Christian Ebner
  2024-04-04  9:10       ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-04  9:06 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 11:01, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - use mpxar and ppxar file extensions
>>
>>   pbs-client/src/tools/mod.rs | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
>> index 1b0123a39..08986fc5e 100644
>> --- a/pbs-client/src/tools/mod.rs
>> +++ b/pbs-client/src/tools/mod.rs
>> @@ -337,7 +337,7 @@ pub fn complete_pxar_archive_name(arg: &str, param: &HashMap<String, String>) ->
>>       complete_server_file_name(arg, param)
>>           .iter()
>>           .filter_map(|name| {
>> -            if name.ends_with(".pxar.didx") {
>> +            if name.ends_with(".pxar.didx") || name.ends_with(".pxar.meta.didx") {
> 
> this still has the old extension/naming scheme instead of the new one

Oh, thanks for noticing! Fixed it right away in the new version branch!





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives
  2024-04-04  9:06     ` Christian Ebner
@ 2024-04-04  9:10       ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04  9:10 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 11:06, Christian Ebner wrote:
> On 4/4/24 11:01, Fabian Grünbichler wrote:
>> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> changes since version 2:
>>> - use mpxar and ppxar file extensions
>>>
>>>   pbs-client/src/tools/mod.rs | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/pbs-client/src/tools/mod.rs b/pbs-client/src/tools/mod.rs
>>> index 1b0123a39..08986fc5e 100644
>>> --- a/pbs-client/src/tools/mod.rs
>>> +++ b/pbs-client/src/tools/mod.rs
>>> @@ -337,7 +337,7 @@ pub fn complete_pxar_archive_name(arg: &str, 
>>> param: &HashMap<String, String>) ->
>>>       complete_server_file_name(arg, param)
>>>           .iter()
>>>           .filter_map(|name| {
>>> -            if name.ends_with(".pxar.didx") {
>>> +            if name.ends_with(".pxar.didx") || 
>>> name.ends_with(".pxar.meta.didx") {
>>
>> this still has the old extension/naming scheme instead of the new one
> 
> Oh, thanks for noticing! Fixed it right away in the new version branch!

Actually, it is already fixed in a follow-up patch; I fixed up the 
patches accordingly.




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable Christian Ebner
@ 2024-04-04  9:43   ` Fabian Grünbichler
  2024-04-04 13:29     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  9:43 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Cover the cases where the pxar archive was uploaded as split payload
> data and metadata streams. Instantiate the required reader and
> decoder instances to access the metadata and payload data archives.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - use mpxar and ppxar file extensions
> - use helpers for archive names and fuse reader
> 
>  proxmox-backup-client/src/mount.rs | 54 ++++++++++++++++++++----------
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/proxmox-backup-client/src/mount.rs b/proxmox-backup-client/src/mount.rs
> index 4a2f83357..7bcd581be 100644
> --- a/proxmox-backup-client/src/mount.rs
> +++ b/proxmox-backup-client/src/mount.rs
> @@ -21,17 +21,16 @@ use pbs_api_types::BackupNamespace;
>  use pbs_client::tools::key_source::get_encryption_key_password;
>  use pbs_client::{BackupReader, RemoteChunkReader};
>  use pbs_datastore::cached_chunk_reader::CachedChunkReader;
> -use pbs_datastore::dynamic_index::BufferedDynamicReader;
>  use pbs_datastore::index::IndexFile;
>  use pbs_key_config::load_and_decrypt_key;
>  use pbs_tools::crypt_config::CryptConfig;
>  use pbs_tools::json::required_string_param;
>  
> +use crate::helper;
>  use crate::{
>      complete_group_or_snapshot, complete_img_archive_name, complete_namespace,
>      complete_pxar_archive_name, complete_repository, connect, dir_or_last_from_group,
> -    extract_repository_from_value, optional_ns_param, record_repository, BufferedDynamicReadAt,
> -    REPO_URL_SCHEMA,
> +    extract_repository_from_value, optional_ns_param, record_repository, REPO_URL_SCHEMA,
>  };
>  
>  #[sortable]
> @@ -219,7 +218,10 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>          }
>      };
>  
> -    let server_archive_name = if archive_name.ends_with(".pxar") {
> +    let server_archive_name = if archive_name.ends_with(".pxar")
> +        || archive_name.ends_with(".mpxar")
> +        || archive_name.ends_with(".ppxar")
> +    {
>          if target.is_none() {
>              bail!("use the 'mount' command to mount pxar archives");
>          }
> @@ -246,6 +248,16 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>      let (manifest, _) = client.download_manifest().await?;
>      manifest.check_fingerprint(crypt_config.as_ref().map(Arc::as_ref))?;
>  
> +    let base_name = server_archive_name
> +        .strip_suffix(".mpxar.didx")
> +        .or_else(|| server_archive_name.strip_suffix(".ppxar.didx"));
> +
> +    let server_archive_name = if let Some(base) = base_name {
> +        format!("{base}.mpxar.didx")
> +    } else {
> +        server_archive_name.to_owned()
> +    };
> +

should this be switched to the get_pxar_archive_names helper?

>      let file_info = manifest.lookup_file_info(&server_archive_name)?;
>  
>      let daemonize = || -> Result<(), Error> {
> @@ -283,20 +295,28 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>          futures::future::select(interrupt_int.recv().boxed(), interrupt_term.recv().boxed());
>  
>      if server_archive_name.ends_with(".didx") {
> -        let index = client
> -            .download_dynamic_index(&manifest, &server_archive_name)
> -            .await?;
> -        let most_used = index.find_most_used_chunks(8);
> -        let chunk_reader = RemoteChunkReader::new(
> +        let (archive_name, payload_archive_name) =
> +            helper::get_pxar_archive_names(&server_archive_name);

and then we can skip this one here ;)

> +        let (reader, archive_size) = helper::get_pxar_fuse_reader(
> +            &archive_name,
>              client.clone(),
> -            crypt_config,
> -            file_info.chunk_crypt_mode(),
> -            most_used,
> -        );
> -        let reader = BufferedDynamicReader::new(index, chunk_reader);
> -        let archive_size = reader.archive_size();
> -        let reader: pbs_pxar_fuse::Reader = Arc::new(BufferedDynamicReadAt::new(reader));
> -        let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
> +            &manifest,
> +            crypt_config.clone(),
> +        )
> +        .await?;
> +
> +        let decoder = if let Some(payload_archive_name) = payload_archive_name {
> +            let (payload_reader, _) = helper::get_pxar_fuse_reader(
> +                &payload_archive_name,
> +                client.clone(),
> +                &manifest,
> +                crypt_config.clone(),
> +            )
> +            .await?;
> +            pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
> +        } else {
> +            pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
> +        };
>  
>          let session =
>              pbs_pxar_fuse::Session::mount(decoder, options, false, Path::new(target.unwrap()))
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream
  2024-04-04  8:46       ` Fabian Grünbichler
@ 2024-04-04  9:49         ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04  9:49 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox Backup Server development discussion

On 4/4/24 10:46, Fabian Grünbichler wrote:
> On April 3, 2024 2:18 pm, Christian Ebner wrote:
>> On 4/3/24 12:38, Fabian Grünbichler wrote:
>>>> +
>>>> +                if let Some(payload_input) = self.payload_input.as_mut() {
>>>
>>> this condition (cted below)
>>>
>>>> +                    if seq_read_position(payload_input)
>>>> +                        .await
>>>> +                        .transpose()?
>>>> +                        .is_none()
>>>> +                    {
>>>> +                        // Skip payload padding for injected chunks in sequential decoder
>>>> +                        let to_skip = payload_ref.offset - self.payload_consumed;
>>>
>>> should we add a check here for the invariant that offsets should only
>>> ever be increasing? (and avoid an underflow for corrupt/invalid archives
>>> ;))
>>
>> This is called by both the seq and random access decoder instances, so
>> that will not be possible, I guess.
> 
> but.. payload_consumed only ever goes up? if the offset then jumps back
> to a position before the payload_consumed counter, this will underflow
> (to_skip is unsigned)?

Ah yes, will add the check for just the sequential decoder case...
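
Something like this (sketch):

    if payload_ref.offset < self.payload_consumed {
        io_bail!(
            "unexpected payload offset {}, smaller than already consumed {}",
            payload_ref.offset,
            self.payload_consumed,
        );
    }
    let to_skip = payload_ref.offset - self.payload_consumed;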





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams Christian Ebner
@ 2024-04-04  9:49   ` Fabian Grünbichler
  2024-04-04 15:52     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04  9:49 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Allows to attach to pxar archives with split metadata and payload
> streams, by redirecting the payload input to a dedicated reader
> accessing the payload index.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - use mpxar and ppxar file extensions
> - use pxar fuse reader helper
> 
>  proxmox-backup-client/src/catalog.rs | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/proxmox-backup-client/src/catalog.rs b/proxmox-backup-client/src/catalog.rs
> index 2073e058d..3e52880b9 100644
> --- a/proxmox-backup-client/src/catalog.rs
> +++ b/proxmox-backup-client/src/catalog.rs
> @@ -181,7 +181,10 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
>          }
>      };
>  
> -    let server_archive_name = if archive_name.ends_with(".pxar") {
> +    let server_archive_name = if archive_name.ends_with(".pxar")
> +        || archive_name.ends_with(".mpxar")
> +        || archive_name.ends_with(".ppxar")
> +    {

as with mount - there is a call to get_pxar_archive_names between this
hunk and the next one (introduced by the previous patch), shouldn't that
just be moved up here?

>          format!("{}.didx", archive_name)
>      } else {
>          bail!("Can only mount pxar archives.");
> @@ -216,7 +219,18 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
>      )
>      .await?;
>  
> -    let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
> +    let decoder = if let Some(payload_archive_name) = payload_archive_name {
> +        let (payload_reader, _) = helper::get_pxar_fuse_reader(
> +            &payload_archive_name,
> +            client.clone(),
> +            &manifest,
> +            crypt_config.clone(),
> +        )
> +        .await?;
> +        pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
> +    } else {
> +        pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
> +    };

we have this exact pattern at least twice as well, once for mount, once
for the catalog.. in fact, all four calls to helper::get_pxar_fuse_reader
are in those two call sites, so probably the helper should just be that
whole sequence instead (or the existing helper made internal to a new
helper for this sequence)? :)
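
something like this maybe (untested sketch, signature guessed from the
call sites):

    async fn get_pxar_fuse_accessor(
        archive_name: &str,
        client: Arc<BackupReader>,
        manifest: &BackupManifest,
        crypt_config: Option<Arc<CryptConfig>>,
    ) -> Result<pbs_pxar_fuse::Accessor, Error> {
        let (archive_name, payload_archive_name) = get_pxar_archive_names(archive_name);
        let (reader, archive_size) =
            get_pxar_fuse_reader(&archive_name, client.clone(), manifest, crypt_config.clone())
                .await?;

        let payload_reader = if let Some(payload_archive_name) = payload_archive_name {
            let (payload_reader, _) = get_pxar_fuse_reader(
                &payload_archive_name,
                client.clone(),
                manifest,
                crypt_config.clone(),
            )
            .await?;
            Some(payload_reader)
        } else {
            None
        };

        Ok(pbs_pxar_fuse::Accessor::new(reader, archive_size, payload_reader).await?)
    }

then mount and the catalog shell just call that once.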

>  
>      client.download(CATALOG_NAME, &mut tmpfile).await?;
>      let index = DynamicIndexReader::new(tmpfile)
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives Christian Ebner
@ 2024-04-04 10:01   ` Fabian Grünbichler
  2024-04-04 14:51     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 10:01 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Allows to access the pxar meta archives for navigation and download
> via the Proxmox Backup Server web ui.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - use mpxar and ppxar file extensions
> 
>  www/datastore/Content.js | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/www/datastore/Content.js b/www/datastore/Content.js
> index c2403ff9c..eb25f6ca4 100644
> --- a/www/datastore/Content.js
> +++ b/www/datastore/Content.js
> @@ -1050,7 +1050,7 @@ Ext.define('PBS.DataStoreContent', {
>  		    tooltip: gettext('Browse'),
>  		    getClass: (v, m, { data }) => {
>  			if (
> -			    (data.ty === 'file' && data.filename.endsWith('pxar.didx')) ||
> +			    (data.ty === 'file' && (data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx'))) ||
>  			    (data.ty === 'ns' && !data.root)
>  			) {
>  			    return 'fa fa-folder-open-o';
> @@ -1058,7 +1058,9 @@ Ext.define('PBS.DataStoreContent', {
>  			return 'pmx-hidden';
>  		    },
>  		    isActionDisabled: (v, r, c, i, { data }) =>
> -			!(data.ty === 'file' && data.filename.endsWith('pxar.didx') && data['crypt-mode'] < 3) && data.ty !== 'ns',
> +			!(data.ty === 'file' &&
> +			(data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx')) &&
> +			data['crypt-mode'] < 3) && data.ty !== 'ns',

is this patch needed? the filename now always ends with pxar.didx (note
the missing leading '.') ;)

if we want to keep it and only make non-split archives and the meta
archives browsable, then we need to add the '.'

>  		},
>  	    ],
>  	},
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
@ 2024-04-04 12:54   ` Fabian Grünbichler
  2024-04-04 17:13     ` Christian Ebner
  2024-04-05 11:28   ` Fabian Grünbichler
  1 sibling, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 12:54 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> The helper method allows to lookup the entries of a dynamic index
> which fully cover a given offset range. Further, the helper returns
> the start padding from the start offset of the dynamic index entry
> to the start offset of the given range and the end padding.
> 
> This will be used to lookup size and digest for chunks covering the
> payload range of a regular file in order to re-use found chunks by
> indexing them in the archives index file instead of re-encoding the
> payload.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - moved this from the dynamic index to the pxar create as suggested
> - refactored and optimized search, going for linear search to find the
>   end entry
> - reworded commit message
> 
>  pbs-client/src/pxar/create.rs | 63 +++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 2bb5a6253..e2d3954ca 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet};
>  use std::ffi::{CStr, CString, OsStr};
>  use std::fmt;
>  use std::io::{self, Read};
> +use std::ops::Range;
>  use std::os::unix::ffi::OsStrExt;
>  use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
>  use std::path::{Path, PathBuf};
> @@ -16,6 +17,7 @@ use nix::fcntl::OFlag;
>  use nix::sys::stat::{FileStat, Mode};
>  
>  use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
> +use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
>  use pxar::encoder::{LinkOffset, SeqWrite};
>  use pxar::Metadata;
> @@ -25,6 +27,7 @@ use proxmox_lang::c_str;
>  use proxmox_sys::fs::{self, acl, xattr};
>  
>  use pbs_datastore::catalog::BackupCatalogWriter;
> +use pbs_datastore::dynamic_index::DynamicIndexReader;
>  
>  use crate::pxar::metadata::errno_is_unsupported;
>  use crate::pxar::tools::assert_single_path_component;
> @@ -791,6 +794,66 @@ impl Archiver {
>      }
>  }
>  
> +/// Dynamic Entry reusable by payload references
> +#[derive(Clone, Debug)]
> +#[repr(C)]
> +pub struct ReusableDynamicEntry {
> +    size_le: u64,
> +    digest: [u8; 32],
> +}
> +
> +impl ReusableDynamicEntry {
> +    #[inline]
> +    pub fn size(&self) -> u64 {
> +        u64::from_le(self.size_le)
> +    }
> +
> +    #[inline]
> +    pub fn digest(&self) -> [u8; 32] {
> +        self.digest
> +    }
> +}
> +
> +/// List of dynamic entries containing the data given by an offset range
> +fn lookup_dynamic_entries(
> +    index: &DynamicIndexReader,
> +    range: Range<u64>,
> +) -> Result<(Vec<ReusableDynamicEntry>, u64, u64), Error> {
> +    let end_idx = index.index_count() - 1;
> +    let chunk_end = index.chunk_end(end_idx);
> +    let start = index.binary_search(0, 0, end_idx, chunk_end, range.start)?;
> +    let mut end = start;
> +    while end < end_idx {
> +        if range.end < index.chunk_end(end) {
> +            break;
> +        }
> +        end += 1;
> +    }

this loop here

> +
> +    let offset_first = if start == 0 {
> +        0
> +    } else {
> +        index.chunk_end(start - 1)
> +    };

offset_first is prev_end, so maybe we could just name it like that from
the start?

> +
> +    let padding_start = range.start - offset_first;
> +    let padding_end = index.chunk_end(end) - range.end;
> +
> +    let mut indices = Vec::new();
> +    let mut prev_end = offset_first;
> +    for dynamic_entry in &index.index()[start..end + 1] {
> +        let size = dynamic_entry.end() - prev_end;
> +        let reusable_dynamic_entry = ReusableDynamicEntry {
> +            size_le: size.to_le(),
> +            digest: dynamic_entry.digest(),
> +        };
> +        prev_end += size;
> +        indices.push(reusable_dynamic_entry);
> +    }

and this one here could probably be combined?

> +
> +    Ok((indices, padding_start, padding_end))
> +}

e.g., the whole thing could become something like (untested ;)):

    let end_idx = index.index_count() - 1;
    let chunk_end = index.chunk_end(end_idx);
    let start = index.binary_search(0, 0, end_idx, chunk_end, range.start)?;

    let mut prev_end = if start == 0 {
        0
    } else {
        index.chunk_end(start - 1)
    };
    let padding_start = range.start - prev_end;
    let mut padding_end = 0;

    let mut indices = Vec::new();
    for dynamic_entry in &index.index()[start..] {
        let end = dynamic_entry.end();

        let reusable_dynamic_entry = ReusableDynamicEntry {
            size_le: (end - prev_end).to_le(),
            digest: dynamic_entry.digest(),
        };
        indices.push(reusable_dynamic_entry);
        prev_end = end;

        // the entry covering range.end must be included as well, so
        // only break after pushing it
        if range.end < end {
            padding_end = end - range.end;
            break;
        }
    }

    Ok((indices, padding_start, padding_end))

> +
>  fn get_metadata(
>      fd: RawFd,
>      stat: &FileStat,
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable
  2024-04-04  9:43   ` Fabian Grünbichler
@ 2024-04-04 13:29     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04 13:29 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 11:43, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Cover the cases where the pxar archive was uploaded as split payload
>> data and metadata streams. Instantiate the required reader and
>> decoder instances to access the metadata and payload data archives.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - use mpxar and ppxar file extensions
>> - use helpers for archive names and fuse reader
>>
>>   proxmox-backup-client/src/mount.rs | 54 ++++++++++++++++++++----------
>>   1 file changed, 37 insertions(+), 17 deletions(-)
>>
>> diff --git a/proxmox-backup-client/src/mount.rs b/proxmox-backup-client/src/mount.rs
>> index 4a2f83357..7bcd581be 100644
>> --- a/proxmox-backup-client/src/mount.rs
>> +++ b/proxmox-backup-client/src/mount.rs
>> @@ -21,17 +21,16 @@ use pbs_api_types::BackupNamespace;
>>   use pbs_client::tools::key_source::get_encryption_key_password;
>>   use pbs_client::{BackupReader, RemoteChunkReader};
>>   use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>> -use pbs_datastore::dynamic_index::BufferedDynamicReader;
>>   use pbs_datastore::index::IndexFile;
>>   use pbs_key_config::load_and_decrypt_key;
>>   use pbs_tools::crypt_config::CryptConfig;
>>   use pbs_tools::json::required_string_param;
>>   
>> +use crate::helper;
>>   use crate::{
>>       complete_group_or_snapshot, complete_img_archive_name, complete_namespace,
>>       complete_pxar_archive_name, complete_repository, connect, dir_or_last_from_group,
>> -    extract_repository_from_value, optional_ns_param, record_repository, BufferedDynamicReadAt,
>> -    REPO_URL_SCHEMA,
>> +    extract_repository_from_value, optional_ns_param, record_repository, REPO_URL_SCHEMA,
>>   };
>>   
>>   #[sortable]
>> @@ -219,7 +218,10 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>>           }
>>       };
>>   
>> -    let server_archive_name = if archive_name.ends_with(".pxar") {
>> +    let server_archive_name = if archive_name.ends_with(".pxar")
>> +        || archive_name.ends_with(".mpxar")
>> +        || archive_name.ends_with(".ppxar")
>> +    {
>>           if target.is_none() {
>>               bail!("use the 'mount' command to mount pxar archives");
>>           }
>> @@ -246,6 +248,16 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>>       let (manifest, _) = client.download_manifest().await?;
>>       manifest.check_fingerprint(crypt_config.as_ref().map(Arc::as_ref))?;
>>   
>> +    let base_name = server_archive_name
>> +        .strip_suffix(".mpxar.didx")
>> +        .or_else(|| server_archive_name.strip_suffix(".ppxar.didx"));
>> +
>> +    let server_archive_name = if let Some(base) = base_name {
>> +        format!("{base}.mpxar.didx")
>> +    } else {
>> +        server_archive_name.to_owned()
>> +    };
>> +
> 
> should this be switched to the get_pxar_archive_names helper?

Yes, indeed! Moved the helper invocation up here and removed the latter 
(now unneeded) one.

> 
>>       let file_info = manifest.lookup_file_info(&server_archive_name)?;
>>   
>>       let daemonize = || -> Result<(), Error> {
>> @@ -283,20 +295,28 @@ async fn mount_do(param: Value, pipe: Option<OwnedFd>) -> Result<Value, Error> {
>>           futures::future::select(interrupt_int.recv().boxed(), interrupt_term.recv().boxed());
>>   
>>       if server_archive_name.ends_with(".didx") {
>> -        let index = client
>> -            .download_dynamic_index(&manifest, &server_archive_name)
>> -            .await?;
>> -        let most_used = index.find_most_used_chunks(8);
>> -        let chunk_reader = RemoteChunkReader::new(
>> +        let (archive_name, payload_archive_name) =
>> +            helper::get_pxar_archive_names(&server_archive_name);
> 
> and then we can skip this one here ;)
> 
>> +        let (reader, archive_size) = helper::get_pxar_fuse_reader(
>> +            &archive_name,
>>               client.clone(),
>> -            crypt_config,
>> -            file_info.chunk_crypt_mode(),
>> -            most_used,
>> -        );
>> -        let reader = BufferedDynamicReader::new(index, chunk_reader);
>> -        let archive_size = reader.archive_size();
>> -        let reader: pbs_pxar_fuse::Reader = Arc::new(BufferedDynamicReadAt::new(reader));
>> -        let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
>> +            &manifest,
>> +            crypt_config.clone(),
>> +        )
>> +        .await?;
>> +
>> +        let decoder = if let Some(payload_archive_name) = payload_archive_name {
>> +            let (payload_reader, _) = helper::get_pxar_fuse_reader(
>> +                &payload_archive_name,
>> +                client.clone(),
>> +                &manifest,
>> +                crypt_config.clone(),
>> +            )
>> +            .await?;
>> +            pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
>> +        } else {
>> +            pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
>> +        };
>>   
>>           let session =
>>               pbs_pxar_fuse::Session::mount(decoder, options, false, Path::new(target.unwrap()))
>> -- 
>> 2.39.2
>>
>>
>>
>> _______________________________________________
>> pbs-devel mailing list
>> pbs-devel@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector Christian Ebner
@ 2024-04-04 14:24   ` Fabian Grünbichler
  2024-04-05 10:26     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 14:24 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> In order to be included in the backup's index file, reused payload
> chunks have to be injected into the payload upload stream.
> 
> The chunker forces a chunk boundary and queues the list of chunks to
> be uploaded thereafter.
> 
> This implements the logic to inject the chunks into the chunk upload
> stream after such a boundary is requested, by looping over the queued
> chunks and inserting them into the stream.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - no changes
> 
>  pbs-client/src/inject_reused_chunks.rs | 152 +++++++++++++++++++++++++
>  pbs-client/src/lib.rs                  |   1 +
>  2 files changed, 153 insertions(+)
>  create mode 100644 pbs-client/src/inject_reused_chunks.rs
> 
> diff --git a/pbs-client/src/inject_reused_chunks.rs b/pbs-client/src/inject_reused_chunks.rs
> new file mode 100644
> index 000000000..5cc19ce5d
> --- /dev/null
> +++ b/pbs-client/src/inject_reused_chunks.rs
> @@ -0,0 +1,152 @@
> +use std::collections::VecDeque;
> +use std::pin::Pin;
> +use std::sync::atomic::{AtomicUsize, Ordering};
> +use std::sync::{Arc, Mutex};
> +use std::task::{Context, Poll};
> +
> +use anyhow::{anyhow, Error};
> +use futures::{ready, Stream};
> +use pin_project_lite::pin_project;
> +
> +use crate::pxar::create::ReusableDynamicEntry;
> +
> +pin_project! {
> +    pub struct InjectReusedChunksQueue<S> {
> +        #[pin]
> +        input: S,
> +        current: Option<InjectChunks>,
> +        buffer: Option<bytes::BytesMut>,
> +        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
> +        stream_len: Arc<AtomicUsize>,
> +        reused_len: Arc<AtomicUsize>,
> +        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,
> +    }
> +}
> +
> +#[derive(Debug)]
> +pub struct InjectChunks {
> +    pub boundary: u64,
> +    pub chunks: Vec<ReusableDynamicEntry>,
> +    pub size: usize,
> +}
> +
> +pub enum InjectedChunksInfo {
> +    Known(Vec<(u64, [u8; 32])>),
> +    Raw((u64, bytes::BytesMut)),

these might benefit from a comment or typedef to explain what the
u64s are..
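
e.g., something along these lines (just a sketch, the alias name is made
up, semantics taken from how stream_len is updated in poll_next below):

    /// Byte offset within the upload stream at which the data starts.
    pub type StreamOffset = u64;

    pub enum InjectedChunksInfo {
        /// Reused chunks, as (start offset, digest) pairs.
        Known(Vec<(StreamOffset, [u8; 32])>),
        /// Newly chunked raw data, tagged with its start offset.
        Raw((StreamOffset, bytes::BytesMut)),
    }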

> +}
> +
> +pub trait InjectReusedChunks: Sized {
> +    fn inject_reused_chunks(
> +        self,
> +        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
> +        stream_len: Arc<AtomicUsize>,
> +        reused_len: Arc<AtomicUsize>,
> +        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,

this doesn't actually need to be an Option I think. it's always there
after all, and we just need the Arc<Mutex<_>> to ensure updates are
serialized. for the final `finish` call we can just use Arc::into_inner
and Mutex::into_inner to get the owned Sha256 out of it (there can't be
any `update`s afterwards in any case).
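
a sketch of that final extraction (helper name made up, assuming the
Option wrapper is gone and all other Arc clones have been dropped):

    use std::sync::{Arc, Mutex};

    fn finish_csum(index_csum: Arc<Mutex<openssl::sha::Sha256>>) -> [u8; 32] {
        Arc::into_inner(index_csum)
            .expect("index_csum still shared")
            .into_inner()
            .expect("index_csum mutex poisoned")
            .finish()
    }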

> +    ) -> InjectReusedChunksQueue<Self>;
> +}
> +
> +impl<S> InjectReusedChunks for S
> +where
> +    S: Stream<Item = Result<bytes::BytesMut, Error>>,
> +{
> +    fn inject_reused_chunks(
> +        self,
> +        injection_queue: Arc<Mutex<VecDeque<InjectChunks>>>,
> +        stream_len: Arc<AtomicUsize>,
> +        reused_len: Arc<AtomicUsize>,
> +        index_csum: Arc<Mutex<Option<openssl::sha::Sha256>>>,
> +    ) -> InjectReusedChunksQueue<Self> {
> +        InjectReusedChunksQueue {
> +            input: self,
> +            current: None,
> +            injection_queue,
> +            buffer: None,
> +            stream_len,
> +            reused_len,
> +            index_csum,
> +        }
> +    }
> +}
> +
> +impl<S> Stream for InjectReusedChunksQueue<S>
> +where
> +    S: Stream<Item = Result<bytes::BytesMut, Error>>,
> +{
> +    type Item = Result<InjectedChunksInfo, Error>;
> +
> +    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {

and this fn here could use some comments as well IMHO ;) I hope I didn't
misunderstand anything and my suggestions below are correct..

> +        let mut this = self.project();
> +        loop {

> +            let current = this.current.take();
> +            if let Some(current) = current {

the take can be inlined here..

> +                let mut chunks = Vec::new();
> +                let mut guard = this.index_csum.lock().unwrap();
> +                let csum = guard.as_mut().unwrap();
> +
> +                for chunk in current.chunks {
> +                    let offset = this
> +                        .stream_len
> +                        .fetch_add(chunk.size() as usize, Ordering::SeqCst)
> +                        as u64;
> +                    this.reused_len
> +                        .fetch_add(chunk.size() as usize, Ordering::SeqCst);
> +                    let digest = chunk.digest();
> +                    chunks.push((offset, digest));
> +                    let end_offset = offset + chunk.size();
> +                    csum.update(&end_offset.to_le_bytes());
> +                    csum.update(&digest);
> +                }
> +                let chunk_info = InjectedChunksInfo::Known(chunks);
> +                return Poll::Ready(Some(Ok(chunk_info)));

okay, so this part here takes care of accounting known chunks, updating
the index digest and passing them along

> +            }
> +
> +            let buffer = this.buffer.take();
> +            if let Some(buffer) = buffer {

take can be inlined

> +                let offset = this.stream_len.fetch_add(buffer.len(), Ordering::SeqCst) as u64;
> +                let data = InjectedChunksInfo::Raw((offset, buffer));
> +                return Poll::Ready(Some(Ok(data)));

this part here takes care of accounting for and passing along a new chunk
and its data, if it had to be buffered because injected chunks came
first..

> +            }
> +

> +            match ready!(this.input.as_mut().poll_next(cx)) {
> +                None => return Poll::Ready(None),

that one's purpose is pretty clear - should we also check that there is
no more injection stuff in the queue here as that would mean something
fundamental went wrong?
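
e.g. a small helper along these lines (sketch, name and error message
made up):

    use std::collections::VecDeque;
    use std::sync::{Arc, Mutex};

    use anyhow::{bail, Error};

    /// Sanity check: an exhausted input with injections still queued means
    /// boundaries and stream got out of sync somewhere upstream.
    fn assert_injections_drained(
        injection_queue: &Arc<Mutex<VecDeque<InjectChunks>>>,
    ) -> Result<(), Error> {
        if !injection_queue.lock().unwrap().is_empty() {
            bail!("stream finished, but injection queue is not empty");
        }
        Ok(())
    }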

> +                Some(Err(err)) => return Poll::Ready(Some(Err(err))),
> +                Some(Ok(raw)) => {
> +                    let chunk_size = raw.len();
> +                    let offset = this.stream_len.load(Ordering::SeqCst) as u64;
> +                    let mut injections = this.injection_queue.lock().unwrap();
> +
> +                    if let Some(inject) = injections.pop_front() {
> +                        if inject.boundary == offset {

if we do what I suggest below (X), then this branch here is the only one
touching current and buffer. that in turn means we can inline the
handling of current (dropping it altogether from
InjectReusedChunksQueue) and drop the loop and continue

> +                            if this.current.replace(inject).is_some() {
> +                                return Poll::Ready(Some(Err(anyhow!(
> +                                    "replaced injection queue not empty"
> +                                ))));
> +                            }
> +                            if chunk_size > 0 && this.buffer.replace(raw).is_some() {

I guess this means chunks with size 0 are used in case everything is
re-used? or is the difference between here and below (where chunk_size 0
is a fatal error) accidental?

> +                                return Poll::Ready(Some(Err(anyhow!(
> +                                    "replaced buffer not empty"
> +                                ))));

with all the other changes, this should be impossible to trigger.. then
again, it probably doesn't hurt as a safeguard either..

> +                            }
> +                            continue;
> +                        } else if inject.boundary == offset + chunk_size as u64 {
> +                            let _ = this.current.insert(inject);

X: since we add the chunk size to the offset below, this means that the next poll
ends up in the previous branch of the if (boundary == offset), even if
we remove this whole condition and branch

> +                        } else if inject.boundary < offset + chunk_size as u64 {
> +                            return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
> +                        } else {
> +                            injections.push_front(inject);

I normally dislike this kind of code (pop - check - push), but I guess
here it doesn't hurt throughput too much since the rest of the things we
do is more expensive anyway..
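
peeking first and only popping once we know the entry gets consumed would
avoid it, e.g. (rough sketch, drop-in fragment for the match arm above):

    // copy the boundary out while peeking, so the queue is only mutated
    // when the entry is actually consumed
    if let Some(boundary) = injections.front().map(|inject| inject.boundary) {
        if boundary == offset {
            let inject = injections.pop_front().expect("front was just checked");
            // ... account for and emit the injected chunks ...
        } else if boundary < offset + chunk_size as u64 {
            return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
        }
        // otherwise leave the entry queued for a later boundary
    }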

> +                        }
> +                    }
> +
> +                    if chunk_size == 0 {
> +                        return Poll::Ready(Some(Err(anyhow!("unexpected empty raw data"))));
> +                    }
> +
> +                    let offset = this.stream_len.fetch_add(chunk_size, Ordering::SeqCst) as u64;
> +                    let data = InjectedChunksInfo::Raw((offset, raw));
> +
> +                    return Poll::Ready(Some(Ok(data)));
> +                }
> +            }
> +        }
> +    }
> +}

anyhow, here's a slightly simplified version of poll_next:

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
        let mut this = self.project();

        if let Some(buffer) = this.buffer.take() {
            let offset = this.stream_len.fetch_add(buffer.len(), Ordering::SeqCst) as u64;
            let data = InjectedChunksInfo::Raw((offset, buffer));
            return Poll::Ready(Some(Ok(data)));
        }

        match ready!(this.input.as_mut().poll_next(cx)) {
            None => Poll::Ready(None),
            Some(Err(err)) => Poll::Ready(Some(Err(err))),
            Some(Ok(raw)) => {
                let chunk_size = raw.len();
                let offset = this.stream_len.load(Ordering::SeqCst) as u64;
                let mut injections = this.injection_queue.lock().unwrap();

                if let Some(inject) = injections.pop_front() {
                    // inject chunk now, buffer incoming data for later
                    if inject.boundary == offset {
                        if chunk_size > 0 && this.buffer.replace(raw).is_some() {
                            return Poll::Ready(Some(Err(anyhow!("replaced buffer not empty"))));
                        }
                        let mut chunks = Vec::new();
                        let mut csum = this.index_csum.lock().unwrap();

                        for chunk in inject.chunks {
                            let offset = this
                                .stream_len
                                .fetch_add(chunk.size() as usize, Ordering::SeqCst)
                                as u64;
                            this.reused_len
                                .fetch_add(chunk.size() as usize, Ordering::SeqCst);
                            let digest = chunk.digest();
                            chunks.push((offset, digest));
                            let end_offset = offset + chunk.size();
                            csum.update(&end_offset.to_le_bytes());
                            csum.update(&digest);
                        }
                        let chunk_info = InjectedChunksInfo::Known(chunks);
                        return Poll::Ready(Some(Ok(chunk_info)));
                    } else if inject.boundary < offset + chunk_size as u64 {
                        return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
                    } else {
                        injections.push_front(inject);
                    }
                }

                if chunk_size == 0 {
                    return Poll::Ready(Some(Err(anyhow!("unexpected empty raw data"))));
                }

                let offset = this.stream_len.fetch_add(chunk_size, Ordering::SeqCst) as u64;
                let data = InjectedChunksInfo::Raw((offset, raw));

                Poll::Ready(Some(Ok(data)))
            }
        }
    }

this has the "index_csum is no longer an Option" change folded in, so it
requires a few adaptations in other parts as well.

but I'd like the following even better, since it allows us to get rid of
the buffer altogether:

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
        let mut this = self.project();

        let mut injections = this.injection_queue.lock().unwrap();

        // check whether we have something to inject
        if let Some(inject) = injections.pop_front() {
            let offset = this.stream_len.load(Ordering::SeqCst) as u64;

            if inject.boundary == offset {
                // inject now
                let mut chunks = Vec::new();
                let mut csum = this.index_csum.lock().unwrap();

                // account for injected chunks
                for chunk in inject.chunks {
                    let offset = this
                        .stream_len
                        .fetch_add(chunk.size() as usize, Ordering::SeqCst)
                        as u64;
                    this.reused_len
                        .fetch_add(chunk.size() as usize, Ordering::SeqCst);
                    let digest = chunk.digest();
                    chunks.push((offset, digest));
                    let end_offset = offset + chunk.size();
                    csum.update(&end_offset.to_le_bytes());
                    csum.update(&digest);
                }
                let chunk_info = InjectedChunksInfo::Known(chunks);
                return Poll::Ready(Some(Ok(chunk_info)));
            } else if inject.boundary < offset {
                // incoming new chunks and injections didn't line up?
                return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
            } else {
                // inject later
                injections.push_front(inject);
            }
        }

        // nothing to inject now, let's see if there's further input
        match ready!(this.input.as_mut().poll_next(cx)) {
            None => Poll::Ready(None),
            Some(Err(err)) => Poll::Ready(Some(Err(err))),
            Some(Ok(raw)) if raw.is_empty() => {
                Poll::Ready(Some(Err(anyhow!("unexpected empty raw data"))))
            }
            Some(Ok(raw)) => {
                let offset = this.stream_len.fetch_add(raw.len(), Ordering::SeqCst) as u64;
                let data = InjectedChunksInfo::Raw((offset, raw));

                Poll::Ready(Some(Ok(data)))
            }
        }
    }

but technically all this accounting could move back to the backup_writer
as well, if the injected chunk info also contained the size..
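
e.g. (sketch, layout made up - the Known variant simply gains the size):

    pub enum InjectedChunksInfo {
        /// Reused chunks, as (start offset, size, digest) triples, so the
        /// backup_writer can do the stream_len/reused_len accounting itself.
        Known(Vec<(u64, u64, [u8; 32])>),
        /// Newly chunked raw data and the offset it starts at.
        Raw((u64, bytes::BytesMut)),
    }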

> diff --git a/pbs-client/src/lib.rs b/pbs-client/src/lib.rs
> index 21cf8556b..3e7bd2a8b 100644
> --- a/pbs-client/src/lib.rs
> +++ b/pbs-client/src/lib.rs
> @@ -7,6 +7,7 @@ pub mod catalog_shell;
>  pub mod pxar;
>  pub mod tools;
>  
> +mod inject_reused_chunks;
>  mod merge_known_chunks;
>  pub mod pipe_to_stream;
>  
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives
  2024-04-04 10:01   ` Fabian Grünbichler
@ 2024-04-04 14:51     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04 14:51 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 12:01, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Allows accessing the pxar meta archives for navigation and download
>> via the Proxmox Backup Server web UI.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - use mpxar and ppxar file extensions
>>
>>   www/datastore/Content.js | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/www/datastore/Content.js b/www/datastore/Content.js
>> index c2403ff9c..eb25f6ca4 100644
>> --- a/www/datastore/Content.js
>> +++ b/www/datastore/Content.js
>> @@ -1050,7 +1050,7 @@ Ext.define('PBS.DataStoreContent', {
>>   		    tooltip: gettext('Browse'),
>>   		    getClass: (v, m, { data }) => {
>>   			if (
>> -			    (data.ty === 'file' && data.filename.endsWith('pxar.didx')) ||
>> +			    (data.ty === 'file' && (data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx'))) ||
>>   			    (data.ty === 'ns' && !data.root)
>>   			) {
>>   			    return 'fa fa-folder-open-o';
>> @@ -1058,7 +1058,9 @@ Ext.define('PBS.DataStoreContent', {
>>   			return 'pmx-hidden';
>>   		    },
>>   		    isActionDisabled: (v, r, c, i, { data }) =>
>> -			!(data.ty === 'file' && data.filename.endsWith('pxar.didx') && data['crypt-mode'] < 3) && data.ty !== 'ns',
>> +			!(data.ty === 'file' &&
>> +			(data.filename.endsWith('pxar.didx') || data.filename.endsWith('mpxar.didx')) &&
>> +			data['crypt-mode'] < 3) && data.ty !== 'ns',
> 
> is this patch needed? the filename now always ends with pxar.didx (note
> the missing leading '.') ;)
> 
> if we want to keep it and only make non-split archives and the meta
> archives browsable, then we need to add the '.'

True, I will add the dot so only pxar and mpxar files can be browsed, as 
otherwise it might be confusing (although we allow the ppxar to be used 
for the cli commands).

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues Christian Ebner
@ 2024-04-04 14:52   ` Fabian Grünbichler
  2024-04-08 13:54     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 14:52 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Adds a queue to the chunk stream to request forced boundaries at a
> given offset within the stream and inject reused dynamic entries
> after this boundary.
> 
> The chunks are then passed along to the uploader stream using the
> injection queue, which inserts them during upload.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - combined queues into new optional struct
> - refactoring
> 
>  examples/test_chunk_speed2.rs                 |  2 +-
>  pbs-client/src/backup_writer.rs               | 89 +++++++++++--------
>  pbs-client/src/chunk_stream.rs                | 36 +++++++-
>  pbs-client/src/pxar/create.rs                 |  6 +-
>  pbs-client/src/pxar_backup_stream.rs          |  7 +-
>  proxmox-backup-client/src/main.rs             | 31 ++++---
>  .../src/proxmox_restore_daemon/api.rs         |  1 +
>  pxar-bin/src/main.rs                          |  1 +
>  tests/catar.rs                                |  1 +
>  9 files changed, 121 insertions(+), 53 deletions(-)
> 
> diff --git a/examples/test_chunk_speed2.rs b/examples/test_chunk_speed2.rs
> index 3f69b436d..22dd14ce2 100644
> --- a/examples/test_chunk_speed2.rs
> +++ b/examples/test_chunk_speed2.rs
> @@ -26,7 +26,7 @@ async fn run() -> Result<(), Error> {
>          .map_err(Error::from);
>  
>      //let chunk_stream = FixedChunkStream::new(stream, 4*1024*1024);
> -    let mut chunk_stream = ChunkStream::new(stream, None);
> +    let mut chunk_stream = ChunkStream::new(stream, None, None);
>  
>      let start_time = std::time::Instant::now();
>  
> diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
> index 8bd0e4f36..032d93da7 100644
> --- a/pbs-client/src/backup_writer.rs
> +++ b/pbs-client/src/backup_writer.rs
> @@ -1,4 +1,4 @@
> -use std::collections::HashSet;
> +use std::collections::{HashSet, VecDeque};
>  use std::future::Future;
>  use std::os::unix::fs::OpenOptionsExt;
>  use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
> @@ -23,6 +23,7 @@ use pbs_tools::crypt_config::CryptConfig;
>  
>  use proxmox_human_byte::HumanByte;
>  
> +use super::inject_reused_chunks::{InjectChunks, InjectReusedChunks, InjectedChunksInfo};
>  use super::merge_known_chunks::{MergeKnownChunks, MergedChunkInfo};
>  
>  use super::{H2Client, HttpClient};
> @@ -265,6 +266,7 @@ impl BackupWriter {
>          archive_name: &str,
>          stream: impl Stream<Item = Result<bytes::BytesMut, Error>>,
>          options: UploadOptions,
> +        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>      ) -> Result<BackupStats, Error> {
>          let known_chunks = Arc::new(Mutex::new(HashSet::new()));
>  
> @@ -341,6 +343,7 @@ impl BackupWriter {
>                  None
>              },
>              options.compress,
> +            injection_queue,
>          )
>          .await?;
>  
> @@ -637,6 +640,7 @@ impl BackupWriter {
>          known_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>          crypt_config: Option<Arc<CryptConfig>>,
>          compress: bool,
> +        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>      ) -> impl Future<Output = Result<UploadStats, Error>> {
>          let total_chunks = Arc::new(AtomicUsize::new(0));
>          let total_chunks2 = total_chunks.clone();
> @@ -663,48 +667,63 @@ impl BackupWriter {
>          let index_csum_2 = index_csum.clone();
>  
>          stream
> -            .and_then(move |data| {
> -                let chunk_len = data.len();
> +            .inject_reused_chunks(
> +                injection_queue.unwrap_or_default(),
> +                stream_len,
> +                reused_len.clone(),
> +                index_csum.clone(),
> +            )
> +            .and_then(move |chunk_info| match chunk_info {

for this part here I am still not sure whether doing all of the
accounting here (in the backup_writer) wouldn't be nicer..

> [..]

> diff --git a/pbs-client/src/chunk_stream.rs b/pbs-client/src/chunk_stream.rs
> index a45420ca0..6ac0c638b 100644
> --- a/pbs-client/src/chunk_stream.rs
> +++ b/pbs-client/src/chunk_stream.rs
> @@ -38,15 +38,17 @@ pub struct ChunkStream<S: Unpin> {
>      chunker: Chunker,
>      buffer: BytesMut,
>      scan_pos: usize,
> +    injection_data: Option<InjectionData>,
>  }
>  
>  impl<S: Unpin> ChunkStream<S> {
> -    pub fn new(input: S, chunk_size: Option<usize>) -> Self {
> +    pub fn new(input: S, chunk_size: Option<usize>, injection_data: Option<InjectionData>) -> Self {
>          Self {
>              input,
>              chunker: Chunker::new(chunk_size.unwrap_or(4 * 1024 * 1024)),
>              buffer: BytesMut::new(),
>              scan_pos: 0,
> +            injection_data,
>          }
>      }
>  }
> @@ -64,6 +66,34 @@ where
>      fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
>          let this = self.get_mut();
>          loop {
> +            if let Some(InjectionData {
> +                boundaries,
> +                injections,
> +                consumed,
> +            }) = this.injection_data.as_mut()
> +            {
> +                // Make sure to release this lock as soon as possible
> +                let mut boundaries = boundaries.lock().unwrap();
> +                if let Some(inject) = boundaries.pop_front() {

here I am a bit more wary that this popping and re-pushing might hurt
performance..

> +                    let max = *consumed + this.buffer.len() as u64;
> +                    if inject.boundary <= max {
> +                        let chunk_size = (inject.boundary - *consumed) as usize;
> +                        let result = this.buffer.split_to(chunk_size);

a comment or better variable naming would make this easier to follow
along..

"result" is a forced chunk that is created here because we've reached a
point where we want to inject something afterwards..

once more I am wondering here whether for the payload stream, a vastly
simplified chunker that just picks the boundaries based on re-use and
payload size(s) (to avoid the one file == one chunk pathological case
for lots of small files) wouldn't improve performance :)
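
such a payload chunker could be almost trivial, e.g. (purely hypothetical
sketch, not part of this series):

    /// Cut purely on accumulated payload size; forced boundaries for
    /// re-used chunks would still be injected from the outside as above.
    struct SimplePayloadChunker {
        target_size: usize,
    }

    impl SimplePayloadChunker {
        /// Returns the number of bytes to split off, or 0 to keep buffering.
        fn scan(&self, buffered: usize) -> usize {
            if buffered >= self.target_size {
                self.target_size
            } else {
                0
            }
        }
    }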

> +                        *consumed += chunk_size as u64;
> +                        this.scan_pos = 0;
> +
> +                        // Add the size of the injected chunks to consumed, so chunk stream offsets
> +                        // are in sync with the rest of the archive.
> +                        *consumed += inject.size as u64;
> +
> +                        injections.lock().unwrap().push_back(inject);
> +
> +                        return Poll::Ready(Some(Ok(result)));
> +                    }
> +                    boundaries.push_front(inject);
> +                }
> +            }
> +
>              if this.scan_pos < this.buffer.len() {
>                  let boundary = this.chunker.scan(&this.buffer[this.scan_pos..]);
>  
> @@ -74,7 +104,11 @@ where
>                      // continue poll
>                  } else if chunk_size <= this.buffer.len() {
>                      let result = this.buffer.split_to(chunk_size);
> +                    if let Some(InjectionData { consumed, .. }) = this.injection_data.as_mut() {
> +                        *consumed += chunk_size as u64;
> +                    }
>                      this.scan_pos = 0;
> +
>                      return Poll::Ready(Some(Ok(result)));
>                  } else {
>                      panic!("got unexpected chunk boundary from chunker");

> [..]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification Christian Ebner
@ 2024-04-04 14:54   ` Fabian Grünbichler
  2024-04-08 13:36     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 14:54 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Adds the specification for switching the detection mode used to
> identify regular files which changed since a reference backup run.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - removed unneeded vector storing archive names for which to enable
>   metadata mode, set either for all or none
> 
>  pbs-client/src/backup_specification.rs | 40 ++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/pbs-client/src/backup_specification.rs b/pbs-client/src/backup_specification.rs
> index 619a3a9da..4b1dbd188 100644
> --- a/pbs-client/src/backup_specification.rs
> +++ b/pbs-client/src/backup_specification.rs
> @@ -4,6 +4,7 @@ use proxmox_schema::*;
>  
>  const_regex! {
>      BACKUPSPEC_REGEX = r"^([a-zA-Z0-9_-]+\.(pxar|img|conf|log)):(.+)$";
> +    DETECTION_MODE_REGEX = r"^(data|metadata(:[a-zA-Z0-9_-]+\.pxar)*)$";

this

>  }
>  
>  pub const BACKUP_SOURCE_SCHEMA: Schema =
> @@ -11,6 +12,11 @@ pub const BACKUP_SOURCE_SCHEMA: Schema =
>          .format(&ApiStringFormat::Pattern(&BACKUPSPEC_REGEX))
>          .schema();
>  
> +pub const BACKUP_DETECTION_MODE_SPEC: Schema =
> +    StringSchema::new("Backup source specification ([data|metadata(:<label>,...)]).")

and this

> +        .format(&ApiStringFormat::Pattern(&DETECTION_MODE_REGEX))
> +        .schema();
> +
>  pub enum BackupSpecificationType {
>      PXAR,
>      IMAGE,
> @@ -45,3 +51,37 @@ pub fn parse_backup_specification(value: &str) -> Result<BackupSpecification, Er
>  
>      bail!("unable to parse backup source specification '{}'", value);
>  }
> +
> +/// Mode to detect file changes since last backup run
> +pub enum BackupDetectionMode {
> +    /// Regular mode, re-encode payload data
> +    Data,
> +    /// Compare metadata, reuse payload chunks if metadata unchanged
> +    Metadata,

and this now do not really match anymore?
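
if the per-archive labels are really gone, something like this would
bring the three back in line (sketch, description wording made up):

    const_regex! {
        DETECTION_MODE_REGEX = r"^(data|metadata)$";
    }

    pub const BACKUP_DETECTION_MODE_SPEC: Schema =
        StringSchema::new("Backup detection mode specification (data|metadata).")
            .format(&ApiStringFormat::Pattern(&DETECTION_MODE_REGEX))
            .schema();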

> +}
> +
> +impl BackupDetectionMode {
> +    /// Check if the selected mode is metadata based file change detection
> +    pub fn is_metadata(&self) -> bool {
> +        match self {
> +            Self::Data => false,
> +            Self::Metadata => true,
> +        }
> +    }
> +}
> +
> +pub fn parse_backup_detection_mode_specification(
> +    value: &str,
> +) -> Result<BackupDetectionMode, Error> {
> +    match (DETECTION_MODE_REGEX.regex_obj)().captures(value) {
> +        Some(caps) => {
> +            let mode = match caps.get(1).unwrap().as_str() {
> +                "data" => BackupDetectionMode::Data,
> +                "metadata" => BackupDetectionMode::Metadata,
> +                _ => bail!("invalid backup detection mode"),
> +            };
> +            Ok(mode)
> +        }
> +        None => bail!("unable to parse backup detection mode specification '{value}'"),
> +    }
> +}
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver Christian Ebner
@ 2024-04-04 15:04   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-04 15:04 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Read the previous snapshot's manifest and check if a split archive
> with the same name is given. If so, create the accessor instance to
> read the previous archive entries to be able to look up and compare
> the metadata for the entries, allowing a decision on whether the
> entry is reusable or not.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - renamed accessor to previous metadata_accessor
> - get backup reader for previous snapshot after creating the writer
>   instance for the new snapshot
> - adapted to only use metadata mode for all or non of the given archives
> 
>  pbs-client/src/pxar/create.rs                 | 55 ++++++++++++++++---
>  proxmox-backup-client/src/main.rs             | 51 ++++++++++++++++-
>  .../src/proxmox_restore_daemon/api.rs         |  1 +
>  pxar-bin/src/main.rs                          |  1 +
>  4 files changed, 97 insertions(+), 11 deletions(-)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 95a91a59b..79925bba2 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -19,7 +19,7 @@ use nix::sys::stat::{FileStat, Mode};
>  use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>  use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
> -use pxar::accessor::aio::Accessor;
> +use pxar::accessor::aio::{Accessor, Directory};
>  use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>  use pxar::Metadata;
>  
> @@ -159,7 +159,7 @@ impl ReusedChunks {
>  }
>  
>  /// Pxar options for creating a pxar archive/stream
> -#[derive(Default, Clone)]
> +#[derive(Default)]
>  pub struct PxarCreateOptions {
>      /// Device/mountpoint st_dev numbers that should be included. None for no limitation.
>      pub device_set: Option<HashSet<u64>>,
> @@ -171,6 +171,8 @@ pub struct PxarCreateOptions {
>      pub skip_lost_and_found: bool,
>      /// Skip xattrs of files that return E2BIG error
>      pub skip_e2big_xattr: bool,
> +    /// Reference state for partial backups
> +    pub previous_ref: Option<PxarPrevRef>,
>  }
>  
>  /// Statefull information of previous backups snapshots for partial backups
> @@ -270,6 +272,7 @@ struct Archiver {
>      file_copy_buffer: Vec<u8>,
>      skip_e2big_xattr: bool,
>      reused_chunks: ReusedChunks,
> +    previous_payload_index: Option<DynamicIndexReader>,
>      forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>  }
>  
> @@ -346,6 +349,15 @@ where
>              MatchType::Exclude,
>          )?);
>      }
> +    let (previous_payload_index, previous_metadata_accessor) =
> +        if let Some(refs) = options.previous_ref {
> +            (
> +                Some(refs.payload_index),
> +                refs.accessor.open_root().await.ok(),
> +            )
> +        } else {
> +            (None, None)
> +        };
>  
>      let mut archiver = Archiver {
>          feature_flags,
> @@ -363,11 +375,12 @@ where
>          file_copy_buffer: vec::undefined(4 * 1024 * 1024),
>          skip_e2big_xattr: options.skip_e2big_xattr,
>          reused_chunks: ReusedChunks::new(),
> +        previous_payload_index,
>          forced_boundaries,
>      };
>  
>      archiver
> -        .archive_dir_contents(&mut encoder, source_dir, true)
> +        .archive_dir_contents(&mut encoder, previous_metadata_accessor, source_dir, true)
>          .await?;
>      encoder.finish().await?;
>      encoder.close().await?;
> @@ -399,6 +412,7 @@ impl Archiver {
>      fn archive_dir_contents<'a, T: SeqWrite + Send>(
>          &'a mut self,
>          encoder: &'a mut Encoder<'_, T>,
> +        mut previous_metadata_accessor: Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
>          mut dir: Dir,
>          is_root: bool,
>      ) -> BoxFuture<'a, Result<(), Error>> {
> @@ -433,9 +447,15 @@ impl Archiver {
>  
>                  (self.callback)(&file_entry.path)?;
>                  self.path = file_entry.path;
> -                self.add_entry(encoder, dir_fd, &file_entry.name, &file_entry.stat)
> -                    .await
> -                    .map_err(|err| self.wrap_err(err))?;
> +                self.add_entry(
> +                    encoder,
> +                    &mut previous_metadata_accessor,
> +                    dir_fd,
> +                    &file_entry.name,
> +                    &file_entry.stat,
> +                )
> +                .await
> +                .map_err(|err| self.wrap_err(err))?;
>              }
>              self.path = old_path;
>              self.entry_counter = entry_counter;
> @@ -683,6 +703,7 @@ impl Archiver {
>      async fn add_entry<T: SeqWrite + Send>(
>          &mut self,
>          encoder: &mut Encoder<'_, T>,
> +        previous_metadata: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
>          parent: RawFd,
>          c_file_name: &CStr,
>          stat: &FileStat,
> @@ -772,7 +793,14 @@ impl Archiver {
>                      catalog.lock().unwrap().start_directory(c_file_name)?;
>                  }
>                  let result = self
> -                    .add_directory(encoder, dir, c_file_name, &metadata, stat)
> +                    .add_directory(
> +                        encoder,
> +                        previous_metadata,
> +                        dir,
> +                        c_file_name,
> +                        &metadata,
> +                        stat,
> +                    )
>                      .await;
>                  if let Some(ref catalog) = self.catalog {
>                      catalog.lock().unwrap().end_directory()?;
> @@ -825,6 +853,7 @@ impl Archiver {
>      async fn add_directory<T: SeqWrite + Send>(
>          &mut self,
>          encoder: &mut Encoder<'_, T>,
> +        previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
>          dir: Dir,
>          dir_name: &CStr,
>          metadata: &Metadata,
> @@ -855,7 +884,17 @@ impl Archiver {
>              log::info!("skipping mount point: {:?}", self.path);
>              Ok(())
>          } else {
> -            self.archive_dir_contents(encoder, dir, false).await
> +            let mut dir_accessor = None;
> +            if let Some(accessor) = previous_metadata_accessor.as_mut() {
> +                if let Some(file_entry) = accessor.lookup(dir_name).await? {
> +                    if file_entry.entry().is_dir() {
> +                        let dir = file_entry.enter_directory().await?;
> +                        dir_accessor = Some(dir);
> +                    }
> +                }
> +            }
> +            self.archive_dir_contents(encoder, dir_accessor, dir, false)
> +                .await
>          };
>  
>          self.fs_magic = old_fs_magic;
> diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
> index 0b747453c..66dcaa63e 100644
> --- a/proxmox-backup-client/src/main.rs
> +++ b/proxmox-backup-client/src/main.rs
> @@ -688,6 +688,10 @@ fn spawn_catalog_upload(
>                 schema: TRAFFIC_CONTROL_BURST_SCHEMA,
>                 optional: true,
>             },
> +           "change-detection-mode": {
> +               schema: BACKUP_DETECTION_MODE_SPEC,
> +               optional: true,
> +           },
>             "exclude": {
>                 type: Array,
>                 description: "List of paths or patterns for matching files to exclude.",
> @@ -882,6 +886,9 @@ async fn create_backup(
>  
>      let backup_time = backup_time_opt.unwrap_or_else(epoch_i64);
>  
> +    let detection_mode = param["change-detection-mode"].as_str().unwrap_or("data");
> +    let detection_mode = parse_backup_detection_mode_specification(detection_mode)?;
> +
>      let http_client = connect_rate_limited(&repo, rate_limit)?;
>      record_repository(&repo);
>  
> @@ -982,7 +989,7 @@ async fn create_backup(
>          None
>      };
>  
> -    let mut manifest = BackupManifest::new(snapshot);
> +    let mut manifest = BackupManifest::new(snapshot.clone());
>  
>      let mut catalog = None;
>      let mut catalog_result_rx = None;
> @@ -1029,14 +1036,13 @@ async fn create_backup(
>                  manifest.add_file(target, stats.size, stats.csum, crypto.mode)?;
>              }
>              (BackupSpecificationType::PXAR, false) => {
> -                let metadata_mode = false; // Until enabled via param
> -
>                  let target_base = if let Some(base) = target_base.strip_suffix(".pxar") {
>                      base.to_string()
>                  } else {
>                      bail!("unexpected suffix in target: {target_base}");
>                  };
>  
> +                let metadata_mode = detection_mode.is_metadata();
>                  let (target, payload_target) = if metadata_mode {
>                      (
>                          format!("{target_base}.mpxar.{extension}"),
> @@ -1061,12 +1067,51 @@ async fn create_backup(
>                      .unwrap()
>                      .start_directory(std::ffi::CString::new(target.as_str())?.as_c_str())?;
>  
> +                let mut previous_ref = None;
> +                if metadata_mode {
> +                    if let Some(ref manifest) = previous_manifest {

client.previous_backup_time() gives you the timestamp without the need
to query all snapshots, and it's guaranteed to match the
previous_manifest since both come from the same (lock-guarded) lookup
when starting the writer/backup H2 session.
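
e.g. (sketch - the exact signature of previous_backup_time is assumed to
be `async fn (&self) -> Result<Option<i64>, Error>` here):

    if let Some(backup_time) = client.previous_backup_time().await? {
        let backup_dir: BackupDir = (snapshot.group.clone(), backup_time).into();
        // ... start the BackupReader for backup_dir as in the patch below ...
    }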

> +                        let list = api_datastore_list_snapshots(
> +                            &http_client,
> +                            repo.store(),
> +                            &backup_ns,
> +                            Some(&snapshot.group),
> +                        )
> +                        .await?;
> +                        let mut list: Vec<SnapshotListItem> = serde_json::from_value(list)?;
> +
> +                        // BackupWriter::start created a new snapshot, get the one before
> +                        if list.len() > 1 {
> +                            list.sort_unstable_by(|a, b| b.backup.time.cmp(&a.backup.time));
> +                            let backup_dir: BackupDir =
> +                                (snapshot.group.clone(), list[1].backup.time).into();
> +                            let backup_reader = BackupReader::start(
> +                                &http_client,
> +                                crypt_config.clone(),
> +                                repo.store(),
> +                                &backup_ns,
> +                                &backup_dir,
> +                                true,
> +                            )
> +                            .await?;
> +                            previous_ref = prepare_reference(
> +                                &target,
> +                                manifest.clone(),
> +                                &client,
> +                                backup_reader.clone(),
> +                                crypt_config.clone(),
> +                            )
> +                            .await?
> +                        }
> +                    }
> +                }
> +
>                  let pxar_options = pbs_client::pxar::PxarCreateOptions {
>                      device_set: devices.clone(),
>                      patterns: pattern_list.clone(),
>                      entries_max: entries_max as usize,
>                      skip_lost_and_found,
>                      skip_e2big_xattr,
> +                    previous_ref,
>                  };
>  
>                  let upload_options = UploadOptions {
> diff --git a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
> index 0883d6cda..e50cb8184 100644
> --- a/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
> +++ b/proxmox-restore-daemon/src/proxmox_restore_daemon/api.rs
> @@ -355,6 +355,7 @@ fn extract(
>                          patterns,
>                          skip_lost_and_found: false,
>                          skip_e2big_xattr: false,
> +                        previous_ref: None,
>                      };
>  
>                      let pxar_writer = TokioWriter::new(writer);
> diff --git a/pxar-bin/src/main.rs b/pxar-bin/src/main.rs
> index d46c98d2b..c6d3794bb 100644
> --- a/pxar-bin/src/main.rs
> +++ b/pxar-bin/src/main.rs
> @@ -358,6 +358,7 @@ async fn create_archive(
>          patterns,
>          skip_lost_and_found: false,
>          skip_e2big_xattr: false,
> +        previous_ref: None,
>      };
>  
>      let source = PathBuf::from(source);
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams
  2024-04-04  9:49   ` Fabian Grünbichler
@ 2024-04-04 15:52     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-04 15:52 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 11:49, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Allows attaching to pxar archives with split metadata and payload
>> streams, by redirecting the payload input to a dedicated reader
>> accessing the payload index.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - use mpxar and ppxar file extensions
>> - use pxar fuse reader helper
>>
>>   proxmox-backup-client/src/catalog.rs | 18 ++++++++++++++++--
>>   1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/proxmox-backup-client/src/catalog.rs b/proxmox-backup-client/src/catalog.rs
>> index 2073e058d..3e52880b9 100644
>> --- a/proxmox-backup-client/src/catalog.rs
>> +++ b/proxmox-backup-client/src/catalog.rs
>> @@ -181,7 +181,10 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
>>           }
>>       };
>>   
>> -    let server_archive_name = if archive_name.ends_with(".pxar") {
>> +    let server_archive_name = if archive_name.ends_with(".pxar")
>> +        || archive_name.ends_with(".mpxar")
>> +        || archive_name.ends_with(".ppxar")
>> +    {
> 
> as with mount - there is a call to get_pxar_archive_names between this
> hunk and the next one (introduced by the previous patch), shouldn't that
> just be moved up here?

But this just adds the `.didx` suffix to the filename, not yet creating
the possible split archive names. I left this in place as is.

> 
>>           format!("{}.didx", archive_name)
>>       } else {
>>           bail!("Can only mount pxar archives.");
>> @@ -216,7 +219,18 @@ async fn catalog_shell(param: Value) -> Result<(), Error> {
>>       )
>>       .await?;
>>   
>> -    let decoder = pbs_pxar_fuse::Accessor::new(reader, archive_size).await?;
>> +    let decoder = if let Some(payload_archive_name) = payload_archive_name {
>> +        let (payload_reader, _) = helper::get_pxar_fuse_reader(
>> +            &payload_archive_name,
>> +            client.clone(),
>> +            &manifest,
>> +            crypt_config.clone(),
>> +        )
>> +        .await?;
>> +        pbs_pxar_fuse::Accessor::new(reader, archive_size, Some(payload_reader)).await?
>> +    } else {
>> +        pbs_pxar_fuse::Accessor::new(reader, archive_size, None).await?
>> +    };
> 
> we have this exact pattern twice at least as well, once for mount, once
> for the catalog.. in fact, all four calls to helpers::get_pxar_fuse_reader
> are in those two call sites, so probably the helper should just be that
> whole sequence instead (or the existing helper made internal to a new
> helper for this sequence)? :)

Yes, I moved this and the same code for the mount part into a helper 
function and merged this now rather unspectacular patch with the 
previous one also touching the catalog shell restore functionality.
Didn't make sense anymore to keep them separate.

> 
>>   
>>       client.download(CATALOG_NAME, &mut tmpfile).await?;
>>       let index = DynamicIndexReader::new(tmpfile)
>> -- 
>> 2.39.2
>>
>>
>>
>> _______________________________________________
>> pbs-devel mailing list
>> pbs-devel@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries
  2024-04-04 12:54   ` Fabian Grünbichler
@ 2024-04-04 17:13     ` Christian Ebner
  2024-04-05  7:22       ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-04 17:13 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 14:54, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> The helper method allows looking up the entries of a dynamic index
>> which fully cover a given offset range. Further, the helper returns
>> the start padding from the start offset of the dynamic index entry
>> to the start offset of the given range, and the end padding.
>>
>> This will be used to look up size and digest for chunks covering the
>> payload range of a regular file in order to re-use found chunks by
>> indexing them in the archive's index file instead of re-encoding the
>> payload.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - moved this from the dynamic index to the pxar create as suggested
>> - refactored and optimized search, going for linear search to find the
>>    end entry
>> - reworded commit message
>>
>>   pbs-client/src/pxar/create.rs | 63 +++++++++++++++++++++++++++++++++++
>>   1 file changed, 63 insertions(+)
>>
>> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
>> index 2bb5a6253..e2d3954ca 100644
>> --- a/pbs-client/src/pxar/create.rs
>> +++ b/pbs-client/src/pxar/create.rs
>> @@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet};
>>   use std::ffi::{CStr, CString, OsStr};
>>   use std::fmt;
>>   use std::io::{self, Read};
>> +use std::ops::Range;
>>   use std::os::unix::ffi::OsStrExt;
>>   use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
>>   use std::path::{Path, PathBuf};
>> @@ -16,6 +17,7 @@ use nix::fcntl::OFlag;
>>   use nix::sys::stat::{FileStat, Mode};
>>   
>>   use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>> +use pbs_datastore::index::IndexFile;
>>   use proxmox_sys::error::SysError;
>>   use pxar::encoder::{LinkOffset, SeqWrite};
>>   use pxar::Metadata;
>> @@ -25,6 +27,7 @@ use proxmox_lang::c_str;
>>   use proxmox_sys::fs::{self, acl, xattr};
>>   
>>   use pbs_datastore::catalog::BackupCatalogWriter;
>> +use pbs_datastore::dynamic_index::DynamicIndexReader;
>>   
>>   use crate::pxar::metadata::errno_is_unsupported;
>>   use crate::pxar::tools::assert_single_path_component;
>> @@ -791,6 +794,66 @@ impl Archiver {
>>       }
>>   }
>>   
>> +/// Dynamic Entry reusable by payload references
>> +#[derive(Clone, Debug)]
>> +#[repr(C)]
>> +pub struct ReusableDynamicEntry {
>> +    size_le: u64,
>> +    digest: [u8; 32],
>> +}
>> +
>> +impl ReusableDynamicEntry {
>> +    #[inline]
>> +    pub fn size(&self) -> u64 {
>> +        u64::from_le(self.size_le)
>> +    }
>> +
>> +    #[inline]
>> +    pub fn digest(&self) -> [u8; 32] {
>> +        self.digest
>> +    }
>> +}
>> +
>> +/// List of dynamic entries containing the data given by an offset range
>> +fn lookup_dynamic_entries(
>> +    index: &DynamicIndexReader,
>> +    range: Range<u64>,
>> +) -> Result<(Vec<ReusableDynamicEntry>, u64, u64), Error> {
>> +    let end_idx = index.index_count() - 1;
>> +    let chunk_end = index.chunk_end(end_idx);
>> +    let start = index.binary_search(0, 0, end_idx, chunk_end, range.start)?;
>> +    let mut end = start;
>> +    while end < end_idx {
>> +        if range.end < index.chunk_end(end) {
>> +            break;
>> +        }
>> +        end += 1;
>> +    }
> 
> this loop here
> 
>> +
>> +    let offset_first = if start == 0 {
>> +        0
>> +    } else {
>> +        index.chunk_end(start - 1)
>> +    };
> 
> offset_first is prev_end, so maybe we could just name it like that from
> the start?
> 
>> +
>> +    let padding_start = range.start - offset_first;
>> +    let padding_end = index.chunk_end(end) - range.end;
>> +
>> +    let mut indices = Vec::new();
>> +    let mut prev_end = offset_first;
>> +    for dynamic_entry in &index.index()[start..end + 1] {
>> +        let size = dynamic_entry.end() - prev_end;
>> +        let reusable_dynamic_entry = ReusableDynamicEntry {
>> +            size_le: size.to_le(),
>> +            digest: dynamic_entry.digest(),
>> +        };
>> +        prev_end += size;
>> +        indices.push(reusable_dynamic_entry);
>> +    }
> 
> and this one here could probably be combined?
> 
>> +
>> +    Ok((indices, padding_start, padding_end))
>> +}
> 
> e.g., the whole thing could become something like (untested ;)):
> 
>      let end_idx = index.index_count() - 1;
>      let chunk_end = index.chunk_end(end_idx);
>      let start = index.binary_search(0, 0, end_idx, chunk_end, range.start)?;
> 
>      let mut prev_end = if start == 0 {
>          0
>      } else {
>          index.chunk_end(start - 1)
>      };
>      let padding_start = range.start - prev_end;
>      let mut padding_end = 0;
> 
>      let mut indices = Vec::new();
>      for dynamic_entry in &index.index()[start..] {
>          let end = dynamic_entry.end();
>          if range.end < end {
>              padding_end = end - range.end;
>              break;
>          }
> 
>          let reusable_dynamic_entry = ReusableDynamicEntry {
>              size_le: (end - prev_end).to_le(),
>              digest: dynamic_entry.digest(),
>          };
>          indices.push(reusable_dynamic_entry);
>          prev_end = end;
>      }
> 
>      Ok((indices, padding_start, padding_end))

Thanks for looking into this so deeply; unfortunately this version leads
to missing injected chunks in my quick test. Will have a look at where
the problem is tomorrow.
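
At first glance the loop seems to break before pushing the entry which
contains range.end, so the last chunk of the range gets lost. Pushing
first and checking afterwards, roughly like this (untested), should
restore the coverage:

    for dynamic_entry in &index.index()[start..] {
        let end = dynamic_entry.end();

        // always include the current entry; the entry containing
        // range.end has to be part of the result as well
        indices.push(ReusableDynamicEntry {
            size_le: (end - prev_end).to_le(),
            digest: dynamic_entry.digest(),
        });
        prev_end = end;

        if range.end <= end {
            padding_end = end - range.end;
            break;
        }
    }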

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries
  2024-04-04 17:13     ` Christian Ebner
@ 2024-04-05  7:22       ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-05  7:22 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 19:13, Christian Ebner wrote:
> On 4/4/24 14:54, Fabian Grünbichler wrote:
>> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>>> The helper method allows looking up the entries of a dynamic index
>>> which fully cover a given offset range. Further, the helper returns
>>> the start padding from the start offset of the dynamic index entry
>>> to the start offset of the given range, and the end padding.
>>>
>>> This will be used to look up size and digest for chunks covering the
>>> payload range of a regular file in order to re-use found chunks by
>>> indexing them in the archive's index file instead of re-encoding the
>>> payload.
>>>
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> changes since version 2:
>>> - moved this from the dynamic index to the pxar create as suggested
>>> - refactored and optimized search, going for linear search to find the
>>>    end entry
>>> - reworded commit message
>>>
>>>   pbs-client/src/pxar/create.rs | 63 +++++++++++++++++++++++++++++++++++
>>>   1 file changed, 63 insertions(+)
>>>
>>> diff --git a/pbs-client/src/pxar/create.rs 
>>> b/pbs-client/src/pxar/create.rs
>>> index 2bb5a6253..e2d3954ca 100644
>>> --- a/pbs-client/src/pxar/create.rs
>>> +++ b/pbs-client/src/pxar/create.rs
>>> @@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet};
>>>   use std::ffi::{CStr, CString, OsStr};
>>>   use std::fmt;
>>>   use std::io::{self, Read};
>>> +use std::ops::Range;
>>>   use std::os::unix::ffi::OsStrExt;
>>>   use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, 
>>> RawFd};
>>>   use std::path::{Path, PathBuf};
>>> @@ -16,6 +17,7 @@ use nix::fcntl::OFlag;
>>>   use nix::sys::stat::{FileStat, Mode};
>>>   use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, 
>>> PatternFlag};
>>> +use pbs_datastore::index::IndexFile;
>>>   use proxmox_sys::error::SysError;
>>>   use pxar::encoder::{LinkOffset, SeqWrite};
>>>   use pxar::Metadata;
>>> @@ -25,6 +27,7 @@ use proxmox_lang::c_str;
>>>   use proxmox_sys::fs::{self, acl, xattr};
>>>   use pbs_datastore::catalog::BackupCatalogWriter;
>>> +use pbs_datastore::dynamic_index::DynamicIndexReader;
>>>   use crate::pxar::metadata::errno_is_unsupported;
>>>   use crate::pxar::tools::assert_single_path_component;
>>> @@ -791,6 +794,66 @@ impl Archiver {
>>>       }
>>>   }
>>> +/// Dynamic Entry reusable by payload references
>>> +#[derive(Clone, Debug)]
>>> +#[repr(C)]
>>> +pub struct ReusableDynamicEntry {
>>> +    size_le: u64,
>>> +    digest: [u8; 32],
>>> +}
>>> +
>>> +impl ReusableDynamicEntry {
>>> +    #[inline]
>>> +    pub fn size(&self) -> u64 {
>>> +        u64::from_le(self.size_le)
>>> +    }
>>> +
>>> +    #[inline]
>>> +    pub fn digest(&self) -> [u8; 32] {
>>> +        self.digest
>>> +    }
>>> +}
>>> +
>>> +/// List of dynamic entries containing the data given by an offset 
>>> range
>>> +fn lookup_dynamic_entries(
>>> +    index: &DynamicIndexReader,
>>> +    range: Range<u64>,
>>> +) -> Result<(Vec<ReusableDynamicEntry>, u64, u64), Error> {
>>> +    let end_idx = index.index_count() - 1;
>>> +    let chunk_end = index.chunk_end(end_idx);
>>> +    let start = index.binary_search(0, 0, end_idx, chunk_end, 
>>> range.start)?;
>>> +    let mut end = start;
>>> +    while end < end_idx {
>>> +        if range.end < index.chunk_end(end) {
>>> +            break;
>>> +        }
>>> +        end += 1;
>>> +    }
>>
>> this loop here
>>
>>> +
>>> +    let offset_first = if start == 0 {
>>> +        0
>>> +    } else {
>>> +        index.chunk_end(start - 1)
>>> +    };
>>
>> offset_first is prev_end, so maybe we could just name it like that from
>> the start?
>>
>>> +
>>> +    let padding_start = range.start - offset_first;
>>> +    let padding_end = index.chunk_end(end) - range.end;
>>> +
>>> +    let mut indices = Vec::new();
>>> +    let mut prev_end = offset_first;
>>> +    for dynamic_entry in &index.index()[start..end + 1] {
>>> +        let size = dynamic_entry.end() - prev_end;
>>> +        let reusable_dynamic_entry = ReusableDynamicEntry {
>>> +            size_le: size.to_le(),
>>> +            digest: dynamic_entry.digest(),
>>> +        };
>>> +        prev_end += size;
>>> +        indices.push(reusable_dynamic_entry);
>>> +    }
>>
>> and this one here could probably be combined?
>>
>>> +
>>> +    Ok((indices, padding_start, padding_end))
>>> +}
>>
>> e.g., the whole thing could become something like (untested ;)):
>>
>>      let end_idx = index.index_count() - 1;
>>      let chunk_end = index.chunk_end(end_idx);
>>      let start = index.binary_search(0, 0, end_idx, chunk_end, 
>> range.start)?;
>>
>>      let mut prev_end = if start == 0 {
>>          0
>>      } else {
>>          index.chunk_end(start - 1)
>>      };
>>      let padding_start = range.start - prev_end;
>>      let mut padding_end = 0;
>>
>>      let mut indices = Vec::new();
>>      for dynamic_entry in &index.index()[start..] {
>>          let end = dynamic_entry.end();
>>          if range.end < end {
>>              padding_end = end - range.end;
>>              break;
>>          }
>>
>>          let reusable_dynamic_entry = ReusableDynamicEntry {
>>              size_le: (end - prev_end).to_le(),
>>              digest: dynamic_entry.digest(),
>>          };
>>          indices.push(reusable_dynamic_entry);
>>          prev_end = end;
>>      }
>>
>>      Ok((indices, padding_start, padding_end))
> 
> Thanks for looking into this so deeply; unfortunately, this version leads 
> to missing injected chunks in my quick test. Will have a look at where 
> the problem is tomorrow.

Just had to move the pushing of the final chunk to before the end check. 
Will include this in the next version of the patches, thanks a lot for 
the optimization!
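
For reference, the reworked loop then looks something like this
(pushing the entry before the end check; using `<=` so a range ending
exactly on a chunk boundary does not pull in the following chunk):

    for dynamic_entry in &index.index()[start..] {
        let end = dynamic_entry.end();
        let reusable_dynamic_entry = ReusableDynamicEntry {
            size_le: (end - prev_end).to_le(),
            digest: dynamic_entry.digest(),
        };
        indices.push(reusable_dynamic_entry);
        prev_end = end;

        if range.end <= end {
            padding_end = end - range.end;
            break;
        }
    }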





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching Christian Ebner
@ 2024-04-05  7:52   ` Fabian Grünbichler
  2024-04-09  9:12     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  7:52 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:52)
> In preparation for the look-ahead caching used to temporarily store
> entries before encoding them in the pxar archive, so the archiver can
> decide whether to re-use or re-encode regular file entries.
> 
> Allows inserting and storing reused chunks in the archiver,
> deduplicating chunks upon insert when possible.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - Strongly adapted and refactored: keep track also of paddings
>   introduced by reusing the chunks, making a suggestion whether to
>   re-use, re-encode or check next entry based on threshold
> - completely removed code which allowed to calculate offsets based on
>   chunks found in the middle, they must either be a continuation of the
>   end or be added after, otherwise offsets are not monotonically
>   increasing, which is required for sequential restore
> 
>  pbs-client/src/pxar/create.rs | 126 +++++++++++++++++++++++++++++++++-
>  1 file changed, 125 insertions(+), 1 deletion(-)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 335e3556f..95a91a59b 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -20,7 +20,7 @@ use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>  use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
>  use pxar::accessor::aio::Accessor;
> -use pxar::encoder::{LinkOffset, SeqWrite};
> +use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>  use pxar::Metadata;
>  
>  use proxmox_io::vec;
> @@ -36,6 +36,128 @@ use crate::pxar::metadata::errno_is_unsupported;
>  use crate::pxar::tools::assert_single_path_component;
>  use crate::pxar::Flags;
>  
> +const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
> +
> +#[derive(Default)]
> +struct ReusedChunks {
> +    start_boundary: PayloadOffset,
> +    total: PayloadOffset,
> +    padding: u64,
> +    chunks: Vec<(u64, ReusableDynamicEntry)>,
> +    must_flush_first: bool,
> +    suggestion: Suggested,
> +}
> +
> +#[derive(Copy, Clone, Default)]
> +enum Suggested {
> +    #[default]
> +    CheckNext,
> +    Reuse,
> +    Reencode,
> +}

this is a bit of a misnomer - it's not a suggestion, it's what is going to
happen ;) maybe `action` or even `result` or something similar might be a
better fit? or something going into the direction of "decision"?

> +
> +impl ReusedChunks {
> +    fn new() -> Self {
> +        Self::default()
> +    }

this is not needed, can just use default()

> +
> +    fn start_boundary(&self) -> PayloadOffset {
> +        self.start_boundary
> +    }

is this needed? access is only local anyway, and we don't have a similar helper
for self.chunks ;)

> +
> +    fn is_empty(&self) -> bool {
> +        self.chunks.is_empty()
> +    }
> +
> +    fn suggested(&self) -> Suggested {
> +        self.suggestion

same question here..

> +    }
> +
> +    fn insert(
> +        &mut self,
> +        indices: Vec<ReusableDynamicEntry>,
> +        boundary: PayloadOffset,
> +        start_padding: u64,
> +        end_padding: u64,
> +    ) -> PayloadOffset {

another fn that definitely would benefit from some comments describing the
thought processes behind the logic :)
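
e.g. something along the lines of (wording obviously up for debate):

    /// Insert the chunks covering a re-usable file entry into the list of reused chunks.
    ///
    /// If the first of the given chunks continues the last already inserted chunk (same
    /// digest), only the additional padding is accounted for. Otherwise all chunks are
    /// appended, with the start padding attributed to the first and the end padding to
    /// the last chunk. Returns the payload offset at which the file's contents start
    /// within the reused payload stream.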

> +        if self.is_empty() {
> +            self.start_boundary = boundary;
> +        }

okay, so this is actually just set

> +
> +        if let Some(offset) = self.last_digest_matched(&indices) {
> +            if let Some((padding, last)) = self.chunks.last_mut() {

we already know a last chunk must exist (else last_digest_matched wouldn't have
returned). that means last_digest_matched could just return the last chunk?

> +                // Existing chunk, update padding based on pre-existing one
> +                // Start padding is expected to be larger than previous padding

should we validate this expectation? or is it already validated somewhere else?
and also, isn't start_padding by definition always smaller than last.size()?

> +                *padding += start_padding - last.size();
> +                self.padding += start_padding - last.size();

(see below) here we correct the per-chunk padding, but only for a partial chunk that is continued..

if I understand this part correctly, here we basically want to adapt from a
potential end_padding of the last chunk, if it was:

A|A|A|P|P|P|P

and we now have

P|P|P|P|P|B|B

we want to end up with just 2 'P's in the middle? isn't start_padding - size
the count of the payload? so what we actually want is

*padding -= (last.size() - start_padding)

? IMHO that makes the intention much more readable, especially if you factor out

let payload_size = last.size() - start_padding;
*padding -= payload_size;
self.padding -= payload_size;

if we want to be extra careful, we could even add the three checks/assertions here:
- start_padding must be smaller than the chunk size
- both chunk and total padding must be bigger than the payload size
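
untested, but something like:

    // start padding must be smaller than the size of the continued chunk
    debug_assert!(start_padding <= last.size());
    let payload_size = last.size() - start_padding;
    // both chunk and total padding must be bigger than the payload size
    debug_assert!(*padding >= payload_size && self.padding >= payload_size);
    *padding -= payload_size;
    self.padding -= payload_size;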

> +            }
> +
> +            for chunk in indices.into_iter().skip(1) {
> +                self.total = self.total.add(chunk.size());
> +                self.chunks.push((0, chunk));

here we push the second and later chunks with 0 padding

> +            }
> +
> +            if let Some((padding, _last)) = self.chunks.last_mut() {
> +                *padding += end_padding;
> +                self.padding += end_padding;

and here we account for the end_padding of the last chunk, which might actually
be the same as the first chunk, but that works out..

> +            }
> +
> +            let padding_ratio = self.padding as f64 / self.total.raw() as f64;
> +            if self.chunks.len() > 1 && padding_ratio < CHUNK_PADDING_THRESHOLD {
> +                self.suggestion = Suggested::Reuse;
> +            }

see below

> +
> +            self.start_boundary.add(offset + start_padding)
> +        } else {
> +            let offset = self.total.raw();
> +
> +            if let Some(first) = indices.first() {
> +                self.total = self.total.add(first.size());
> +                self.chunks.push((start_padding, first.clone()));
> +                // New chunk, all start padding counts
> +                self.padding += start_padding;
> +            }
> +
> +            for chunk in indices.into_iter().skip(1) {
> +                self.total = self.total.add(chunk.size());
> +                self.chunks.push((chunk.size(), chunk));

so here we count the full chunk size as padding (for the per-chunk stats), but
don't count it for the total padding? I think this should be 0 just like above?

this and the handling of the first chunk could actually be combined:

for (idx, chunk) in indices.into_iter().enumerate() {
	self.total = self.total.add(chunk.size());
	let chunk_padding = if idx == 0 { self.padding += start_padding; start_padding } else { 0 };
	self.chunks.push((chunk_padding, chunk));
}

or we could make start_padding mut, and do

self.padding += start_padding;
for chunk in indices.into_iter() {
	self.total = self.total.add(chunk.size());
	self.chunks.push((start_padding, chunk));
	start_padding = 0; // only first chunk counts
}

> +            }
> +
> +            if let Some((padding, _last)) = self.chunks.last_mut() {
> +                *padding += end_padding;
> +                self.padding += end_padding;
> +            }
> +
> +            if self.chunks.len() > 2 {

so if we insert more than two chunks without a continuation, all of them are
accounted for as full of padding but they are still re-used if start and end
padding are below the threshold ;)

> +                let padding_ratio = self.padding as f64 / self.total.raw() as f64;
> +                if padding_ratio < CHUNK_PADDING_THRESHOLD {
> +                    self.suggestion = Suggested::Reuse;
> +                } else {
> +                    self.suggestion = Suggested::Reencode;
> +                }

we could just return the suggestion instead of storing it - it's only ever needed right after `insert` anyway?
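
e.g. (untested):

    fn insert(
        &mut self,
        indices: Vec<ReusableDynamicEntry>,
        boundary: PayloadOffset,
        start_padding: u64,
        end_padding: u64,
    ) -> (PayloadOffset, Suggested) { .. }

returning the decision alongside the offset, so no state needs to be
kept around between insert and the caller acting on it.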

this calculation above seems to not handle some corner cases though.

 if I have the following sequence

|P|A|A|A|A|A|B|P|P|P|P|P|
|    C1     |    C2     |

where P represent padding parts, A and B are file contents of two files, and C1
and C2 are two chunks. let's say both A and B are re-usable files. first A is
resolved via the index and inserted, but since it's the first chunk, there is
no "suggestion" yet (CheckNext). now B is resolved and inserted - it doesn't
continue a partial previous chunk, so we take the else branch. but now the
(total) padding ratio is skewed because of B, even though A on its own would
have been perfectly fine to be re-used.. (let's say we then insert another
chunk different to C1 and C2, depending on the exact ratios that one might lead
to the whole sequence being re-used or not, even though C2 should not be
re-used, C1 should, and the hypothetical C3 might or might not!)

basically, as soon as we have a break in the chunk sequence (file A followed by
file B, with no chunk overlap) we can treat each part on its own?

> +            }
> +
> +            self.start_boundary.add(offset + start_padding)
> +        }
> +    }
> +
> +    fn last_digest_matched(&self, indices: &[ReusableDynamicEntry]) -> Option<u64> {
> +        let digest = if let Some(first) = indices.first() {
> +            first.digest()
> +        } else {
> +            return None;
> +        };
> +
> +        if let Some(last) = self.chunks.last() {
> +            if last.1.digest() == digest {
> +                return Some(self.total.raw() - last.1.size());
> +            }
> +        }
> +
> +        None
> +    }
> +}
> +
>  /// Pxar options for creating a pxar archive/stream
>  #[derive(Default, Clone)]
>  pub struct PxarCreateOptions {
> @@ -147,6 +269,7 @@ struct Archiver {
>      hardlinks: HashMap<HardLinkInfo, (PathBuf, LinkOffset)>,
>      file_copy_buffer: Vec<u8>,
>      skip_e2big_xattr: bool,
> +    reused_chunks: ReusedChunks,
>      forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>  }
>  
> @@ -239,6 +362,7 @@ where
>          hardlinks: HashMap::new(),
>          file_copy_buffer: vec::undefined(4 * 1024 * 1024),
>          skip_e2big_xattr: options.skip_e2big_xattr,
> +        reused_chunks: ReusedChunks::new(),
>          forced_boundaries,
>      };
>  
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method Christian Ebner
@ 2024-04-05  8:01   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  8:01 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:51)
> Implement a method that prepares the decoder instance to access a
> previous snapshots metadata index and payload index in order to
> pass it to the pxar archiver. The archiver than can utilize these
> to compare the metadata for files to the previous state and gather
> reusable chunks.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - moved checks for reader and manifest to call side as suggested
> - distinguish between previous manifest not having index and error state
> 
>  pbs-client/src/pxar/create.rs     | 14 +++++++-
>  pbs-client/src/pxar/mod.rs        |  2 +-
>  proxmox-backup-client/src/main.rs | 57 +++++++++++++++++++++++++++++--
>  3 files changed, 69 insertions(+), 4 deletions(-)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 2c7867f22..335e3556f 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -19,6 +19,7 @@ use nix::sys::stat::{FileStat, Mode};
>  use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>  use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
> +use pxar::accessor::aio::Accessor;
>  use pxar::encoder::{LinkOffset, SeqWrite};
>  use pxar::Metadata;
>  
> @@ -26,8 +27,9 @@ use proxmox_io::vec;
>  use proxmox_lang::c_str;
>  use proxmox_sys::fs::{self, acl, xattr};
>  
> +use crate::RemoteChunkReader;

nit: (not only here) use statement grouping

>  use pbs_datastore::catalog::BackupCatalogWriter;
> -use pbs_datastore::dynamic_index::DynamicIndexReader;
> +use pbs_datastore::dynamic_index::{DynamicIndexReader, LocalDynamicReadAt};
>  
>  use crate::inject_reused_chunks::InjectChunks;
>  use crate::pxar::metadata::errno_is_unsupported;
> @@ -49,6 +51,16 @@ pub struct PxarCreateOptions {
>      pub skip_e2big_xattr: bool,
>  }
>  
> +/// Stateful information of previous backup snapshots for partial backups
> +pub struct PxarPrevRef {
> +    /// Reference accessor for metadata comparison
> +    pub accessor: Accessor<LocalDynamicReadAt<RemoteChunkReader>>,
> +    /// Reference index for reusing payload chunks
> +    pub payload_index: DynamicIndexReader,
> +    /// Reference archive name for partial backups
> +    pub archive_name: String,
> +}
> +
>  fn detect_fs_type(fd: RawFd) -> Result<i64, Error> {
>      let mut fs_stat = std::mem::MaybeUninit::uninit();
>      let res = unsafe { libc::fstatfs(fd, fs_stat.as_mut_ptr()) };
> diff --git a/pbs-client/src/pxar/mod.rs b/pbs-client/src/pxar/mod.rs
> index b7dcf8362..76652094e 100644
> --- a/pbs-client/src/pxar/mod.rs
> +++ b/pbs-client/src/pxar/mod.rs
> @@ -56,7 +56,7 @@ pub(crate) mod tools;
>  mod flags;
>  pub use flags::Flags;
>  
> -pub use create::{create_archive, PxarCreateOptions, PxarWriters};
> +pub use create::{create_archive, PxarCreateOptions, PxarPrevRef, PxarWriters};
>  pub use extract::{
>      create_tar, create_zip, extract_archive, extract_sub_dir, extract_sub_dir_seq, ErrorHandler,
>      OverwriteFlags, PxarExtractContext, PxarExtractOptions,
> diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
> index 215095ee7..0b747453c 100644
> --- a/proxmox-backup-client/src/main.rs
> +++ b/proxmox-backup-client/src/main.rs
> @@ -21,6 +21,7 @@ use proxmox_router::{cli::*, ApiMethod, RpcEnvironment};
>  use proxmox_schema::api;
>  use proxmox_sys::fs::{file_get_json, image_size, replace_file, CreateOptions};
>  use proxmox_time::{epoch_i64, strftime_local};
> +use pxar::accessor::aio::Accessor;
>  use pxar::accessor::{MaybeReady, ReadAt, ReadAtOperation};
>  
>  use pbs_api_types::{
> @@ -30,7 +31,7 @@ use pbs_api_types::{
>      BACKUP_TYPE_SCHEMA, TRAFFIC_CONTROL_BURST_SCHEMA, TRAFFIC_CONTROL_RATE_SCHEMA,
>  };
>  use pbs_client::catalog_shell::Shell;
> -use pbs_client::pxar::ErrorHandler as PxarErrorHandler;
> +use pbs_client::pxar::{ErrorHandler as PxarErrorHandler, PxarPrevRef};
>  use pbs_client::tools::{
>      complete_archive_name, complete_auth_id, complete_backup_group, complete_backup_snapshot,
>      complete_backup_source, complete_chunk_size, complete_group_or_snapshot,
> @@ -50,7 +51,7 @@ use pbs_client::{
>  };
>  use pbs_datastore::catalog::{BackupCatalogWriter, CatalogReader, CatalogWriter};
>  use pbs_datastore::chunk_store::verify_chunk_size;
> -use pbs_datastore::dynamic_index::{BufferedDynamicReader, DynamicIndexReader};
> +use pbs_datastore::dynamic_index::{BufferedDynamicReader, DynamicIndexReader, LocalDynamicReadAt};
>  use pbs_datastore::fixed_index::FixedIndexReader;
>  use pbs_datastore::index::IndexFile;
>  use pbs_datastore::manifest::{
> @@ -1177,6 +1178,58 @@ async fn create_backup(
>      Ok(Value::Null)
>  }
>  
> +async fn prepare_reference(
> +    target: &str,
> +    manifest: Arc<BackupManifest>,
> +    backup_writer: &BackupWriter,
> +    backup_reader: Arc<BackupReader>,
> +    crypt_config: Option<Arc<CryptConfig>>,
> +) -> Result<Option<PxarPrevRef>, Error> {
> +    let (target, payload_target) = helper::get_pxar_archive_names(target);
> +    let payload_target = payload_target.unwrap_or_default();
> +
> +    let metadata_ref_index = if let Ok(index) = backup_reader
> +        .download_dynamic_index(&manifest, &target)
> +        .await
> +    {
> +        index
> +    } else {
> +        log::info!("No previous metadata index, continue without reference");
> +        return Ok(None);
> +    };
> +
> +    if let Err(_err) = manifest.lookup_file_info(&payload_target) {
> +        log::info!("No previous payload index found in manifest, continue without reference");
> +        return Ok(None);
> +    }

nit: is_err() ;)

> +
> +    let known_payload_chunks = Arc::new(Mutex::new(HashSet::new()));
> +    let payload_ref_index = backup_writer
> +        .download_previous_dynamic_index(&payload_target, &manifest, known_payload_chunks)
> +        .await?;
> +
> +    log::info!("Using previous index as metadata reference for '{target}'");
> +
> +    let most_used = metadata_ref_index.find_most_used_chunks(8);
> +    let file_info = manifest.lookup_file_info(&target)?;
> +    let chunk_reader = RemoteChunkReader::new(
> +        backup_reader.clone(),
> +        crypt_config.clone(),
> +        file_info.chunk_crypt_mode(),
> +        most_used,
> +    );
> +    let reader = BufferedDynamicReader::new(metadata_ref_index, chunk_reader);
> +    let archive_size = reader.archive_size();
> +    let reader = LocalDynamicReadAt::new(reader);
> +    let accessor = Accessor::new(reader, archive_size, None).await?;
> +
> +    Ok(Some(pbs_client::pxar::PxarPrevRef {
> +        accessor,
> +        payload_index: payload_ref_index,
> +        archive_name: target,
> +    }))
> +}
> +
>  async fn dump_image<W: Write>(
>      client: Arc<BackupReader>,
>      crypt_config: Option<Arc<CryptConfig>>,
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison Christian Ebner
@ 2024-04-05  8:08   ` Fabian Grünbichler
  2024-04-05  8:14     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  8:08 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:54)
> Adds a method to compare the metadata of the current file entry
> against the metadata of the entry looked up in the previous backup
> snapshot.
> 
> If the metadata matched, the start offset for the payload stream is
> returned.
> 
> This is in preparation for reusing payload chunks for unchanged files.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - refactored to new padding based threshold
> 
>  pbs-client/src/pxar/create.rs | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 79925bba2..c64084a74 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -21,7 +21,7 @@ use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
>  use pxar::accessor::aio::{Accessor, Directory};
>  use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
> -use pxar::Metadata;
> +use pxar::{EntryKind, Metadata};
>  
>  use proxmox_io::vec;
>  use proxmox_lang::c_str;
> @@ -466,6 +466,35 @@ impl Archiver {
>          .boxed()
>      }
>  
> +    async fn is_reusable_entry(
> +        &mut self,
> +        previous_metadata_accessor: &mut Directory<LocalDynamicReadAt<RemoteChunkReader>>,
> +        file_name: &Path,
> +        stat: &FileStat,
> +        metadata: &Metadata,
> +    ) -> Result<Option<u64>, Error> {
> +        if stat.st_nlink > 1 {
> +            log::debug!("re-encode: {file_name:?} has hardlinks.");
> +            return Ok(None);
> +        }

it would be nice if we had a way to handle those as well.. what's the current
blocker? shouldn't we be able to use the same scheme as for regular archives?

first encounter adds (possibly re-uses) the payload and remembers the offset,
subsequent ones just add another reference/meta entry?
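
e.g. (untested, and glossing over the encoder plumbing) re-using the
archiver's existing hardlink tracking:

    if stat.st_nlink > 1 {
        let link_info = HardLinkInfo {
            st_dev: stat.st_dev,
            st_ino: stat.st_ino,
        };
        if self.hardlinks.contains_key(&link_info) {
            // subsequent encounter: the caller can encode a plain hardlink
            // entry referencing the first occurrence (as for regular
            // archives), no payload re-use lookup needed here
            return Ok(None);
        }
        // first encounter: fall through to the regular lookup below and
        // remember the resulting offset in self.hardlinks afterwards
    }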

> +
> +        if let Some(file_entry) = previous_metadata_accessor.lookup(file_name).await? {
> +            if metadata == file_entry.metadata() {
> +                if let EntryKind::File { payload_offset, .. } = file_entry.entry().kind() {
> +                    log::debug!("possible re-use: {file_name:?} at offset {payload_offset:?} has unchanged metadata.");
> +                    return Ok(*payload_offset);
> +                }
> +                log::debug!("re-encode: {file_name:?} not a regular file.");
> +                return Ok(None);
> +            }
> +            log::debug!("re-encode: {file_name:?} metadata did not match.");
> +            return Ok(None);
> +        }
> +
> +        log::debug!("re-encode: {file_name:?} not found in previous archive.");
> +        Ok(None)
> +    }
> +
>      /// openat() wrapper which allows but logs `EACCES` and turns `ENOENT` into `None`.
>      ///
>      /// The `existed` flag is set when iterating through a directory to note that we know the file
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison
  2024-04-05  8:08   ` Fabian Grünbichler
@ 2024-04-05  8:14     ` Christian Ebner
  2024-04-09 12:52       ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-05  8:14 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 10:08, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2024-03-28 13:36:54)
>> Adds a method to compare the metadata of the current file entry
>> against the metadata of the entry looked up in the previous backup
>> snapshot.
>>
>> If the metadata matched, the start offset for the payload stream is
>> returned.
>>
>> This is in preparation for reusing payload chunks for unchanged files.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - refactored to new padding based threshold
>>
>>   pbs-client/src/pxar/create.rs | 31 ++++++++++++++++++++++++++++++-
>>   1 file changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
>> index 79925bba2..c64084a74 100644
>> --- a/pbs-client/src/pxar/create.rs
>> +++ b/pbs-client/src/pxar/create.rs
>> @@ -21,7 +21,7 @@ use pbs_datastore::index::IndexFile;
>>   use proxmox_sys::error::SysError;
>>   use pxar::accessor::aio::{Accessor, Directory};
>>   use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>> -use pxar::Metadata;
>> +use pxar::{EntryKind, Metadata};
>>   
>>   use proxmox_io::vec;
>>   use proxmox_lang::c_str;
>> @@ -466,6 +466,35 @@ impl Archiver {
>>           .boxed()
>>       }
>>   
>> +    async fn is_reusable_entry(
>> +        &mut self,
>> +        previous_metadata_accessor: &mut Directory<LocalDynamicReadAt<RemoteChunkReader>>,
>> +        file_name: &Path,
>> +        stat: &FileStat,
>> +        metadata: &Metadata,
>> +    ) -> Result<Option<u64>, Error> {
>> +        if stat.st_nlink > 1 {
>> +            log::debug!("re-encode: {file_name:?} has hardlinks.");
>> +            return Ok(None);
>> +        }
> 
> it would be nice if we had a way to handle those as well.. what's the current
> blocker? shouldn't we be able to use the same scheme as for regular archives?
> 
> first encounter adds (possibly re-uses) the payload and remembers the offset,
> subsequent ones just add another reference/meta entry?

True, this is a leftover from the initial approach with the appendix 
section instead of the split archive, where it caused issues.





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching Christian Ebner
@ 2024-04-05  8:33   ` Fabian Grünbichler
  2024-04-09 14:53     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  8:33 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:56)
> Implements the methods to cache entries in a look-ahead cache and
> flush the entries to archive, either by re-using and injecting the
> payload chunks from the previous backup snapshot and storing the
> reference to it, or by re-encoding the chunks.
> 
> When walking the file system tree, check for each entry if it is
> re-usable, meaning that the metadata did not change and the payload
> chunks can be re-indexed instead of re-encoding the whole data.
> Since the ammount of payload data might be small as compared to the

s/ammount/amount

> actual chunk size, a decision whether to re-use or re-encode is
> postponed if the reused payload does not fall below a threshold value,
> but the chunks where continuous.
> In this case, put the entry's file handle an metadata on the cache and

s/an/and/

> enable caching mode, and continue with the next entry.
> Reusable chunk digests and size as well as reference offsets to the
> start of regular files payloads within the payload stream are stored in
> memory, to be injected for re-usable file entries.
> 
> If the threshold value for re-use is reached, the chunks are injected
> in the payload stream and the references with the corresponding offsets
> encoded in the metadata stream.
> If however a non-reusable (because changed) entry is encountered before
> the threshold is reached, the entries on the cache are flushed to the
> archive by re-encoding them, the memorized chunks and payload reference
> offsets are discarted.

s/discarted/discarded/

> 
> Since multiple files might be contained within a single chunk, it is
> assured that the deduplication of chunks is performed also when the
> reuse threshold is reached, by keeping back the last chunk in the
> memorized list, so following files might as well rei-use that chunk.

s/rei/re/

> It is assured that this chunk is however injected in the stream also in
> case that the following lookups lead to a cache clear and re-encoding.
> 
> Directory boundaries are cached as well, and written as part of the
> encoding when flushing.

thanks for adding a high-level description!

> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - completely reworked
> - strongly reduced duplicate code
> 
>  pbs-client/src/pxar/create.rs | 259 ++++++++++++++++++++++++++++++++++
>  1 file changed, 259 insertions(+)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index c64084a74..07fa17ec4 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet, VecDeque};
>  use std::ffi::{CStr, CString, OsStr};
>  use std::fmt;
>  use std::io::{self, Read};
> +use std::mem::size_of;
>  use std::ops::Range;
>  use std::os::unix::ffi::OsStrExt;
>  use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
> @@ -23,6 +24,7 @@ use pxar::accessor::aio::{Accessor, Directory};
>  use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>  use pxar::{EntryKind, Metadata};
>  
> +use proxmox_human_byte::HumanByte;
>  use proxmox_io::vec;
>  use proxmox_lang::c_str;
>  use proxmox_sys::fs::{self, acl, xattr};
> @@ -32,6 +34,7 @@ use pbs_datastore::catalog::BackupCatalogWriter;
>  use pbs_datastore::dynamic_index::{DynamicIndexReader, LocalDynamicReadAt};
>  
>  use crate::inject_reused_chunks::InjectChunks;
> +use crate::pxar::look_ahead_cache::{CacheEntry, CacheEntryData};
>  use crate::pxar::metadata::errno_is_unsupported;
>  use crate::pxar::tools::assert_single_path_component;
>  use crate::pxar::Flags;
> @@ -274,6 +277,12 @@ struct Archiver {
>      reused_chunks: ReusedChunks,
>      previous_payload_index: Option<DynamicIndexReader>,
>      forced_boundaries: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
> +    cached_entries: Vec<CacheEntry>,
> +    caching_enabled: bool,
> +    total_injected_size: u64,
> +    total_injected_count: u64,
> +    partial_chunks_count: u64,
> +    total_reused_payload_size: u64,
>  }
>  
>  type Encoder<'a, T> = pxar::encoder::aio::Encoder<'a, T>;
> @@ -377,6 +386,12 @@ where
>          reused_chunks: ReusedChunks::new(),
>          previous_payload_index,
>          forced_boundaries,
> +        cached_entries: Vec::new(),
> +        caching_enabled: false,
> +        total_injected_size: 0,
> +        total_injected_count: 0,
> +        partial_chunks_count: 0,
> +        total_reused_payload_size: 0,
>      };
>  
>      archiver
> @@ -879,6 +894,250 @@ impl Archiver {
>          }
>      }
>  
> +    async fn cache_or_flush_entries<T: SeqWrite + Send>(
> +        &mut self,
> +        encoder: &mut Encoder<'_, T>,
> +        previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
> +        c_file_name: &CStr,
> +        stat: &FileStat,
> +        fd: OwnedFd,
> +        metadata: &Metadata,
> +    ) -> Result<(), Error> {
> +        let file_name: &Path = OsStr::from_bytes(c_file_name.to_bytes()).as_ref();
> +        let reusable = if let Some(accessor) = previous_metadata_accessor {
> +            self.is_reusable_entry(accessor, file_name, stat, metadata)
> +                .await?
> +        } else {
> +            None
> +        };
> +
> +        let file_size = stat.st_size as u64;

couldn't we get this via is_reusable?

> +        if let Some(start_offset) = reusable {
> +            if let Some(ref ref_payload_index) = self.previous_payload_index {
> +                let payload_size = file_size + size_of::<pxar::format::Header>() as u64;

or better yet, get this here directly ;)

> +                let end_offset = start_offset + payload_size;

or better yet, this one here ;)

> +                let (indices, start_padding, end_padding) =
> +                    lookup_dynamic_entries(ref_payload_index, start_offset..end_offset)?;

or better yet, just return the Range in the payload archive? :)
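
e.g. (untested) with the size taken from the previous entry's kind, the
whole range could already be derived within is_reusable_entry:

    if let EntryKind::File {
        payload_offset: Some(payload_offset),
        size,
        ..
    } = file_entry.entry().kind()
    {
        let start = *payload_offset;
        let end = start + *size + size_of::<pxar::format::Header>() as u64;
        return Ok(Some(start..end));
    }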

> +
> +                let boundary = encoder.payload_position()?;
> +                let offset =
> +                    self.reused_chunks
> +                        .insert(indices, boundary, start_padding, end_padding);
> +
> +                self.caching_enabled = true;
> +                let cache_entry = CacheEntry::RegEntry(CacheEntryData::new(
> +                    fd,
> +                    c_file_name.into(),
> +                    *stat,
> +                    metadata.clone(),
> +                    offset,
> +                ));
> +                self.cached_entries.push(cache_entry);
> +
> +                match self.reused_chunks.suggested() {
> +                    Suggested::Reuse => self.flush_cached_to_archive(encoder, true, true).await?,
> +                    Suggested::Reencode => {
> +                        self.flush_cached_to_archive(encoder, false, true).await?
> +                    }
> +                    Suggested::CheckNext => {}
> +                }
> +
> +                return Ok(());
> +            }
> +        }
> +
> +        self.flush_cached_to_archive(encoder, false, true).await?;
> +        self.add_entry(encoder, previous_metadata_accessor, fd.as_raw_fd(), c_file_name, stat)
> +            .await

this part here is where I think we mishandle some edge cases, like mentioned in
the ReusedChunks patch comments.. even keeping back the last chunk doesn't save
us from losing some re-usable files sometimes..

> +    }
> +
> +    async fn flush_cached_to_archive<T: SeqWrite + Send>(
> +        &mut self,
> +        encoder: &mut Encoder<'_, T>,
> +        reuse_chunks: bool,
> +        keep_back_last_chunk: bool,
> +    ) -> Result<(), Error> {
> +        let entries = std::mem::take(&mut self.cached_entries);
> +
> +        if !reuse_chunks {
> +            self.clear_cached_chunks(encoder)?;
> +        }
> +
> +        for entry in entries {
> +            match entry {
> +                CacheEntry::RegEntry(CacheEntryData {
> +                    fd,
> +                    c_file_name,
> +                    stat,
> +                    metadata,
> +                    payload_offset,
> +                }) => {
> +                    self.add_entry_to_archive(
> +                        encoder,
> +                        &mut None,
> +                        &c_file_name,
> +                        &stat,
> +                        fd,
> +                        &metadata,
> +                        reuse_chunks,
> +                        Some(payload_offset),
> +                    )
> +                    .await?
> +                }
> +                CacheEntry::DirEntry(CacheEntryData {
> +                    c_file_name,
> +                    metadata,
> +                    ..
> +                }) => {
> +                    if let Some(ref catalog) = self.catalog {
> +                        catalog.lock().unwrap().start_directory(&c_file_name)?;
> +                    }
> +                    let dir_name = OsStr::from_bytes(c_file_name.to_bytes());
> +                    encoder.create_directory(dir_name, &metadata).await?;
> +                }
> +                CacheEntry::DirEnd => {
> +                    encoder.finish().await?;
> +                    if let Some(ref catalog) = self.catalog {
> +                        catalog.lock().unwrap().end_directory()?;
> +                    }
> +                }
> +            }
> +        }
> +
> +        self.caching_enabled = false;
> +
> +        if reuse_chunks {
> +            self.flush_reused_chunks(encoder, keep_back_last_chunk)?;
> +        }
> +
> +        Ok(())
> +    }
> +
> +    fn flush_reused_chunks<T: SeqWrite + Send>(
> +        &mut self,
> +        encoder: &mut Encoder<'_, T>,
> +        keep_back_last_chunk: bool,
> +    ) -> Result<(), Error> {
> +        let mut reused_chunks = std::mem::take(&mut self.reused_chunks);
> +
> +        // Do not inject the last reused chunk directly, but keep it as base for further entries
> +        // to reduce chunk duplication. Needs to be flushed even on cache clear!
> +        let last_chunk = if keep_back_last_chunk {
> +            reused_chunks.chunks.pop()
> +        } else {
> +            None
> +        };
> +
> +        let mut injection_boundary = reused_chunks.start_boundary();
> +        let payload_writer_position = encoder.payload_position()?.raw();
> +
> +        if !reused_chunks.chunks.is_empty() && injection_boundary.raw() != payload_writer_position {
> +            bail!(
> +                "encoder payload writer position out of sync: got {payload_writer_position}, expected {}",
> +                injection_boundary.raw(),
> +            );
> +        }
> +
> +        for chunks in reused_chunks.chunks.chunks(128) {
> +            let mut chunk_list = Vec::with_capacity(128);
> +            let mut size = PayloadOffset::default();
> +            for (padding, chunk) in chunks.iter() {
> +                log::debug!(
> +                    "Injecting chunk with {} padding (chunk size {})",
> +                    HumanByte::from(*padding),
> +                    HumanByte::from(chunk.size()),
> +                );
> +                self.total_injected_size += chunk.size();
> +                self.total_injected_count += 1;
> +                if *padding > 0 {
> +                    self.partial_chunks_count += 1;
> +                }
> +                size = size.add(chunk.size());
> +                chunk_list.push(chunk.clone());
> +            }
> +
> +            let inject_chunks = InjectChunks {
> +                boundary: injection_boundary.raw(),
> +                chunks: chunk_list,
> +                size: size.raw() as usize,
> +            };
> +
> +            if let Some(boundary) = self.forced_boundaries.as_mut() {
> +                let mut boundary = boundary.lock().unwrap();
> +                boundary.push_back(inject_chunks);
> +            } else {
> +                bail!("missing injection queue");
> +            };
> +
> +            injection_boundary = injection_boundary.add(size.raw());
> +            encoder.advance(size)?;
> +        }
> +
> +        if let Some((padding, chunk)) = last_chunk {
> +            // Make sure that we flush this chunk even on clear calls
> +            self.reused_chunks.must_flush_first = true;

might make sense to rename this one to "must_flush_first_chunk", else the other
call sites might be interpreted as "must flush (all) chunks first"

> +            let _offset = self
> +                .reused_chunks
> +                .insert(vec![chunk], injection_boundary, padding, 0);
> +        }
> +
> +        Ok(())
> +    }
> +
> +    fn clear_cached_chunks<T: SeqWrite + Send>(
> +        &mut self,
> +        encoder: &mut Encoder<'_, T>,
> +    ) -> Result<(), Error> {
> +        let reused_chunks = std::mem::take(&mut self.reused_chunks);
> +
> +        if !reused_chunks.must_flush_first {
> +            return Ok(());
> +        }
> +
> +        // First chunk was kept back to avoid duplication but needs to be injected
> +        let injection_boundary = reused_chunks.start_boundary();
> +        let payload_writer_position = encoder.payload_position()?.raw();
> +
> +        if !reused_chunks.chunks.is_empty() && injection_boundary.raw() != payload_writer_position {
> +            bail!(
> +                "encoder payload writer position out of sync: got {payload_writer_position}, expected {}",
> +                injection_boundary.raw()
> +            );
> +        }
> +
> +        if let Some((padding, chunk)) = reused_chunks.chunks.first() {
> +            let size = PayloadOffset::default().add(chunk.size());
> +            log::debug!(
> +                "Injecting chunk with {} padding (chunk size {})",
> +                HumanByte::from(*padding),
> +                HumanByte::from(chunk.size()),
> +            );
> +            let inject_chunks = InjectChunks {
> +                boundary: injection_boundary.raw(),
> +                chunks: vec![chunk.clone()],
> +                size: size.raw() as usize,
> +            };
> +
> +            self.total_injected_size += size.raw();
> +            self.total_injected_count += 1;
> +            if *padding > 0 {
> +                self.partial_chunks_count += 1;
> +            }
> +
> +            if let Some(boundary) = self.forced_boundaries.as_mut() {
> +                let mut boundary = boundary.lock().unwrap();
> +                boundary.push_back(inject_chunks);
> +            } else {
> +                bail!("missing injection queue");
> +            };
> +            encoder.advance(size)?;

this part is basically the loop in flush_reused_chunks and could be de-duplicated in some fashion..
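
e.g. (untested, name made up) something like:

    fn inject_chunk_batch<T: SeqWrite + Send>(
        &mut self,
        encoder: &mut Encoder<'_, T>,
        boundary: PayloadOffset,
        chunks: &[(u64, ReusableDynamicEntry)],
    ) -> Result<PayloadOffset, Error> {
        let mut size = PayloadOffset::default();
        let mut chunk_list = Vec::with_capacity(chunks.len());
        for (padding, chunk) in chunks {
            self.total_injected_size += chunk.size();
            self.total_injected_count += 1;
            if *padding > 0 {
                self.partial_chunks_count += 1;
            }
            size = size.add(chunk.size());
            chunk_list.push(chunk.clone());
        }

        let inject_chunks = InjectChunks {
            boundary: boundary.raw(),
            chunks: chunk_list,
            size: size.raw() as usize,
        };

        if let Some(queue) = self.forced_boundaries.as_mut() {
            queue.lock().unwrap().push_back(inject_chunks);
        } else {
            bail!("missing injection queue");
        }

        encoder.advance(size)?;
        // the next batch starts right after the injected bytes
        Ok(boundary.add(size.raw()))
    }

with flush_reused_chunks looping over reused_chunks.chunks.chunks(128)
and clear_cached_chunks calling it once for the first chunk.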

> +        } else {
> +            bail!("missing first chunk");
> +        }
> +
> +        Ok(())
> +    }
> +
>      async fn add_directory<T: SeqWrite + Send>(
>          &mut self,
>          encoder: &mut Encoder<'_, T>,
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata Christian Ebner
@ 2024-04-05  9:42   ` Fabian Grünbichler
  2024-04-05 10:49     ` Dietmar Maurer
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  9:42 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:58)
> Use double the average chunk size for the metadata archive as compared
> to the payload stream. This not only reduces the number of unique
> chunks produced by the metadata archive, which chunks poorly because
> it mostly consists of many small localized changes, but also has the
> positive side effect of producing larger, well compressible chunks.
> The reduced number of chunks further improves access performance due
> to fewer download requests and better cachability.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  proxmox-backup-client/src/main.rs | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
> index 66dcaa63e..4aad0ff8c 100644
> --- a/proxmox-backup-client/src/main.rs
> +++ b/proxmox-backup-client/src/main.rs
> @@ -78,6 +78,8 @@ pub(crate) use helper::*;
>  pub mod key;
>  pub mod namespace;
>  
> +const AVG_METADATA_CHUNK_SIZE: usize = 8 * 1024 * 1024;
> +
>  fn record_repository(repo: &BackupRepository) {
>      let base = match BaseDirectories::with_prefix("proxmox-backup") {
>          Ok(v) => v,
> @@ -209,7 +211,15 @@ async fn backup_directory<P: AsRef<Path>>(
>          payload_target.is_some(),
>      )?;
>  
> -    let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size, None);
> +    let avg_chunk_size = if payload_stream.is_none() {
> +        chunk_size
> +    } else {
> +        chunk_size
> +            .map(|size| 2 * size)

what if the user provided us with a very small chunk size? should we have a lower bound here?
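
e.g. (untested, the MIN_ constant name is made up):

    let avg_chunk_size = if payload_stream.is_none() {
        chunk_size
    } else {
        chunk_size
            .map(|size| (2 * size).max(MIN_AVG_METADATA_CHUNK_SIZE))
            .or(Some(AVG_METADATA_CHUNK_SIZE))
    };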

I still wonder whether getting rid of the sliding window chunker wouldn't be a
net benefit for the split archive case. for the metadata stream it probably
doesn't matter much (it has a lot of churn, is small and compresses well).

for the payload stream simply accumulating 1..N files (or rather, their
contents) in a chunk until a certain size threshold is reached might perform
better (as in, both be faster than the current chunker, and give us more/better
re-usable chunks).
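
e.g. (very rough, untested, names made up) a cutter that only ever cuts
at file boundaries once a size threshold is reached:

    struct PayloadBoundaryChunker {
        /// desired minimum chunk size
        threshold: usize,
        /// bytes accumulated since the last cut
        current: usize,
    }

    impl PayloadBoundaryChunker {
        /// called at each file boundary with the size of the file just
        /// written; returns true if a chunk boundary should be emitted
        fn should_cut(&mut self, file_size: usize) -> bool {
            self.current += file_size;
            if self.current >= self.threshold {
                self.current = 0;
                true
            } else {
                false
            }
        }
    }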

> +            .or_else(|| Some(AVG_METADATA_CHUNK_SIZE))
> +    };
> +
> +    let mut chunk_stream = ChunkStream::new(pxar_stream, avg_chunk_size, None);
>      let (tx, rx) = mpsc::channel(10); // allow to buffer 10 chunks
>  
>      let stream = ReceiverStream::new(rx).map_err(Error::from);
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output Christian Ebner
@ 2024-04-05  9:47   ` Fabian Grünbichler
  2024-04-10 10:00     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  9:47 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:37:00)
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  pbs-client/src/pxar/create.rs | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index f103127c4..461509c39 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -407,6 +407,14 @@ where
>      encoder.finish().await?;
>      encoder.close().await?;
>  
> +    log::info!(
> +        "Total injected: {} ({} chunks), total reused payload: {}, padding: {} ({} partial chunks)",

we already discussed this off-list, but something like

Change detection: processed XX files, YY unmodified (reused AA GB data + BB GB padding = CC GB total in DD chunks)

and only printing it with change detection metadata is probably easier to
understand, but maybe we should get some more feedback on that as well :)
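
e.g. (untested, the two file counters don't exist yet and would need to
be tracked in the archiver as well):

    log::info!(
        "Change detection: processed {} files, {} unmodified (reused {} data + {} padding = {} total in {} chunks)",
        archiver.total_files_count,
        archiver.unmodified_files_count,
        HumanByte::from(archiver.total_reused_payload_size),
        HumanByte::from(archiver.total_injected_size - archiver.total_reused_payload_size),
        HumanByte::from(archiver.total_injected_size),
        archiver.total_injected_count,
    );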

> +        HumanByte::from(archiver.total_injected_size),
> +        archiver.total_injected_count,
> +        HumanByte::from(archiver.total_reused_payload_size),
> +        HumanByte::from(archiver.total_injected_size - archiver.total_reused_payload_size),
> +        archiver.partial_chunks_count,
> +    );
>      Ok(())
>  }
>  
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams Christian Ebner
@ 2024-04-05  9:49   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05  9:49 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:37:02)
> Instead of encoding the pxar cli exclude patterns as a regular file
> within the root directory of an archive, store this information
> directly after the pxar format version entry in a new pxar cli params
> entry.
> 
> This behaviour is, however, currently exclusive to archives written
> with format version 2 in the split metadata and payload case.
> 
> This is a breaking change for the encoding of new cli exclude
> parameters. Any new exclude parameter will not be added to an already
> present .pxar-cliexclude file, and it will not be created if not
> present.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  pbs-client/src/pxar/create.rs             | 25 +++++++++++++++--------
>  pbs-client/src/pxar/extract.rs            |  3 ++-
>  pbs-client/src/pxar/tools.rs              |  6 ++++++
>  src/tape/file_formats/snapshot_archive.rs |  8 ++++++--
>  4 files changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 461509c39..5f2270fe8 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -343,13 +343,6 @@ where
>          set.insert(stat.st_dev);
>      }
>  
> -    let mut encoder = Encoder::new(
> -        &mut writers.writer,
> -        &metadata,
> -        writers.payload_writer.as_mut(),
> -    )
> -    .await?;
> -
>      let mut patterns = options.patterns;
>  
>      if options.skip_lost_and_found {
> @@ -359,6 +352,14 @@ where
>              MatchType::Exclude,
>          )?);
>      }
> +
> +    let cli_params_content = generate_pxar_excludes_cli(&patterns[..]);
> +    let cli_params = if options.previous_ref.is_some() {
> +        Some(cli_params_content.as_slice())
> +    } else {
> +        None
> +    };
> +
>      let (previous_payload_index, previous_metadata_accessor) =
>          if let Some(refs) = options.previous_ref {
>              (
> @@ -369,6 +370,14 @@ where
>              (None, None)
>          };
>  
> +    let mut encoder = Encoder::new(
> +        &mut writers.writer,
> +        &metadata,
> +        writers.payload_writer.as_mut(),
> +        cli_params,
> +    )
> +    .await?;
> +
>      let mut archiver = Archiver {
>          feature_flags,
>          fs_feature_flags,
> @@ -454,7 +463,7 @@ impl Archiver {
>  
>              let mut file_list = self.generate_directory_file_list(&mut dir, is_root)?;
>  
> -            if is_root && old_patterns_count > 0 {
> +            if is_root && old_patterns_count > 0 && previous_metadata_accessor.is_none() {
>                  file_list.push(FileListEntry {
>                      name: CString::new(".pxarexclude-cli").unwrap(),
>                      path: PathBuf::new(),
> diff --git a/pbs-client/src/pxar/extract.rs b/pbs-client/src/pxar/extract.rs
> index 56f8d7adc..46ff8fc80 100644
> --- a/pbs-client/src/pxar/extract.rs
> +++ b/pbs-client/src/pxar/extract.rs
> @@ -267,7 +267,8 @@ where
>          };
>  
>          let extract_res = match (did_match, entry.kind()) {
> -            (_, EntryKind::Version(_)) => Ok(()),
> +            (_, EntryKind::Version(_version)) => Ok(()),
> +            (_, EntryKind::CliParams(_data)) => Ok(()),
>              (_, EntryKind::Directory) => {
>                  self.callback(entry.path());
>  
> diff --git a/pbs-client/src/pxar/tools.rs b/pbs-client/src/pxar/tools.rs
> index 4e9bd5b60..478acdc0f 100644
> --- a/pbs-client/src/pxar/tools.rs
> +++ b/pbs-client/src/pxar/tools.rs
> @@ -173,6 +173,12 @@ pub fn format_multi_line_entry(entry: &Entry) -> String {
>  
>      let (size, link, type_name, payload_offset) = match entry.kind() {
>          EntryKind::Version(version) => (format!("{version:?}"), String::new(), "version", None),
> +        EntryKind::CliParams(params) => (
> +            "0".to_string(),
> +            format!(" -> {:?}", params.as_os_str()),
> +            "symlink",
> +            None,

this seems wrong ;)

> +        ),
>          EntryKind::File {
>              size,
>              payload_offset,
> diff --git a/src/tape/file_formats/snapshot_archive.rs b/src/tape/file_formats/snapshot_archive.rs
> index 43d1cf9c3..7e052919b 100644
> --- a/src/tape/file_formats/snapshot_archive.rs
> +++ b/src/tape/file_formats/snapshot_archive.rs
> @@ -58,8 +58,12 @@ pub fn tape_write_snapshot_archive<'a>(
>              ));
>          }
>  
> -        let mut encoder =
> -            pxar::encoder::sync::Encoder::new(PxarTapeWriter::new(writer), &root_metadata, None)?;
> +        let mut encoder = pxar::encoder::sync::Encoder::new(
> +            PxarTapeWriter::new(writer),
> +            &root_metadata,
> +            None,
> +            None,
> +        )?;
>  
>          for filename in file_list.iter() {
>              let mut file = snapshot_reader.open_file(filename).map_err(|err| {
> -- 
> 2.39.2
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
>




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection Christian Ebner
@ 2024-04-05 10:16   ` Fabian Grünbichler
  2024-04-10 10:04     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05 10:16 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:37:03)
> A high level flow chart describing the logic used for the metadata
> based file change detection mode.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  ...ow-chart-metadata-based-file-change-detection.svg |  1 +
>  ...ow-chart-metadata-based-file-change-detection.txt | 12 ++++++++++++
>  2 files changed, 13 insertions(+)
>  create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
>  create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
> 
> diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
> new file mode 100644
> index 000000000..5e6df4815
> --- /dev/null
> +++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
> @@ -0,0 +1 @@

[snip]

something here got broken (I guess mail related somewhere along the way?). it
does work if the contents are manually merged back into a single line before
applying the patch though. in any case, it probably would be nice to have this
autogenerated and moved to some part of the docs :)

> diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
> new file mode 100644
> index 000000000..5eace70be
> --- /dev/null
> +++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
> @@ -0,0 +1,12 @@
> +flowchart TD
> +    A[Archiver] -->|lookup metadata| B[Accessor]
> +    B -->|is reusable entry| C[Lookahead Cache]
> +    C -->|lookup reusable chunks| D[Dynamic Index]
> +    D -->|insert and deduplicate dynamic entries| E[Reused Chunks]
> +    B -->|is not reusable entry| F(re-encode cached entries and current entry)
> +    F -->|caching disabled| A
> +    E -->|padding above threshold, non-continuous chunks, caching disabled| F
> +    E -->|padding above threshold, chunks continuous, caching enabled| A
> +    E -->|padding below threshold| G(force boundary, inject chunks, keepback last chunk for potential followup)
> +    G -->|caching enabled| A

the caching enabled/disabled parts here are probably confusing (do those edges
mean caching is enabled/disabled at that point? or are they taken if it is
enabled/disabled?)

but probably it makes sense to re-visit this in detail once the dust has settled :)

> +
> -- 
> 2.39.2
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector
  2024-04-04 14:24   ` Fabian Grünbichler
@ 2024-04-05 10:26     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-05 10:26 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 16:24, Fabian Grünbichler wrote:
> 
> but I'd like the following even better, since it allows us to get rid of
> the buffer altogether:
> 
>      fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
>          let mut this = self.project();
> 
>          let mut injections = this.injection_queue.lock().unwrap();
> 
>          // check whether we have something to inject
>          if let Some(inject) = injections.pop_front() {
>              let offset = this.stream_len.load(Ordering::SeqCst) as u64;
> 
>              if inject.boundary == offset {
>                  // inject now
>                  let mut chunks = Vec::new();
>                  let mut csum = this.index_csum.lock().unwrap();
> 
>                  // account for injected chunks
>                  for chunk in inject.chunks {
>                      let offset = this
>                          .stream_len
>                          .fetch_add(chunk.size() as usize, Ordering::SeqCst)
>                          as u64;
>                      this.reused_len
>                          .fetch_add(chunk.size() as usize, Ordering::SeqCst);
>                      let digest = chunk.digest();
>                      chunks.push((offset, digest));
>                      let end_offset = offset + chunk.size();
>                      csum.update(&end_offset.to_le_bytes());
>                      csum.update(&digest);
>                  }
>                  let chunk_info = InjectedChunksInfo::Known(chunks);
>                  return Poll::Ready(Some(Ok(chunk_info)));
>              } else if inject.boundary < offset {
>                  // incoming new chunks and injections didn't line up?
>                  return Poll::Ready(Some(Err(anyhow!("invalid injection boundary"))));
>              } else {
>                  // inject later
>                  injections.push_front(inject);
>              }
>          }
> 
>          // nothing to inject now, let's see if there's further input
>          match ready!(this.input.as_mut().poll_next(cx)) {
>              None => Poll::Ready(None),
>              Some(Err(err)) => Poll::Ready(Some(Err(err))),
>              Some(Ok(raw)) if raw.is_empty() => {
>                  Poll::Ready(Some(Err(anyhow!("unexpected empty raw data"))))
>              }
>              Some(Ok(raw)) => {
>                  let offset = this.stream_len.fetch_add(raw.len(), Ordering::SeqCst) as u64;
>                  let data = InjectedChunksInfo::Raw((offset, raw));
> 
>                  Poll::Ready(Some(Ok(data)))
>              }
>          }
>      }
> 
> but technically all this accounting could move back to the backup_writer
> as well, if the injected chunk info also contained the size..
> 

Yes, this is much more compact! Also, moving this to the backup writer 
as suggested should allow reducing the code there even further; at 
least from the initial refactoring it seems to behave just fine.





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files Christian Ebner
@ 2024-04-05 10:26   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05 10:26 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:37:04)
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  docs/file-formats.rst         | 32 ++++++++++++++++++++++
>  docs/meta-format-overview.dot | 50 +++++++++++++++++++++++++++++++++++
>  2 files changed, 82 insertions(+)
>  create mode 100644 docs/meta-format-overview.dot
> 
> diff --git a/docs/file-formats.rst b/docs/file-formats.rst
> index 43ecfefce..292660579 100644
> --- a/docs/file-formats.rst
> +++ b/docs/file-formats.rst
> @@ -8,6 +8,38 @@ Proxmox File Archive Format (``.pxar``)
>  
>  .. graphviz:: pxar-format-overview.dot
>  
> +.. _pxar-meta-format:
> +
> +Proxmox File Archive Format - Meta (``.mpxar``)
> +-----------------------------------------------
> +
> +.. graphviz:: meta-format-overview.dot
> +
> +.. _ppxar-format:
> +
> +Proxmox File Archive Format - Payload (``.ppxar``)
> +--------------------------------------------------
> +
> +The pxar payload contains a concatenation of regular file payloads,
> +each prefixed by a `PAYLOAD` header. Further, the entries can have
> +some padding following the actual payload, if a referenced chunk was
> +not fully reused:
> +
> +.. list-table::
> +   :widths: auto
> +
> +   * - ``PAYLOAD_START_MARKER``
> +     - ``[u8; 16]``
> +   * - ``PAYLOAD``
> +     - ``header with [u8; 16]``
> +   * - ``Payload``
> +     - ``raw regular file payload``
> +   * - ``Padding``
> +     - ``none if chunk fully reused``

well.. technically somehow true, but for a format description this only gives
half the picture IMHO ;) at least mentioning that the Padding itself consists
of (possibly truncated) ``PAYLOAD`` entries would probably help new people
trying to understand the structure. also, mentioning what the 16 bytes in the
header mean would be helpful?

it might also be nice to dump the magic values and include them here and in the
other related sections, like for the blobs (but always correct, not as a
possibly outdated copy ;))

> +   * - ``...``
> +     - ``Further list of header, payload and padding``
> +   * - ``PAYLOAD_TAIL_MARKER``
> +     - ``[u8; 16]``
>  
>  .. _data-blob-format:
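
to make the described structure concrete, here a minimal sketch of walking
that layout; the assumption that the 16 header bytes form a little-endian
(htype: u64, size: u64) pair with `size` including the header itself is a
guess on my part, not taken from the patch:

```rust
// hypothetical reader for the .ppxar layout described above
fn walk_ppxar(data: &[u8]) -> Result<usize, String> {
    const MARKER_LEN: usize = 16; // PAYLOAD_START/TAIL_MARKER
    const HEADER_LEN: usize = 16; // assumed: (htype: u64, size: u64), LE

    if data.len() < 2 * MARKER_LEN {
        return Err("file too short for start/tail markers".into());
    }
    let mut pos = MARKER_LEN; // skip PAYLOAD_START_MARKER
    let end = data.len() - MARKER_LEN; // stop before PAYLOAD_TAIL_MARKER
    let mut entries = 0;

    while pos < end {
        if pos + HEADER_LEN > end {
            return Err("truncated PAYLOAD header".into());
        }
        let size_bytes: [u8; 8] = data[pos + 8..pos + 16].try_into().unwrap();
        let size = u64::from_le_bytes(size_bytes) as usize;
        if size < HEADER_LEN || pos + size > end {
            // truncated PAYLOAD entries used as padding would need
            // extra handling here
            return Err(format!("invalid entry size {size} at offset {pos}"));
        }
        pos += size; // header plus raw file payload
        entries += 1;
    }
    Ok(entries)
}
```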




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata
  2024-04-05  9:42   ` Fabian Grünbichler
@ 2024-04-05 10:49     ` Dietmar Maurer
  2024-04-08  8:28       ` Fabian Grünbichler
  0 siblings, 1 reply; 122+ messages in thread
From: Dietmar Maurer @ 2024-04-05 10:49 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion,
	Fabian Grünbichler, Christian Ebner

> for the payload stream simple accumulating 1..N files (or rather, their
> contents) in a chunk until a certain size threshold is reached might perform
> better (as in, both be faster than the current chunker, and give us more/better
> re-usable chunks).

Sorry, but that way you would never reuse any chunks! How is
that supposed to work?




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode Christian Ebner
@ 2024-04-05 11:22   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05 11:22 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:37:05)
> Describe the motivation and basic principle of the clients change
> detection mode and show an example invocation.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  docs/backup-client.rst | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/docs/backup-client.rst b/docs/backup-client.rst
> index 00a1abbb3..e54a52bf4 100644
> --- a/docs/backup-client.rst
> +++ b/docs/backup-client.rst
> @@ -280,6 +280,39 @@ Multiple paths can be excluded like this:
>  
>      # proxmox-backup-client backup.pxar:./linux --exclude=/usr --exclude=/rust
>  
> +.. _client_change_detection_mode:
> +
> +Change detection mode
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Backing up filesystems with large contents can take a long time, as the default
> +behaviour for the Proxmox Backup Client is to read all data and re-encode it
> +into chunks. For some usecases, where files do not change frequently this is not
> +feasible and undesired.

Maybe

File-based backups containing a lot of data can

and maybe also add something along the lines of "before being able to decide
whether those chunks need uploading or are already available on the server."

> +
> +In order to instruct the client to not re-encode files with unchanged metadata,

re-read (as that is what gives us the performance bonus ;))

> +the `change-detection-mode` can be set from the default `data` to `metadata`.
> +By this, regular file payloads for files with unchanged metadata are looked up

s/By this/In this mode

do we want to explain "payload" here? or just use "contents" or "data"?

> +and re-used from the previous backup runs snapshot when possible. For this to

s/runs//

> +be feasible, the pxar archives for backup runs using this mode are split into

s/ runs/s/

> +two separate files, the `mpxar` containing the archives metadata and the `ppxar`
> +containing a concatenation of the file payloads.
> +
> +During backup, the current file metadata is compared to the one looked up in the

During the backup 

or maybe

When creating the backup archives, 

> +previous `mpxar` archive, and if unchanged, the payload of the file is included
> +in the current backup by referencing the indices of the previous snaphshot. The

by reusing the chunks containing that payload in the previous snapshot.

> +increase in backup speed comes at the cost of a possible increase of used space,
> +as chunks might only be partially reused, containing unneeded padding. This is
> +however minimized by selectively re-encoding files where the padding overhead
> +does not justify a re-use.

nit: reused is not hyphenated consistently

> +
> +The following shows an example for the client invocation with the `metadata`
> +mode:
> +
> +.. code-block:: console
> +
> +    # proxmox-backup-client backup.pxar:./linux --change-detection-mode=metadata
> +
>  .. _client_encryption:
>  
>  Encryption
> -- 
> 2.39.2
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
  2024-04-04 12:54   ` Fabian Grünbichler
@ 2024-04-05 11:28   ` Fabian Grünbichler
  1 sibling, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05 11:28 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:46)
> The helper method allows to lookup the entries of a dynamic index
> which fully cover a given offset range. Further, the helper returns
> the start padding from the start offset of the dynamic index entry
> to the start offset of the given range and the end padding.
> 
> This will be used to lookup size and digest for chunks covering the
> payload range of a regular file in order to re-use found chunks by
> indexing them in the archives index file instead of re-encoding the
> payload.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - moved this from the dynamic index to the pxar create as suggested
> - refactored and optimized search, going for linear search to find the
>   end entry
> - reworded commit message
> 
>  pbs-client/src/pxar/create.rs | 63 +++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 2bb5a6253..e2d3954ca 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -2,6 +2,7 @@ use std::collections::{HashMap, HashSet};
>  use std::ffi::{CStr, CString, OsStr};
>  use std::fmt;
>  use std::io::{self, Read};
> +use std::ops::Range;
>  use std::os::unix::ffi::OsStrExt;
>  use std::os::unix::io::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};
>  use std::path::{Path, PathBuf};
> @@ -16,6 +17,7 @@ use nix::fcntl::OFlag;
>  use nix::sys::stat::{FileStat, Mode};
>  
>  use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
> +use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
>  use pxar::encoder::{LinkOffset, SeqWrite};
>  use pxar::Metadata;
> @@ -25,6 +27,7 @@ use proxmox_lang::c_str;
>  use proxmox_sys::fs::{self, acl, xattr};
>  
>  use pbs_datastore::catalog::BackupCatalogWriter;
> +use pbs_datastore::dynamic_index::DynamicIndexReader;
>  
>  use crate::pxar::metadata::errno_is_unsupported;
>  use crate::pxar::tools::assert_single_path_component;
> @@ -791,6 +794,66 @@ impl Archiver {
>      }
>  }
>  
> +/// Dynamic Entry reusable by payload references
> +#[derive(Clone, Debug)]
> +#[repr(C)]
> +pub struct ReusableDynamicEntry {
> +    size_le: u64,

I don't think the `le` here makes sense, this is never stored on disk..

> +    digest: [u8; 32],
> +}
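
to illustrate what such a lookup boils down to, a self-contained sketch over
simplified types (the real helper operates on a DynamicIndexReader; all names
here are made up):

```rust
use std::ops::Range;

// entries are sorted by ascending end offset; entry i covers the byte
// range entries[i-1].end .. entries[i].end (starting at 0 for the first)
struct Entry {
    end: u64,
    digest: [u8; 32],
}

fn lookup_entries(entries: &[Entry], range: Range<u64>) -> (Vec<[u8; 32]>, u64, u64) {
    let (mut covering, mut chunk_start) = (Vec::new(), 0u64);
    let (mut start_padding, mut end_padding) = (0u64, 0u64);

    for entry in entries {
        if entry.end <= range.start {
            chunk_start = entry.end; // still before the range, skip
            continue;
        }
        if covering.is_empty() {
            // unused bytes from the chunk start to the range start
            start_padding = range.start - chunk_start;
        }
        covering.push(entry.digest);
        if entry.end >= range.end {
            // unused bytes from the range end to the chunk end
            end_padding = entry.end - range.end;
            break;
        }
    }
    // a real implementation must error out if the range was not fully covered
    (covering, start_padding, end_padding)
}
```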




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (57 preceding siblings ...)
  2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 58/58] test-suite: add bin to deb, add shell completions Christian Ebner
@ 2024-04-05 11:39 ` Fabian Grünbichler
  2024-04-29 12:13 ` Christian Ebner
  59 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-05 11:39 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

Quoting Christian Ebner (2024-03-28 13:36:09)
> A big thank you to Dietmar and Fabian for the review of the previous
> version and Fabian for extensive testing and help during debugging.
> 
> This series of patches implements an metadata based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
> 
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate data and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
> 
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and therefore rapidly accessed is used to
> lookup and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and chaching of the chunks for
> that index.
> 
> Changes to regular files are detected by comparing all of the files
> metadata object, including mtime, acls, ecc. If no changes are detected,
> the previous payload index is used to lookup chunks to possibly re-use
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> re-use or re-encode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If the padding
> introduced by reusing chunks falls below a threshold, the entries are
> referenced, the chunks are re-used and injected into the pxar payload
> upload stream, otherwise they are discated and the files encoded
> regularly.

There's still some not-too-fundamental refactoring in the feedback this time
around, but it's already taking shape.

A few bigger open questions:
- maybe do some test runs with different non-sliding-window chunking approaches
- what to do about the catalog? with the split archives, it would be nice to
  get rid of the overhead of having two metadata archives..
- CliParams/Prelude: what to use it for, and how (other than the parameters/CLI
  excludes)
- should we add a mode to force split archive, but no-reuse (for example,
  allowing to reset padding overhead every X backups)
- more testing, also of pathologically constructed input would be great (both
  for validation, and for performance/reuse regression testing)

Also, clippy doesn't like some of the new code, maybe you could take a look at
those as well, it's mostly minor stuff like unnecessary reference taking..

Thanks for all your work on this, I am sure this will be a big step forward in
extending the use cases where PBS makes sense :)




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata
  2024-04-05 10:49     ` Dietmar Maurer
@ 2024-04-08  8:28       ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-08  8:28 UTC (permalink / raw)
  To: Christian Ebner, Dietmar Maurer,
	Proxmox Backup Server development discussion

On April 5, 2024 12:49 pm, Dietmar Maurer wrote:
>> for the payload stream simple accumulating 1..N files (or rather, their
>> contents) in a chunk until a certain size threshold is reached might perform
>> better (as in, both be faster than the current chunker, and give us more/better
>> re-usable chunks).
> 
> Sorry, but that way you would never reuse any chunks! How is
> that supposed to work?

the chunk re-usage would be moved to the metadata-based caching,
basically:

- big files get a sequence of chunks according to some splitting rules,
  those chunks are completely just for that file (so if you just modify
  a bit at the front, only the first chunk would be new, the rest still
  re-used, but with read penalty)
- smaller files are aggregated into a single chunk, those would not be
  re-used if too many of them changed (payload threshold)

it might just trade one set of issues for another (higher padding vs
less deduplication), not sure.
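
to make that idea concrete, a very rough sketch of such a boundary rule;
the threshold values and names are made up for illustration, not a proposal
for the actual constants:

```rust
const MIN_CHUNK: u64 = 128 * 1024;
const MAX_CHUNK: u64 = 16 * 1024 * 1024;

/// Decide whether to force a chunk boundary before encoding the next
/// file payload, given `buffered` bytes already accumulated.
fn force_boundary(buffered: u64, next_file_size: u64) -> bool {
    if next_file_size >= MAX_CHUNK {
        // big files start fresh and get their own run of chunks
        buffered > 0
    } else {
        // small files are aggregated until the chunk is full enough
        buffered + next_file_size > MAX_CHUNK && buffered >= MIN_CHUNK
    }
}
```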




^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] applied: [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client Christian Ebner
@ 2024-04-08  9:04   ` Fabian Grünbichler
  2024-04-08  9:17     ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-08  9:04 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

this one is independent from the rest, applied already, thanks :)

in general it's helpful to order such patches first - increases the
chances of them being applied, since then it's obvious that they don't
include side-effects from earlier patches ;)

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Instead of taking ownership of the http client when starting a new
> BackupWriter instance, only borrow the client.
> 
> This allows to reuse the http client to later reuse it to start also a
> BackupReader instance as required for backup runs with metadata based
> file change detection mode, where both must use the same http client.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - not present in previous version
> 
>  examples/upload-speed.rs               | 2 +-
>  pbs-client/src/backup_writer.rs        | 2 +-
>  proxmox-backup-client/src/benchmark.rs | 2 +-
>  proxmox-backup-client/src/main.rs      | 4 ++--
>  4 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/examples/upload-speed.rs b/examples/upload-speed.rs
> index f9fc52a85..e4b570ec5 100644
> --- a/examples/upload-speed.rs
> +++ b/examples/upload-speed.rs
> @@ -18,7 +18,7 @@ async fn upload_speed() -> Result<f64, Error> {
>      let backup_time = proxmox_time::epoch_i64();
>  
>      let client = BackupWriter::start(
> -        client,
> +        &client,
>          None,
>          datastore,
>          &BackupNamespace::root(),
> diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
> index 8a03d8ea6..8bd0e4f36 100644
> --- a/pbs-client/src/backup_writer.rs
> +++ b/pbs-client/src/backup_writer.rs
> @@ -78,7 +78,7 @@ impl BackupWriter {
>      // FIXME: extract into (flattened) parameter struct?
>      #[allow(clippy::too_many_arguments)]
>      pub async fn start(
> -        client: HttpClient,
> +        client: &HttpClient,
>          crypt_config: Option<Arc<CryptConfig>>,
>          datastore: &str,
>          ns: &BackupNamespace,
> diff --git a/proxmox-backup-client/src/benchmark.rs b/proxmox-backup-client/src/benchmark.rs
> index b3047308c..1262fb46d 100644
> --- a/proxmox-backup-client/src/benchmark.rs
> +++ b/proxmox-backup-client/src/benchmark.rs
> @@ -229,7 +229,7 @@ async fn test_upload_speed(
>  
>      log::debug!("Connecting to backup server");
>      let client = BackupWriter::start(
> -        client,
> +        &client,
>          crypt_config.clone(),
>          repo.store(),
>          &BackupNamespace::root(),
> diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
> index 546275cb1..148708976 100644
> --- a/proxmox-backup-client/src/main.rs
> +++ b/proxmox-backup-client/src/main.rs
> @@ -834,7 +834,7 @@ async fn create_backup(
>  
>      let backup_time = backup_time_opt.unwrap_or_else(epoch_i64);
>  
> -    let client = connect_rate_limited(&repo, rate_limit)?;
> +    let http_client = connect_rate_limited(&repo, rate_limit)?;
>      record_repository(&repo);
>  
>      let snapshot = BackupDir::from((backup_type, backup_id.to_owned(), backup_time));
> @@ -886,7 +886,7 @@ async fn create_backup(
>      };
>  
>      let client = BackupWriter::start(
> -        client,
> +        &http_client,
>          crypt_config.clone(),
>          repo.store(),
>          &backup_ns,
> -- 
> 2.39.2
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* [pbs-devel] applied: [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type
  2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type Christian Ebner
@ 2024-04-08  9:05   ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-08  9:05 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion

same applies here - this makes sense even for the current code.

On March 28, 2024 1:36 pm, Christian Ebner wrote:
> Early return when the check fails, avoiding construction of unused
> object instances.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 2:
> - move to top of the function as intended
> 
>  proxmox-backup-client/src/main.rs | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/proxmox-backup-client/src/main.rs b/proxmox-backup-client/src/main.rs
> index 74adf1b16..931c841c7 100644
> --- a/proxmox-backup-client/src/main.rs
> +++ b/proxmox-backup-client/src/main.rs
> @@ -192,6 +192,10 @@ async fn backup_directory<P: AsRef<Path>>(
>      pxar_create_options: pbs_client::pxar::PxarCreateOptions,
>      upload_options: UploadOptions,
>  ) -> Result<BackupStats, Error> {
> +    if upload_options.fixed_size.is_some() {
> +        bail!("cannot backup directory with fixed chunk size!");
> +    }
> +
>      let pxar_stream = PxarBackupStream::open(dir_path.as_ref(), catalog, pxar_create_options)?;
>      let mut chunk_stream = ChunkStream::new(pxar_stream, chunk_size);
>  
> @@ -206,9 +210,6 @@ async fn backup_directory<P: AsRef<Path>>(
>          }
>      });
>  
> -    if upload_options.fixed_size.is_some() {
> -        bail!("cannot backup directory with fixed chunk size!");
> -    }
>  
>      let stats = client
>          .upload_stream(archive_name, stream, upload_options)
> -- 
> 2.39.2
> 
> 
> 




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] applied: [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client
  2024-04-08  9:04   ` [pbs-devel] applied: " Fabian Grünbichler
@ 2024-04-08  9:17     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-08  9:17 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

> On 08.04.2024 11:04 CEST Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:
> 
>  
> this one is independent from the rest, applied already, thanks :)
> 
> in general it's helpful to order such patches first - increases the
> chances of them being applied, since then it's obvious that they don't
> include side-effects from earlier patches ;)
> 

Noted, thx!




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification
  2024-04-04 14:54   ` Fabian Grünbichler
@ 2024-04-08 13:36     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-08 13:36 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 16:54, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Adds the specification for switching the detection mode used to
>> identify regular files which changed since a reference backup run.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - removed unneeded vector storing archive names for which to enable
>>    metadata mode, set either for all or none
>>
>>   pbs-client/src/backup_specification.rs | 40 ++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/pbs-client/src/backup_specification.rs b/pbs-client/src/backup_specification.rs
>> index 619a3a9da..4b1dbd188 100644
>> --- a/pbs-client/src/backup_specification.rs
>> +++ b/pbs-client/src/backup_specification.rs
>> @@ -4,6 +4,7 @@ use proxmox_schema::*;
>>   
>>   const_regex! {
>>       BACKUPSPEC_REGEX = r"^([a-zA-Z0-9_-]+\.(pxar|img|conf|log)):(.+)$";
>> +    DETECTION_MODE_REGEX = r"^(data|metadata(:[a-zA-Z0-9_-]+\.pxar)*)$";
> 
> this
> 
>>   }
>>   
>>   pub const BACKUP_SOURCE_SCHEMA: Schema =
>> @@ -11,6 +12,11 @@ pub const BACKUP_SOURCE_SCHEMA: Schema =
>>           .format(&ApiStringFormat::Pattern(&BACKUPSPEC_REGEX))
>>           .schema();
>>   
>> +pub const BACKUP_DETECTION_MODE_SPEC: Schema =
>> +    StringSchema::new("Backup source specification ([data|metadata(:<label>,...)]).")
> 
> and this
> 
>> +        .format(&ApiStringFormat::Pattern(&DETECTION_MODE_REGEX))
>> +        .schema();
>> +
>>   pub enum BackupSpecificationType {
>>       PXAR,
>>       IMAGE,
>> @@ -45,3 +51,37 @@ pub fn parse_backup_specification(value: &str) -> Result<BackupSpecification, Er
>>   
>>       bail!("unable to parse backup source specification '{}'", value);
>>   }
>> +
>> +/// Mode to detect file changes since last backup run
>> +pub enum BackupDetectionMode {
>> +    /// Regular mode, re-encode payload data
>> +    Data,
>> +    /// Compare metadata, reuse payload chunks if metadata unchanged
>> +    Metadata,
> 
> and this now do not really match anymore?
> 

Yes, removed the leftover complexity of the regex for the upcoming version.
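
for reference, the simplified spec would presumably end up along these lines
(a guess at the shape, mirroring the quoted file, not the applied patch):

```rust
const_regex! {
    DETECTION_MODE_REGEX = r"^(data|metadata)$";
}

pub const BACKUP_DETECTION_MODE_SPEC: Schema =
    StringSchema::new("Backup detection mode specification (data|metadata).")
        .format(&ApiStringFormat::Pattern(&DETECTION_MODE_REGEX))
        .schema();
```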




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues
  2024-04-04 14:52   ` Fabian Grünbichler
@ 2024-04-08 13:54     ` Christian Ebner
  2024-04-09  7:19       ` Fabian Grünbichler
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-08 13:54 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion, Fabian Grünbichler

On 4/4/24 16:52, Fabian Grünbichler wrote:
> On March 28, 2024 1:36 pm, Christian Ebner wrote:
>> Adds a queue to the chunk stream to request forced boundaries at a
>> given offset within the stream and inject reused dynamic entries
>> after this boundary.
>>
>> The chunks are then passed along to the uploader stream using the
>> injection queue, which inserts them during upload.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - combined queues into new optional struct
>> - refactoring
>>
>>   examples/test_chunk_speed2.rs                 |  2 +-
>>   pbs-client/src/backup_writer.rs               | 89 +++++++++++--------
>>   pbs-client/src/chunk_stream.rs                | 36 +++++++-
>>   pbs-client/src/pxar/create.rs                 |  6 +-
>>   pbs-client/src/pxar_backup_stream.rs          |  7 +-
>>   proxmox-backup-client/src/main.rs             | 31 ++++---
>>   .../src/proxmox_restore_daemon/api.rs         |  1 +
>>   pxar-bin/src/main.rs                          |  1 +
>>   tests/catar.rs                                |  1 +
>>   9 files changed, 121 insertions(+), 53 deletions(-)
>>
>> diff --git a/examples/test_chunk_speed2.rs b/examples/test_chunk_speed2.rs
>> index 3f69b436d..22dd14ce2 100644
>> --- a/examples/test_chunk_speed2.rs
>> +++ b/examples/test_chunk_speed2.rs
>> @@ -26,7 +26,7 @@ async fn run() -> Result<(), Error> {
>>           .map_err(Error::from);
>>   
>>       //let chunk_stream = FixedChunkStream::new(stream, 4*1024*1024);
>> -    let mut chunk_stream = ChunkStream::new(stream, None);
>> +    let mut chunk_stream = ChunkStream::new(stream, None, None);
>>   
>>       let start_time = std::time::Instant::now();
>>   
>> diff --git a/pbs-client/src/backup_writer.rs b/pbs-client/src/backup_writer.rs
>> index 8bd0e4f36..032d93da7 100644
>> --- a/pbs-client/src/backup_writer.rs
>> +++ b/pbs-client/src/backup_writer.rs
>> @@ -1,4 +1,4 @@
>> -use std::collections::HashSet;
>> +use std::collections::{HashSet, VecDeque};
>>   use std::future::Future;
>>   use std::os::unix::fs::OpenOptionsExt;
>>   use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
>> @@ -23,6 +23,7 @@ use pbs_tools::crypt_config::CryptConfig;
>>   
>>   use proxmox_human_byte::HumanByte;
>>   
>> +use super::inject_reused_chunks::{InjectChunks, InjectReusedChunks, InjectedChunksInfo};
>>   use super::merge_known_chunks::{MergeKnownChunks, MergedChunkInfo};
>>   
>>   use super::{H2Client, HttpClient};
>> @@ -265,6 +266,7 @@ impl BackupWriter {
>>           archive_name: &str,
>>           stream: impl Stream<Item = Result<bytes::BytesMut, Error>>,
>>           options: UploadOptions,
>> +        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>>       ) -> Result<BackupStats, Error> {
>>           let known_chunks = Arc::new(Mutex::new(HashSet::new()));
>>   
>> @@ -341,6 +343,7 @@ impl BackupWriter {
>>                   None
>>               },
>>               options.compress,
>> +            injection_queue,
>>           )
>>           .await?;
>>   
>> @@ -637,6 +640,7 @@ impl BackupWriter {
>>           known_chunks: Arc<Mutex<HashSet<[u8; 32]>>>,
>>           crypt_config: Option<Arc<CryptConfig>>,
>>           compress: bool,
>> +        injection_queue: Option<Arc<Mutex<VecDeque<InjectChunks>>>>,
>>       ) -> impl Future<Output = Result<UploadStats, Error>> {
>>           let total_chunks = Arc::new(AtomicUsize::new(0));
>>           let total_chunks2 = total_chunks.clone();
>> @@ -663,48 +667,63 @@ impl BackupWriter {
>>           let index_csum_2 = index_csum.clone();
>>   
>>           stream
>> -            .and_then(move |data| {
>> -                let chunk_len = data.len();
>> +            .inject_reused_chunks(
>> +                injection_queue.unwrap_or_default(),
>> +                stream_len,
>> +                reused_len.clone(),
>> +                index_csum.clone(),
>> +            )
>> +            .and_then(move |chunk_info| match chunk_info {
> 
> for this part here I am still not sure whether doing all of the
> accounting here wouldn't be nicer..
> 

Moved almost all of the accounting here; only the stream length is still 
required for the offset calculation in `inject_reused_chunks`.

> 
>> diff --git a/pbs-client/src/chunk_stream.rs b/pbs-client/src/chunk_stream.rs
>> index a45420ca0..6ac0c638b 100644
>> --- a/pbs-client/src/chunk_stream.rs
>> +++ b/pbs-client/src/chunk_stream.rs
>> @@ -38,15 +38,17 @@ pub struct ChunkStream<S: Unpin> {
>>       chunker: Chunker,
>>       buffer: BytesMut,
>>       scan_pos: usize,
>> +    injection_data: Option<InjectionData>,
>>   }
>>   
>>   impl<S: Unpin> ChunkStream<S> {
>> -    pub fn new(input: S, chunk_size: Option<usize>) -> Self {
>> +    pub fn new(input: S, chunk_size: Option<usize>, injection_data: Option<InjectionData>) -> Self {
>>           Self {
>>               input,
>>               chunker: Chunker::new(chunk_size.unwrap_or(4 * 1024 * 1024)),
>>               buffer: BytesMut::new(),
>>               scan_pos: 0,
>> +            injection_data,
>>           }
>>       }
>>   }
>> @@ -64,6 +66,34 @@ where
>>       fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<Self::Item>> {
>>           let this = self.get_mut();
>>           loop {
>> +            if let Some(InjectionData {
>> +                boundaries,
>> +                injections,
>> +                consumed,
>> +            }) = this.injection_data.as_mut()
>> +            {
>> +                // Make sure to release this lock as soon as possible
>> +                let mut boundaries = boundaries.lock().unwrap();
>> +                if let Some(inject) = boundaries.pop_front() {
> 
> here I am a bit more wary that this popping and re-pushing might hurt
> performance..

> 
>> +                    let max = *consumed + this.buffer.len() as u64;
>> +                    if inject.boundary <= max {
>> +                        let chunk_size = (inject.boundary - *consumed) as usize;
>> +                        let result = this.buffer.split_to(chunk_size);
> 
> a comment or better variable naming would make this easier to follow
> along.. >
> "result" is a forced chunk that is created here because we've reached a
> point where we want to inject something afterwards..
> 

Improved the variable naming and added comments to clarify the 
functionality for the upcoming version of the patches.

> once more I am wondering here whether for the payload stream, a vastly
> simplified chunker that just picks the boundaries based on re-use and
> payload size(s) (to avoid the one file == one chunk pathological case
> for lots of small files) wouldn't improve performance :)

Do you suggest having 2 chunker implementations, so that for the payload 
stream, instead of performing the chunking via the statistical 
sliding-window approach, the chunk boundaries are provided by some 
interface? As you mentioned in response to Dietmar on patch 49 of this 
patch series version?




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues
  2024-04-08 13:54     ` Christian Ebner
@ 2024-04-09  7:19       ` Fabian Grünbichler
  0 siblings, 0 replies; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-09  7:19 UTC (permalink / raw)
  To: Christian Ebner, Proxmox Backup Server development discussion

On April 8, 2024 3:54 pm, Christian Ebner wrote:
> On 4/4/24 16:52, Fabian Grünbichler wrote:
>> once more I am wondering here whether for the payload stream, a vastly
>> simplified chunker that just picks the boundaries based on re-use and
>> payload size(s) (to avoid the one file == one chunk pathological case
>> for lots of small files) wouldn't improve performance :)
> 
> Do you suggest to have 2 chunker implementations and for the payload 
> stream, instead of performing chunking by the statistical sliding window 
> approach use the  provide the chunk boundaries by some interface rather 
> than performing the chunking based on the statistical approach with the 
> sliding window? As you mentioned in response to Dietmar on patch 49 of 
> this patch series version?

yes - I think it would be interesting to evaluate. but only if such an
experiment is not a week-long effort :)

the two main questions would be:
- is a metadata-informed chunker faster than the sliding window (or how
  much faster)
- how does the dedup rate compare for some common scenarios

so maybe it would make sense to have a "change based" test corpus
first (which we IMHO want anyway), and then compare the two.




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching
  2024-04-05  7:52   ` Fabian Grünbichler
@ 2024-04-09  9:12     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-09  9:12 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 09:52, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2024-03-28 13:36:52)
>> In preparation for the look-ahead caching used to temporarily store
>> entries before encoding them in the pxar archive, being able to
>> decide whether to re-use or re-encode regular file entries.
>>
>> Allows to insert and store reused chunks in the archiver,
>> deduplicating chunks upon insert when possible.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - Strongly adapted and refactored: keep track also of paddings
>>    introduced by reusing the chunks, making a suggestion whether to
>>    re-use, re-encode or check next entry based on threshold
>> - completely removed code which allowed to calculate offsets based on
>>    chunks found in the middle, they must either be a continuation of the
>>    end or be added after, otherwise offsets are not monotonically
>>    increasing, which is required for sequential restore
>>
>>   pbs-client/src/pxar/create.rs | 126 +++++++++++++++++++++++++++++++++-
>>   1 file changed, 125 insertions(+), 1 deletion(-)
>>
>> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
>> index 335e3556f..95a91a59b 100644
>> --- a/pbs-client/src/pxar/create.rs
>> +++ b/pbs-client/src/pxar/create.rs
>> @@ -20,7 +20,7 @@ use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>>   use pbs_datastore::index::IndexFile;
>>   use proxmox_sys::error::SysError;
>>   use pxar::accessor::aio::Accessor;
>> -use pxar::encoder::{LinkOffset, SeqWrite};
>> +use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>>   use pxar::Metadata;
>>   
>>   use proxmox_io::vec;
>> @@ -36,6 +36,128 @@ use crate::pxar::metadata::errno_is_unsupported;
>>   use crate::pxar::tools::assert_single_path_component;
>>   use crate::pxar::Flags;
>>   
>> +const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
>> +
>> +#[derive(Default)]
>> +struct ReusedChunks {
>> +    start_boundary: PayloadOffset,
>> +    total: PayloadOffset,
>> +    padding: u64,
>> +    chunks: Vec<(u64, ReusableDynamicEntry)>,
>> +    must_flush_first: bool,
>> +    suggestion: Suggested,
>> +}
>> +
>> +#[derive(Copy, Clone, Default)]
>> +enum Suggested {
>> +    #[default]
>> +    CheckNext,
>> +    Reuse,
>> +    Reencode,
>> +}
> 
> this is a bit of a misnomer - it's not a suggestion, it's what is going to
> happen ;) maybe `action` or even `result` or something similar might be a
> better fit? or something going into the direction of "decision"?
> 

Renamed this to `Action`, as that seemed the better fitting of the two 
suggested options, and I could not come up with a better name.

>> +
>> +impl ReusedChunks {
>> +    fn new() -> Self {
>> +        Self::default()
>> +    }
> 
> this is not needed, can just use default()

Removed

> 
>> +
>> +    fn start_boundary(&self) -> PayloadOffset {
>> +        self.start_boundary
>> +    }
> 
> is this needed? access is only local anyway, and we don't have a similar helper
> for self.chunks ;)

Yes, removed it... my intention was to maybe move this to a different 
sub-module, but in the end it remained where it is now.

> 
>> +
>> +    fn is_empty(&self) -> bool {
>> +        self.chunks.is_empty()
>> +    }
>> +
>> +    fn suggested(&self) -> Suggested {
>> +        self.suggestion
> 
> same question here..

same as above

> 
>> +    }
>> +
>> +    fn insert(
>> +        &mut self,
>> +        indices: Vec<ReusableDynamicEntry>,
>> +        boundary: PayloadOffset,
>> +        start_padding: u64,
>> +        end_padding: u64,
>> +    ) -> PayloadOffset {
> 
> another fn that definitely would benefit from some comments describing the
> thought processes behind the logic :)

added some more explanation as comments

> 
>> +        if self.is_empty() {
>> +            self.start_boundary = boundary;
>> +        }
> 
> okay, so this is actually just set
> 
>> +
>> +        if let Some(offset) = self.last_digest_matched(&indices) {
>> +            if let Some((padding, last)) = self.chunks.last_mut() {
> 
> we already know a last chunk must exist (else last_digest_matched wouldn't have
> returned). that means last_digest_matched could just return the last chunk?

True, but then the mutable reference to self blocks further 
modifications, therefore I kept this as is.

> 
>> +                // Existing chunk, update padding based on pre-existing one
>> +                // Start padding is expected to be larger than previous padding
> 
> should we validate this expectation? or is it already validated somewhere else?
> and also, isn't start_padding by definition always smaller than last.size()?

Will opt for incorporating the padding into the `ReusableDynamicEntry` 
itself, so this does not need to be checked independently, as it is then 
inherent to the entry.
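
for illustration, one shape this could take (the `padding` field and the
method name are assumptions, not taken from the follow-up series):

```rust
#[derive(Clone, Debug)]
pub struct ReusableDynamicEntry {
    size: u64,
    padding: u64, // start/end padding attributed to this chunk
    digest: [u8; 32],
}

impl ReusableDynamicEntry {
    fn padding_ratio(&self) -> f64 {
        self.padding as f64 / self.size as f64
    }
}
```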

> 
>> +                *padding += start_padding - last.size();
>> +                self.padding += start_padding - last.size();
> 
> (see below) here we correct the per-chunk padding, but only for a partial chunk that is continued..
> 
> if I understand this part correctly, here we basically want to adapt from a
> potential end_padding of the last chunk, if it was:
> 
> A|A|A|P|P|P|P
> 
> and we now have
> 
> P|P|P|P|P|B|B
> 
> we want to end up with just 2 'P's in the middle? isn't start_padding - size
> the count of the payload? so what we actually want is
> 
> *padding -= (last.size() - start_padding)
> 
> ? IMHO that makes the intention much more readable, especially if you factor out

Yes, in this case it was possible to deduplicate the first chunk of the 
list to append, meaning that at least the last chunk of the last file 
and the first chunk of the new file are the same (this happens if both 
files are fully or partially contained within that chunk).

Therefore the chunk is not inserted again; instead, the padding of the 
already present entry is updated by first subtracting the used portion 
and the possible end_padding (note that this is not equal to the payload 
size, which might even cover more than one chunk), with the end_padding 
being re-added afterwards.

> 
> let payload_size = last.size() - start_padding;
> *padding -= payload_size;
> self.padding -= payload_size;

That makes it more readable, agreed, but this is not the payload size, 
so it will be referred to as `remaining` in the upcoming version of the 
patches.

So we have:

chunk_size = start_padding + used + end_padding

and call:

remaining = used + end_padding = chunk_size - start_padding

where start_padding >= 0 and end_padding >= 0 hold if all the payload is 
contained within one chunk. If the file payload covers multiple chunks, 
this is handled by updating the corresponding paddings for just the 
start and end chunk in the list.
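
a small worked example of that bookkeeping, with made-up sizes:

```rust
fn main() {
    let chunk_size: u64 = 4 * 1024 * 1024;
    let start_padding: u64 = 1024 * 1024;
    let end_padding: u64 = 512 * 1024;

    let remaining = chunk_size - start_padding; // = used + end_padding
    let used = remaining - end_padding;

    assert_eq!(remaining, 3 * 1024 * 1024);
    assert_eq!(chunk_size, start_padding + used + end_padding);
}
```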

> 
> if we want to be extra careful, we could even add the three checks/assertions here:
> - start_padding must be smaller than the chunk size
> - both chunk and total padding must be bigger than the payload size
> 
>> +            }
>> +
>> +            for chunk in indices.into_iter().skip(1) {
>> +                self.total = self.total.add(chunk.size());
>> +                self.chunks.push((0, chunk));
> 
> here we push the second and later chunks with 0 padding
> 
>> +            }
>> +
>> +            if let Some((padding, _last)) = self.chunks.last_mut() {
>> +                *padding += end_padding;
>> +                self.padding += end_padding;
> 
> and here we account for the end_padding of the last chunk, which might actually
> be the same as the first chunk, but that works out..

Yes, this is intentionally updated here, so as not to double account 
for the end padding if there is only that one chunk, as it is 
independent of whether this is also the first chunk.

> 
>> +            }
>> +
>> +            let padding_ratio = self.padding as f64 / self.total.raw() as f64;
>> +            if self.chunks.len() > 1 && padding_ratio < CHUNK_PADDING_THRESHOLD {
>> +                self.suggestion = Suggested::Reuse;
>> +            }
> 
> see below
> 
>> +
>> +            self.start_boundary.add(offset + start_padding)
>> +        } else {
>> +            let offset = self.total.raw();
>> +
>> +            if let Some(first) = indices.first() {
>> +                self.total = self.total.add(first.size());
>> +                self.chunks.push((start_padding, first.clone()));
>> +                // New chunk, all start padding counts
>> +                self.padding += start_padding;
>> +            }
>> +
>> +            for chunk in indices.into_iter().skip(1) {
>> +                self.total = self.total.add(chunk.size());
>> +                self.chunks.push((chunk.size(), chunk));
> 
> so here we count the full chunk size as padding (for the per-chunk stats), but
> don't count it for the total padding? I think this should be 0 just like above?

Yes, this is indeed incorrect and should be set to 0 here. These chunks 
don't add padding since they are fully used.

> 
> this and the handling of the first chunk could actually be combined:
> 
> for (idx, chunk) in indices.into_iter().enumerate() {
> 	self.total = self.total.add(chunk.size());
> 	let chunk_padding = if idx == 0 { self.padding += start_padding; start_padding } else { 0 };
> 	self.chunks.push((chunk_padding, chunk));
> }

Yeah, that is more compact; it does however require the index check for 
each item in the iteration. But that should not cost much, I guess. 
Will also add the end_padding accounting here using the same logic.

> 
> or we could make start_padding mut, and do
> 
> self.padding += start_padding;
> for chunk in indices.into_iter() {
> 	self.total = self.total.add(chunk.size());
> 	self.chunks.push((start_padding, chunk));
> 	start_padding = 0; // only first chunk counts
> }
> 
>> +            }
>> +
>> +            if let Some((padding, _last)) = self.chunks.last_mut() {
>> +                *padding += end_padding;
>> +                self.padding += end_padding;
>> +            }
>> +
>> +            if self.chunks.len() > 2 {
> 
> so if we insert more than two chunks without a continuation, all of them are
> accounted for as full of padding but they are still re-used if start and end
> padding are below the threshold ;)

If a file covers more than one chunk while adding less than the given 
threshold value in padding, then it should be reused, yes. Now that the 
paddings are actually set correctly for this branch, that is the intention.

> 
>> +                let padding_ratio = self.padding as f64 / self.total.raw() as f64;
>> +                if padding_ratio < CHUNK_PADDING_THRESHOLD {
>> +                    self.suggestion = Suggested::Reuse;
>> +                } else {
>> +                    self.suggestion = Suggested::Reencode;
>> +                }
> 
> we could just return the suggestion instead of storing it - it's only ever needed right after `insert` anyway?

Agreed, opted for that one

> 
> this calculation above seems to not handle some corner cases though.
> 
>   if I have the following sequence
> 
> |P|A|A|A|A|A|B|P|P|P|P|P|
> |    C1     |    C2     |
> 
> where P represent padding parts, A and B are file contents of two files, and C1
> and C2 are two chunks. let's say both A and B are re-usable files. first A is
> resolved via the index and inserted, but since it's the first chunk, there is
> no "suggestion" yet (CheckNext). now B is resolved and inserted - it doesn't
> continue a partial previous chunk, so we take the else branch. but now the
> (total) padding ratio is skewed because of B, even though A on its own would
> have been perfectly fine to be re-used.. (let's say we then insert another
> chunk different to C1 and C2, depending on the exact ratios that one might lead
> to the whole sequence being re-used or not, even though C2 should not be
> re-used, C1 should, and the hypothetical C3 might or might not!)
> 
> basically, as soon as we have a break in the chunk sequence (file A followed by
> file B, with no chunk overlap) we can treat each part on its own?

True, adapted the check here so that already a single chunk staying 
below the padding threshold results in a reuse.
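
for illustration, a sketch of evaluating each contiguous part on its own, so
one badly padded file cannot veto the reuse of an unrelated earlier one (the
threshold constant is from the series, the rest is assumed):

```rust
const CHUNK_PADDING_THRESHOLD: f64 = 0.1;

// stats for one contiguous run of reused chunks; intended to be reset
// whenever the chunk sequence breaks
struct PartStats {
    padding: u64,
    total: u64,
}

impl PartStats {
    fn reusable(&self) -> bool {
        self.total > 0 && (self.padding as f64 / self.total as f64) < CHUNK_PADDING_THRESHOLD
    }
}
```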





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison
  2024-04-05  8:14     ` Christian Ebner
@ 2024-04-09 12:52       ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-09 12:52 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 10:14, Christian Ebner wrote:
> On 4/5/24 10:08, Fabian Grünbichler wrote:
>> Quoting Christian Ebner (2024-03-28 13:36:54)
>>> Adds a method to compare the metadata of the current file entry
>>> against the metadata of the entry looked up in the previous backup
>>> snapshot.
>>>
>>> If the metadata matched, the start offset for the payload stream is
>>> returned.
>>>
>>> This is in preparation for reusing payload chunks for unchanged files.
>>>
>>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>>> ---
>>> changes since version 2:
>>> - refactored to new padding based threshold
>>>
>>>   pbs-client/src/pxar/create.rs | 31 ++++++++++++++++++++++++++++++-
>>>   1 file changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/pbs-client/src/pxar/create.rs 
>>> b/pbs-client/src/pxar/create.rs
>>> index 79925bba2..c64084a74 100644
>>> --- a/pbs-client/src/pxar/create.rs
>>> +++ b/pbs-client/src/pxar/create.rs
>>> @@ -21,7 +21,7 @@ use pbs_datastore::index::IndexFile;
>>>   use proxmox_sys::error::SysError;
>>>   use pxar::accessor::aio::{Accessor, Directory};
>>>   use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>>> -use pxar::Metadata;
>>> +use pxar::{EntryKind, Metadata};
>>>   use proxmox_io::vec;
>>>   use proxmox_lang::c_str;
>>> @@ -466,6 +466,35 @@ impl Archiver {
>>>           .boxed()
>>>       }
>>> +    async fn is_reusable_entry(
>>> +        &mut self,
>>> +        previous_metadata_accessor: &mut 
>>> Directory<LocalDynamicReadAt<RemoteChunkReader>>,
>>> +        file_name: &Path,
>>> +        stat: &FileStat,
>>> +        metadata: &Metadata,
>>> +    ) -> Result<Option<u64>, Error> {
>>> +        if stat.st_nlink > 1 {
>>> +            log::debug!("re-encode: {file_name:?} has hardlinks.");
>>> +            return Ok(None);
>>> +        }
>>
>> it would be nice if we had a way to handle those as well.. what's the 
>> current
>> blocker? shouldn't we be able to use the same scheme as for regular 
>> archives?
>>
>> first encounter adds (possibly re-uses) the payload and remembers the 
>> offset,
>> subsequent ones just add another reference/meta entry?
> 
> True, this is a leftover from the initial approach with the appendix 
> section instead of the split archive where it caused issues.
> 

Hardlinks will be encoded as such with the upcoming version of the 
patches. This however required additional changes to the pxar encoder, 
so that it also returns the LinkOffset for `add_payload_ref` calls, and 
an additional HashSet on the Archiver to remember cached regular files 
whose payload chunks have already been looked up, so that an encountered 
hard-linked file does not lead to re-injected chunks again.
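
a rough sketch of that bookkeeping — using a map instead of the mentioned
HashSet so the offset can be kept as well; the key choice (device/inode) is
an assumption on my part:

```rust
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq)]
struct HardLinkKey {
    dev: u64,
    ino: u64,
}

#[derive(Default)]
struct SeenFiles {
    seen: HashMap<HardLinkKey, u64 /* LinkOffset */>,
}

impl SeenFiles {
    /// Returns the previous offset for an already encoded file (encode
    /// a hardlink entry), or records the new one and returns None
    /// (first occurrence, encode the file itself).
    fn lookup_or_insert(&mut self, key: HardLinkKey, offset: u64) -> Option<u64> {
        match self.seen.get(&key) {
            Some(link_offset) => Some(*link_offset),
            None => {
                self.seen.insert(key, offset);
                None
            }
        }
    }
}
```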





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching
  2024-04-05  8:33   ` Fabian Grünbichler
@ 2024-04-09 14:53     ` Christian Ebner
       [not found]       ` <<dce38c53-f3e7-47ac-b1fd-a63daaabbcec@proxmox.com>
  0 siblings, 1 reply; 122+ messages in thread
From: Christian Ebner @ 2024-04-09 14:53 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 10:33, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2024-03-28 13:36:56)
>> +    async fn cache_or_flush_entries<T: SeqWrite + Send>(
>> +        &mut self,
>> +        encoder: &mut Encoder<'_, T>,
>> +        previous_metadata_accessor: &mut Option<Directory<LocalDynamicReadAt<RemoteChunkReader>>>,
>> +        c_file_name: &CStr,
>> +        stat: &FileStat,
>> +        fd: OwnedFd,
>> +        metadata: &Metadata,
>> +    ) -> Result<(), Error> {
>> +        let file_name: &Path = OsStr::from_bytes(c_file_name.to_bytes()).as_ref();
>> +        let reusable = if let Some(accessor) = previous_metadata_accessor {
>> +            self.is_reusable_entry(accessor, file_name, stat, metadata)
>> +                .await?
>> +        } else {
>> +            None
>> +        };
>> +
>> +        let file_size = stat.st_size as u64;
> 
> couldn't we get this via is_reusable?
> 
>> +        if let Some(start_offset) = reusable {
>> +            if let Some(ref ref_payload_index) = self.previous_payload_index {
>> +                let payload_size = file_size + size_of::<pxar::format::Header>() as u64;
> 
> or better yet, get this here directly ;)
> 
>> +                let end_offset = start_offset + payload_size;
> 
> or better yet, this one here ;)
> 
>> +                let (indices, start_padding, end_padding) =
>> +                    lookup_dynamic_entries(ref_payload_index, start_offset..end_offset)?;
> 
> or better yet, just return the Range in the payload archive? :)

Okay, yes.. changed `is_reusable_entry` to return the `Range<u64>` of 
the payload in the payload archive, calculated from the offset and size 
as encoded in the archive, both of which are already available there.
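
Roughly (a sketch of just the calculation, matching the quoted code 
above; it assumes the pxar crate for the header type):

```rust
use std::ops::Range;

/// Sketch: the payload header precedes the file contents in the
/// payload archive, so the re-usable range spans header plus payload.
fn payload_range(start_offset: u64, file_size: u64) -> Range<u64> {
    let payload_size = file_size + std::mem::size_of::<pxar::format::Header>() as u64;
    start_offset..start_offset + payload_size
}
```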

> 
>> +
>> +                let boundary = encoder.payload_position()?;
>> +                let offset =
>> +                    self.reused_chunks
>> +                        .insert(indices, boundary, start_padding, end_padding);
>> +
>> +                self.caching_enabled = true;
>> +                let cache_entry = CacheEntry::RegEntry(CacheEntryData::new(
>> +                    fd,
>> +                    c_file_name.into(),
>> +                    *stat,
>> +                    metadata.clone(),
>> +                    offset,
>> +                ));
>> +                self.cached_entries.push(cache_entry);
>> +
>> +                match self.reused_chunks.suggested() {
>> +                    Suggested::Reuse => self.flush_cached_to_archive(encoder, true, true).await?,
>> +                    Suggested::Reencode => {
>> +                        self.flush_cached_to_archive(encoder, false, true).await?
>> +                    }
>> +                    Suggested::CheckNext => {}
>> +                }
>> +
>> +                return Ok(());
>> +            }
>> +        }
>> +
>> +        self.flush_cached_to_archive(encoder, false, true).await?;
>> +        self.add_entry(encoder, previous_metadata_accessor, fd.as_raw_fd(), c_file_name, stat)
>> +            .await
> 
> this part here is where I think we mishandle some edge cases, like mentioned in
> the ReusedChunks patch comments.. even keeping back the last chunk doesn't save
> us from losing some re-usable files sometimes..

Still need to have a closer look at what you mean exactly here... The 
code path itself should be fine I think, or am I missing your point here?

It is rather the match against the `suggested` (now called `action`) 
where the wrong decision might be made.


>> +        if let Some((padding, chunk)) = last_chunk {
>> +            // Make sure that we flush this chunk even on clear calls
>> +            self.reused_chunks.must_flush_first = true;
> 
> might make sense to rename this one to "must_flush_first_chunk", else the other
> call sites might be interpreted as "must flush (all) chunks first"
> 

Yes, updated this according to your suggestion.

>> +            let _offset = self
>> +                .reused_chunks
>> +                .insert(vec![chunk], injection_boundary, padding, 0);
>> +        }
>> +
>> +        Ok(())
>> +    }
>> +
>> +    fn clear_cached_chunks<T: SeqWrite + Send>(
>> +        &mut self,
>> +        encoder: &mut Encoder<'_, T>,
>> +    ) -> Result<(), Error> {
>> +        let reused_chunks = std::mem::take(&mut self.reused_chunks);
>> +
>> +        if !reused_chunks.must_flush_first {
>> +            return Ok(());
>> +        }
>> +
>> +        // First chunk was kept back to avoid duplication but needs to be injected
>> +        let injection_boundary = reused_chunks.start_boundary();
>> +        let payload_writer_position = encoder.payload_position()?.raw();
>> +
>> +        if !reused_chunks.chunks.is_empty() && injection_boundary.raw() != payload_writer_position {
>> +            bail!(
>> +                "encoder payload writer position out of sync: got {payload_writer_position}, expected {}",
>> +                injection_boundary.raw()
>> +            );
>> +        }
>> +
>> +        if let Some((padding, chunk)) = reused_chunks.chunks.first() {
>> +            let size = PayloadOffset::default().add(chunk.size());
>> +            log::debug!(
>> +                "Injecting chunk with {} padding (chunk size {})",
>> +                HumanByte::from(*padding),
>> +                HumanByte::from(chunk.size()),
>> +            );
>> +            let inject_chunks = InjectChunks {
>> +                boundary: injection_boundary.raw(),
>> +                chunks: vec![chunk.clone()],
>> +                size: size.raw() as usize,
>> +            };
>> +
>> +            self.total_injected_size += size.raw();
>> +            self.total_injected_count += 1;
>> +            if *padding > 0 {
>> +                self.partial_chunks_count += 1;
>> +            }
>> +
>> +            if let Some(boundary) = self.forced_boundaries.as_mut() {
>> +                let mut boundary = boundary.lock().unwrap();
>> +                boundary.push_back(inject_chunks);
>> +            } else {
>> +                bail!("missing injection queue");
>> +            };
>> +            encoder.advance(size)?;
> 
> this part is basically the loop in flush_reused_chunks and could be de-duplicated in some fashion..

Yes, moved the common code into an additional helper function 
`inject_chunks_at_boundary`.
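
A simplified, self-contained sketch of what such a helper boils down to 
(the stub types below stand in for the real InjectChunks/archiver state, 
and the signature is an assumption, not the actual patch code):

```rust
use std::collections::VecDeque;

// Stub stand-ins for the real types used by the archiver.
struct InjectChunks {
    boundary: u64,
    size: usize,
}

#[derive(Default)]
struct InjectStats {
    injected_size: u64,
    injected_count: u64,
    partial_chunks: u64,
}

/// Queue (padding, chunk size) pairs for injection, advancing the
/// boundary by each chunk's size just like the encoder's payload
/// position is advanced; returns the new boundary.
fn inject_chunks_at_boundary(
    queue: &mut VecDeque<InjectChunks>,
    stats: &mut InjectStats,
    chunks: &[(u64, u64)], // (padding, chunk size)
    mut boundary: u64,
) -> u64 {
    for &(padding, chunk_size) in chunks {
        queue.push_back(InjectChunks {
            boundary,
            size: chunk_size as usize,
        });
        stats.injected_size += chunk_size;
        stats.injected_count += 1;
        if padding > 0 {
            stats.partial_chunks += 1;
        }
        boundary += chunk_size;
    }
    boundary
}
```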





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching
       [not found]       ` <<dce38c53-f3e7-47ac-b1fd-a63daaabbcec@proxmox.com>
@ 2024-04-10  7:03         ` Fabian Grünbichler
  2024-04-10  7:11           ` Christian Ebner
  0 siblings, 1 reply; 122+ messages in thread
From: Fabian Grünbichler @ 2024-04-10  7:03 UTC (permalink / raw)
  To: Christian Ebner, pbs-devel

On April 9, 2024 4:53 pm, Christian Ebner wrote:
> On 4/5/24 10:33, Fabian Grünbichler wrote:
>> Quoting Christian Ebner (2024-03-28 13:36:56)
>>> +
>>> +                let boundary = encoder.payload_position()?;
>>> +                let offset =
>>> +                    self.reused_chunks
>>> +                        .insert(indices, boundary, start_padding, end_padding);
>>> +
>>> +                self.caching_enabled = true;
>>> +                let cache_entry = CacheEntry::RegEntry(CacheEntryData::new(
>>> +                    fd,
>>> +                    c_file_name.into(),
>>> +                    *stat,
>>> +                    metadata.clone(),
>>> +                    offset,
>>> +                ));
>>> +                self.cached_entries.push(cache_entry);
>>> +
>>> +                match self.reused_chunks.suggested() {
>>> +                    Suggested::Reuse => self.flush_cached_to_archive(encoder, true, true).await?,
>>> +                    Suggested::Reencode => {
>>> +                        self.flush_cached_to_archive(encoder, false, true).await?
>>> +                    }
>>> +                    Suggested::CheckNext => {}
>>> +                }
>>> +
>>> +                return Ok(());
>>> +            }
>>> +        }
>>> +
>>> +        self.flush_cached_to_archive(encoder, false, true).await?;
>>> +        self.add_entry(encoder, previous_metadata_accessor, fd.as_raw_fd(), c_file_name, stat)
>>> +            .await
>> 
>> this part here is where I think we mishandle some edge cases, like mentioned in
>> the ReusedChunks patch comments.. even keeping back the last chunk doesn't save
>> us from losing some re-usable files sometimes..
> 
> Still need to have a closer look at what you mean exactly here... The 
> code path itself should be fine I think, or am I missing your point here?
> 
> It is rather the match against the `suggested` (now called `action`) 
> where the wrong decision might be made.

yes, sorry for not phrasing that more explicitly - I just meant to say
that we mishandle those wrong "Suggestions" here (because "CheckNext" is
suggested too often, we then proceed with re-encoding if the next file
is not re-usable. there is also the second issue where re-usable file
sequences crossing chunk boundaries and padding can cause the
suggestion to take a wrong turn).

not like "this code here is broken", but "it does the wrong thing based
on wrong information" ;) the fix is (most likely) in the suggestion
part, not in the handling part here.. but depending on the solution, it
might also be necessary to adapt something here :)




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching
  2024-04-10  7:03         ` Fabian Grünbichler
@ 2024-04-10  7:11           ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-10  7:11 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

> On 10.04.2024 09:03 CEST Fabian Grünbichler <f.gruenbichler@proxmox.com> wrote:
> 
> 
> yes, sorry for not phrasing that more explicitly - I just meant to say
> that we mishandle those wrong "Suggestions" here (because "CheckNext" is
> suggested too often, we then proceed with re-encoding if the next file
> is not re-usable. there is also the second issue where re-usable file
> sequences crossing chunk boundaries and padding can cause the
> suggestion to take a wrong turn).
> 
> not like "this code here is broken", but "it does the wrong thing based
> on wrong information" ;) the fix is (most likely) in the suggestion
> part, not in the handling part here.. but depending on the solution, it
> might also be necessary to adapt something here :)

Okay, thanks for the clarification!




^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output
  2024-04-05  9:47   ` Fabian Grünbichler
@ 2024-04-10 10:00     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-10 10:00 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 11:47, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2024-03-28 13:37:00)
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - not present in previous version
>>
>>   pbs-client/src/pxar/create.rs | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
>> index f103127c4..461509c39 100644
>> --- a/pbs-client/src/pxar/create.rs
>> +++ b/pbs-client/src/pxar/create.rs
>> @@ -407,6 +407,14 @@ where
>>       encoder.finish().await?;
>>       encoder.close().await?;
>>   
>> +    log::info!(
>> +        "Total injected: {} ({} chunks), total reused payload: {}, padding: {} ({} partial chunks)",
> 
> we already discussed this off-list, but something like
> 
> Change detection: processed XX files, YY unmodified (reused AA GB data + BB GB padding = CC GB total in DD chunks)
> 
> and only printing it with change detection metadata is probably easier to
> understand, but maybe we should get some more feedback on that as well :)
> 

Adapted this as suggested for the new version of the patches, opting 
however to split it into 2 lines:
- number of processed and unmodified files
- reused data, padding and total, as well as the number of reused 
chunks and partial chunks

This output is now only shown when a payload writer is available, and 
the second line is omitted if there was no reused data.
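
For illustration, the new summary could look roughly like the following 
sketch (function and parameter names are assumptions, not the final 
patch; HumanByte is the helper already used in the quoted code):

```rust
use proxmox_human_byte::HumanByte;

/// Hypothetical sketch of the two-line summary described above.
fn log_change_detection_stats(
    processed: u64,
    unmodified: u64,
    reused_payload: u64, // bytes of re-used payload data
    padding: u64,        // bytes of padding introduced by re-use
    reused_chunks: u64,
    partial_chunks: u64,
) {
    log::info!("change detection: processed {processed} files, {unmodified} unmodified");
    if reused_payload > 0 {
        log::info!(
            "reused {} data + {} padding = {} total in {} chunks ({} partial)",
            HumanByte::from(reused_payload),
            HumanByte::from(padding),
            HumanByte::from(reused_payload + padding),
            reused_chunks,
            partial_chunks,
        );
    }
}
```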





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection
  2024-04-05 10:16   ` Fabian Grünbichler
@ 2024-04-10 10:04     ` Christian Ebner
  0 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-10 10:04 UTC (permalink / raw)
  To: Fabian Grünbichler, pbs-devel

On 4/5/24 12:16, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2024-03-28 13:37:03)
>> A high level flow chart describing the logic used for the metadata
>> based file change detection mode.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 2:
>> - not present in previous version
>>
>>   ...ow-chart-metadata-based-file-change-detection.svg |  1 +
>>   ...ow-chart-metadata-based-file-change-detection.txt | 12 ++++++++++++
>>   2 files changed, 13 insertions(+)
>>   create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
>>   create mode 100644 pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
>>
>> diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
>> new file mode 100644
>> index 000000000..5e6df4815
>> --- /dev/null
>> +++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.svg
>> @@ -0,0 +1 @@
> 
> [snip]
> 
> something here got broken (I guess mail related somewhere along the way?). it
> does work if the contents are manually merged backed into a single line before
> applying the patch though. in any case, it probably would be nice to have this
> autogenerated and moved to some part of the docs :)
> 
>> diff --git a/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
>> new file mode 100644
>> index 000000000..5eace70be
>> --- /dev/null
>> +++ b/pbs-client/src/pxar/flow-chart-metadata-based-file-change-detection.txt
>> @@ -0,0 +1,12 @@
>> +flowchart TD
>> +    A[Archiver] -->|lookup metadata| B[Accessor]
>> +    B -->|is reusable entry| C[Lookahead Cache]
>> +    C -->|lookup reusable chunks| D[Dynamic Index]
>> +    D -->|insert and deduplicate dynamic entries| E[Reused Chunks]
>> +    B -->|is not reusable entry| F(re-encode cached entries and current entry)
>> +    F -->|caching disabled| A
>> +    E -->|padding above threshold, non-continuous chunks, caching disabled| F
>> +    E -->|padding above threshold, chunks continuous, caching enabled| A
>> +    E -->|padding below threshold| G(force boundary, inject chunks, keepback last chunk for potential followup)
>> +    G -->|caching enabled| A
> 
> the caching enabled/disabled parts here are probably confusing (do those edges
> mean caching is enabled/disabled at that point? or are they taken if it is
> enabled/disabled?)
> 
> but probably it makes sense to re-visit this in detail once the dust has settled :)
> 

Yes, agreed. Will drop this patch for now and rethink how to include 
this better, or even move some of it into the documentation in text 
form (although in a more user-oriented fashion rather than dev-oriented).





^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup
  2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
                   ` (58 preceding siblings ...)
  2024-04-05 11:39 ` [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Fabian Grünbichler
@ 2024-04-29 12:13 ` Christian Ebner
  59 siblings, 0 replies; 122+ messages in thread
From: Christian Ebner @ 2024-04-29 12:13 UTC (permalink / raw)
  To: pbs-devel

On 3/28/24 13:36, Christian Ebner wrote:
> [snip]

An updated version of the patch series is available at:
https://lists.proxmox.com/pipermail/pbs-devel/2024-April/009104.html


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2024-04-29 12:13 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-28 12:36 [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 01/58] encoder: fix two typos in comments Christian Ebner
2024-04-03  9:12   ` [pbs-devel] applied: " Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 02/58] format/examples: add PXAR_PAYLOAD_REF entry header Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 03/58] decoder: add method to read payload references Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 04/58] decoder: factor out skip part from skip_entry Christian Ebner
2024-04-03  9:18   ` Fabian Grünbichler
2024-04-03 11:02     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 05/58] encoder: add optional output writer for file payloads Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 06/58] encoder: move to stack based state tracking Christian Ebner
2024-04-03  9:54   ` Fabian Grünbichler
2024-04-03 11:01     ` Christian Ebner
2024-04-04  8:48       ` Fabian Grünbichler
2024-04-04  9:04         ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 07/58] decoder/accessor: add optional payload input stream Christian Ebner
2024-04-03 10:38   ` Fabian Grünbichler
2024-04-03 11:47     ` Christian Ebner
2024-04-03 12:18     ` Christian Ebner
2024-04-04  8:46       ` Fabian Grünbichler
2024-04-04  9:49         ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 08/58] encoder: add payload reference capability Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 09/58] encoder: add payload position capability Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 10/58] encoder: add payload advance capability Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 11/58] encoder/format: finish payload stream with marker Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 12/58] format: add payload stream start marker Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 13/58] format: add pxar format version entry Christian Ebner
2024-04-03 11:41   ` Fabian Grünbichler
2024-04-03 13:31     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 pxar 14/58] format/encoder/decoder: add entry type cli params Christian Ebner
2024-04-03 12:01   ` Fabian Grünbichler
2024-04-03 14:41     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 15/58] client: pxar: switch to stack based encoder state Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 16/58] client: backup writer: only borrow http client Christian Ebner
2024-04-08  9:04   ` [pbs-devel] applied: " Fabian Grünbichler
2024-04-08  9:17     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 17/58] client: backup: factor out extension from backup target Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 18/58] client: backup: early check for fixed index type Christian Ebner
2024-04-08  9:05   ` [pbs-devel] applied: " Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 19/58] client: pxar: combine writer params into struct Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 20/58] client: backup: split payload to dedicated stream Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 21/58] client: helper: add helpers for creating reader instances Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 22/58] client: helper: add method for split archive name mapping Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 23/58] client: restore: read payload from dedicated index Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 24/58] tools: cover meta extension for pxar archives Christian Ebner
2024-04-04  9:01   ` Fabian Grünbichler
2024-04-04  9:06     ` Christian Ebner
2024-04-04  9:10       ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 25/58] restore: " Christian Ebner
2024-04-04  9:02   ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 26/58] client: mount: make split pxar archives mountable Christian Ebner
2024-04-04  9:43   ` Fabian Grünbichler
2024-04-04 13:29     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 27/58] api: datastore: refactor getting local chunk reader Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 28/58] api: datastore: attach optional payload " Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 29/58] catalog: shell: factor out pxar fuse reader instantiation Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 30/58] catalog: shell: redirect payload reader for split streams Christian Ebner
2024-04-04  9:49   ` Fabian Grünbichler
2024-04-04 15:52     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 31/58] www: cover meta extension for pxar archives Christian Ebner
2024-04-04 10:01   ` Fabian Grünbichler
2024-04-04 14:51     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 32/58] pxar: add optional payload input for achive restore Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 33/58] pxar: add more context to extraction error Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 34/58] client: pxar: include payload offset in output Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 35/58] pxar: show padding in debug output on archive list Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 36/58] datastore: dynamic index: add method to get digest Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 37/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
2024-04-04 12:54   ` Fabian Grünbichler
2024-04-04 17:13     ` Christian Ebner
2024-04-05  7:22       ` Christian Ebner
2024-04-05 11:28   ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 38/58] upload stream: impl reused chunk injector Christian Ebner
2024-04-04 14:24   ` Fabian Grünbichler
2024-04-05 10:26     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 39/58] client: chunk stream: add struct to hold injection state Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 40/58] client: chunk stream: add dynamic entries injection queues Christian Ebner
2024-04-04 14:52   ` Fabian Grünbichler
2024-04-08 13:54     ` Christian Ebner
2024-04-09  7:19       ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 41/58] specs: add backup detection mode specification Christian Ebner
2024-04-04 14:54   ` Fabian Grünbichler
2024-04-08 13:36     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 42/58] client: implement prepare reference method Christian Ebner
2024-04-05  8:01   ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching Christian Ebner
2024-04-05  7:52   ` Fabian Grünbichler
2024-04-09  9:12     ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 44/58] client: pxar: add previous reference to archiver Christian Ebner
2024-04-04 15:04   ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 45/58] client: pxar: add method for metadata comparison Christian Ebner
2024-04-05  8:08   ` Fabian Grünbichler
2024-04-05  8:14     ` Christian Ebner
2024-04-09 12:52       ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 46/58] pxar: caching: add look-ahead cache types Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 47/58] client: pxar: add look-ahead caching Christian Ebner
2024-04-05  8:33   ` Fabian Grünbichler
2024-04-09 14:53     ` Christian Ebner
     [not found]       ` <<dce38c53-f3e7-47ac-b1fd-a63daaabbcec@proxmox.com>
2024-04-10  7:03         ` Fabian Grünbichler
2024-04-10  7:11           ` Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 48/58] fix #3174: client: pxar: enable caching and meta comparison Christian Ebner
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 49/58] client: backup: increase average chunk size for metadata Christian Ebner
2024-04-05  9:42   ` Fabian Grünbichler
2024-04-05 10:49     ` Dietmar Maurer
2024-04-08  8:28       ` Fabian Grünbichler
2024-03-28 12:36 ` [pbs-devel] [PATCH v3 proxmox-backup 50/58] client: backup writer: add injected chunk count to stats Christian Ebner
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 51/58] pxar: create: show chunk injection stats debug output Christian Ebner
2024-04-05  9:47   ` Fabian Grünbichler
2024-04-10 10:00     ` Christian Ebner
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 52/58] client: pxar: add entry kind format version Christian Ebner
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 53/58] client: pxar: opt encode cli exclude patterns as CliParams Christian Ebner
2024-04-05  9:49   ` Fabian Grünbichler
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 54/58] client: pxar: add flow chart for metadata change detection Christian Ebner
2024-04-05 10:16   ` Fabian Grünbichler
2024-04-10 10:04     ` Christian Ebner
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 55/58] docs: describe file format for split payload files Christian Ebner
2024-04-05 10:26   ` Fabian Grünbichler
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 56/58] docs: add section describing change detection mode Christian Ebner
2024-04-05 11:22   ` Fabian Grünbichler
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 57/58] test-suite: add detection mode change benchmark Christian Ebner
2024-03-28 12:37 ` [pbs-devel] [PATCH v3 proxmox-backup 58/58] test-suite: add bin to deb, add shell completions Christian Ebner
2024-04-05 11:39 ` [pbs-devel] [PATCH v3 pxar proxmox-backup 00/58] fix #3174: improve file-level backup Fabian Grünbichler
2024-04-29 12:13 ` Christian Ebner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox