From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [PATCH v9 proxmox-backup 00/58] fix #3174: improve file-level backup
Date: Wed,  5 Jun 2024 12:53:18 +0200
Message-ID: <20240605105416.278748-1-c.ebner@proxmox.com>

This series of patches implements a metadata-based file change
detection mechanism for improved pxar file-level backup creation speed
for unchanged files.

The chosen approach is to split pxar archives on creation via the
proxmox-backup-client into two separate data and upload streams,
one exclusive for regular file payloads, the other one for the rest
of the pxar archive, which is mostly metadata.

On consecutive runs, the metadata archive of the previous backup run,
which is limited in size and can therefore be accessed rapidly, is used
to look up and compare the metadata for entries to encode.
This assumes that the connection speed to the Proxmox Backup Server is
sufficiently fast, allowing the download and caching of the chunks for
that index.

Changes to regular files are detected by comparing the file's entire
metadata object, including mtime, ACLs, etc. If no changes are detected,
the previous payload index is used to look up chunks to possibly reuse
in the payload stream of the new archive.
In order to reduce possible chunk fragmentation, the decision whether to
reuse or reencode a file payload is deferred until enough information
is gathered by adding entries to a look-ahead cache. If the padding
introduced by reusing chunks falls below a threshold, the entries are
referenced, and the chunks are reused and injected into the pxar payload
upload stream; otherwise they are discarded and the files are encoded
regularly.
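
To illustrate the reuse decision described above, here is a minimal
sketch. It is not the code from the series: the `Range` and `Decision`
types, the `decide` function and the ratio-based threshold are
assumptions made for illustration only. It sums the size of all
previous chunks touched by the cached file payloads, treats the bytes
of those chunks not covered by any cached payload as padding, and
reuses only if the padding ratio stays below the threshold.

```rust
/// Half-open byte range `[start, end)` within the previous payload archive.
/// Used here both for cached file payloads and for chunk boundaries.
#[derive(Clone, Copy, Debug)]
struct Range {
    start: u64,
    end: u64,
}

impl Range {
    fn len(&self) -> u64 {
        self.end - self.start
    }

    fn overlaps(&self, other: &Range) -> bool {
        self.start < other.end && other.start < self.end
    }
}

#[derive(Debug, PartialEq)]
enum Decision {
    Reuse,
    Reencode,
}

/// Decide whether the cached entries should reference the previous chunks.
/// Assumes the cached payload ranges do not overlap each other.
fn decide(cached_payloads: &[Range], prev_chunks: &[Range], max_padding_ratio: f64) -> Decision {
    // Total size of all previous chunks that would have to be re-referenced.
    let reused_bytes: u64 = prev_chunks
        .iter()
        .filter(|chunk| cached_payloads.iter().any(|p| p.overlaps(chunk)))
        .map(Range::len)
        .sum();
    if reused_bytes == 0 {
        // Nothing to reuse, encode the files regularly.
        return Decision::Reencode;
    }
    // Bytes inside the reused chunks that belong to no cached payload.
    let payload_bytes: u64 = cached_payloads.iter().map(Range::len).sum();
    let padding = reused_bytes.saturating_sub(payload_bytes);
    if (padding as f64) / (reused_bytes as f64) < max_padding_ratio {
        Decision::Reuse
    } else {
        Decision::Reencode
    }
}
```

With two 100-byte chunks and cached payloads covering 90 bytes of each,
the padding ratio is 0.1, so a threshold of 0.25 leads to reuse. A
single 10-byte payload inside a 100-byte chunk gives a ratio of 0.9 and
is reencoded.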

Note:
Patches up to patch 36 will only compile without patches
be5d68aa8a3d5848e0fbbf651834514a35ed6dd8 and
0983094c87345284ee56ae9eeab47be8375cd730 on the pxar side; patch 36 and
the following patches, however, require them.

The following lists the most notable changes included in this series since
version 8:
- Fix an issue with entries being reencoded even when they could be
  reused.
- Reordered and regrouped patches based on parts of the code they
  modify.
- Smaller refactoring as outlined in the individual patches

The following lists the most notable changes included in this series since
version 7:
- Fixed incorrectly squashed patches during rebase

The following lists the most notable changes included in this series since
version 6:
- Allow using the `.pxar` extension in CLI commands for convenience
- Refactor the input/output interface for the pxar encoder, decoder and
  accessor to use a `PxarVariant` enum, in order to guarantee the
  payload-related input/output is always attached for split archives.
- Refactor the lookahead caching logic in the pxar `Archiver` to
  improve overall code readability.
- Add helper method for file name matching and use it where possible,
  for it to be handled in a single place.
- Extend documentation to include additional information about which
  metadata is compared to the previous snapshot
- Fix an issue with `pxar list`, which failed for metadata-only pxar
  archives.
- Fix an issue in the payload chunker test where the context was not
  updated accordingly.
- Various clippy fixes, smaller refactoring and reordering of patches
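
As a rough illustration of why an enum helps with the `PxarVariant`
refactoring mentioned above, the following sketch uses a hypothetical
`ArchiveVariant` type. The actual definition lives in the pxar crate
and may differ in name, type parameters and methods. The point is that
a split archive cannot be constructed without its payload part, while
a unified archive has none.

```rust
/// Hypothetical stand-in for an input/output variant of a pxar archive.
#[derive(Debug, PartialEq)]
enum ArchiveVariant<T> {
    /// Metadata and payload are interleaved in a single stream.
    Unified(T),
    /// Metadata stream plus a dedicated payload stream.
    Split(T, T),
}

impl<T> ArchiveVariant<T> {
    /// The archive (or metadata) stream, present in both variants.
    fn archive(&self) -> &T {
        match self {
            ArchiveVariant::Unified(archive) | ArchiveVariant::Split(archive, _) => archive,
        }
    }

    /// The payload stream, only present for split archives.
    fn payload(&self) -> Option<&T> {
        match self {
            ArchiveVariant::Unified(_) => None,
            ArchiveVariant::Split(_, payload) => Some(payload),
        }
    }
}
```

Callers that need the payload have to handle the `None` case
explicitly, instead of carrying a separate optional argument that can
get out of sync with the archive type.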

The following lists the most notable changes included in this series since
version 5:
- Fix an issue where the payload chunker was not correctly reset after
  suggested or forced boundaries.
- Added regression tests for payload chunker and chunk stream.

The following lists the most notable changes included in this series since
version 4:
- Increase open file handle limit to hard limit and adapt lookahead
  cache size dynamically (thanks a lot to Thomas for pointing this out
  and providing the necessary background information). This helps with
  the reuse of multiple entries contained within the same chunk, which
  would otherwise exceed the padding threshold and therefore be
  reencoded instead.
- Fix payload chunker scan to only scan up until chunk pos in case a
  suggested boundary is chosen.
- Fix issue with decoder state not being set to the correct
  `InDirectory` state after reading prelude and getting root directory
  entry.
- Fix issue with kept back chunk injection when the chunk follows a
  range discontinuity.
- Add regression test for pxar create with metadata archive and payload
  index reference.

The following lists the most notable changes included in this series since
version 3:
- Rework the whole reused chunk injection and accounting logic and use
  lockless async `mpsc::channel`s instead of `Arc<Mutex<VecDeque<..>>>`.
- Reworked lookahead caching logic to use payload ranges and check for
  possible range continuation instead of looking up the reusable dynamic
  entries immediately in case of a reusable entry chain. This also
  avoids edge cases not covered in the previous version of the patch series.
  This current version therefore tends to reencode small files more
  aggressively, since they might introduce additional unwanted paddings.
- Also cover hardlinks correctly in the reuse logic, avoiding
  reencoding of these entries.
- Add additional dedicated chunker implementation for payload data
  stream, allowing the archiver to suggest boundaries to the chunker to
  reduce padding for reused chunks.
- Add additional `change-detection-mode=data`, in order to allow
  creating split archives with fully reencoded payload data.
- Add additional payload input readers for pxar accessor type
  implementations where needed.
- Add additional consistency check in pxar encoder when dropping state
  or encoder instance.
- CliParams was renamed to the more opaque Prelude, since the pxar
  archive does not care about its contents and this might be extended to
  store other information about the archive as well.
- Add missing proxmox-file-restore for split archives and fix restore of
  tar/zip archives via WebUI. This is handled by the same decoder logic,
  and needed an updated payload input content range to read the data
  from the correct location in the payload data archive.
- Additional refactoring to use the pxar reader helpers where possible.
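
The move from `Arc<Mutex<VecDeque<..>>>` to channels mentioned in the
list above can be sketched as follows. This is only an illustration
under assumptions: the series uses async channels, whereas this
example substitutes the synchronous `std::sync::mpsc` channel to stay
self-contained, and `ReusedChunk` as well as `forward_reused_chunks`
are hypothetical names. The producer stands in for the archiver
queueing chunks for injection, the consumer for the upload stream.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical description of a chunk to be injected into the upload stream.
#[derive(Debug, PartialEq)]
struct ReusedChunk {
    digest: [u8; 4],
    size: u64,
}

/// Send chunks from a producer thread to the consumer and return them
/// in the order they were received.
fn forward_reused_chunks(chunks: Vec<ReusedChunk>) -> Vec<ReusedChunk> {
    let (tx, rx) = mpsc::channel();

    // Producer: queues reusable chunks without any shared lock.
    let producer = thread::spawn(move || {
        for chunk in chunks {
            tx.send(chunk).expect("receiver dropped");
        }
        // `tx` is dropped here, which closes the channel.
    });

    // Consumer: iterates until the channel is closed, preserving order.
    let received: Vec<ReusedChunk> = rx.iter().collect();
    producer.join().expect("producer panicked");
    received
}
```

Ownership of each chunk moves through the channel, so neither side has
to lock a shared queue, and closing the sender doubles as the
end-of-stream signal.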

The following lists the most notable changes included in this series since
version 2:
- many bugfixes regarding incorrect archive encoding by wrong offset
  generation, adding additional sanity checks and rather fail on
  encoding than produce an incorrectly encoded archive
- different approach for deciding whether to reuse or reencode the
  entries. Previously, the entries have been encoded when a cached
  payload size threshold was reached. Now, the padding introduced by
  reusable chunks is tracked, and only if the padding does not exceed
  the set threshold, the entries are reused. This reduces the possible
  padding, at the cost of reencoding more entries. It also avoids
  reusing chunks which now have large padding holes because of
  moved/removed files contained within.
- added headers for metadata archive and payload file
- added documentation

With these patches, a backup run is now invoked as:
```bash
proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
```
During the first run, no reference index is available; the pxar archive
will nevertheless be split into the two parts.
Subsequent backups will then utilize the pxar archive accessor and
index files of the previous run to perform file change detection.
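
How the `<label>.pxar` name given on the command line could map to the
two parts of a split archive is sketched below. The function name
`split_archive_names` is hypothetical and not taken from the series.
The `.mpxar` and `.ppxar` extensions are the ones used by the test
fixtures added under `tests/pxar/`.

```rust
/// Map an archive name ending in `.pxar` to the names of the metadata
/// and payload archives of a split archive. Returns `None` for names
/// without the `.pxar` extension or with an empty label.
fn split_archive_names(name: &str) -> Option<(String, String)> {
    let label = name.strip_suffix(".pxar")?;
    if label.is_empty() {
        return None;
    }
    Some((format!("{}.mpxar", label), format!("{}.ppxar", label)))
}
```

For example, `root.pxar` would map to `root.mpxar` for the metadata
archive and `root.ppxar` for the payload archive.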

As benchmarks, the Linux source code as well as the COCO dataset for
computer vision and pattern recognition can be used.
The benchmarks can be performed by running:
```bash
proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
```

The above command invocations assume the default repository and
credentials to be set as environment variables; they can, however, be
passed as additional optional parameters instead.

Christian Ebner (58):
  client: pxar: switch to stack based encoder state
  client: pxar: combine writers into struct
  client: pxar: optionally split metadata and payload streams
  client: helper: add helpers for creating reader instances
  client: helper: add method for split archive name mapping
  client: tools: helper to check pxar filename extensions
  client: restore: read payload from dedicated index
  client: tools: cover extension for split pxar archives
  client: mount: make split pxar archives mountable
  api: datastore: attach split archive payload chunk reader
  catalog: shell: make split pxar archives accessible
  www: cover metadata extension for pxar archives
  file restore: cover extension for split pxar archives
  file restore: factor out getting pxar reader
  file restore: cover split metadata and payload archives
  file restore: show more error context when extraction fails
  pxar: bin: add optional payload input for archive restore
  pxar: bin: cover listing for split archives
  pxar: bin: add more context to extraction error
  client: pxar: include payload offset in entry listing
  client: pxar: helper for lookup of reusable dynamic entries
  upload stream: implement reused chunk injector
  client: chunk stream: add struct to hold injection state
  chunker: add method to reset chunker state
  client: streams: add channels for dynamic entry injection
  specs: add backup detection mode specification
  client: implement prepare reference method
  client: pxar: add method for metadata comparison
  pxar: caching: add look-ahead cache
  client: pxar: refactor catalog encoding for directories
  fix #3174: client: pxar: enable caching and meta comparison
  client: backup writer: add injected chunk count to stats
  pxar: create: keep track of reused chunks and files
  pxar: create: show chunk injection stats info output
  client: backup writer: make backup info output more concise
  client: pxar: add helper to handle optional preludes
  client: pxar: opt encode cli exclude patterns as Prelude
  client: pxar: allow to restore prelude to optional path
  pxar: bin: show padding in debug output on archive list
  pxar: bin: ignore version and prelude entries in listing
  pxar: bin: test `pxar list` with payload-input
  pxar: bin: support creation of split pxar archives via cli
  pxar: add optional payload input to mount archive
  datastore: chunker: add Chunker trait
  datastore: chunker: implement chunker for payload stream
  chunker: tests: add regression tests for payload chunker
  chunk stream: tests: add regression tests for payload chunker
  client: chunk stream: switch payload stream chunker
  client: pxar: add archive creation with reference test
  client: tools: add helper to raise nofile rlimit
  client: pxar: set cache limit based on nofile rlimit
  api: datastore: add endpoint to lookup entries via pxar archive
  api: datastore: add optional archive-name to file-restore
  www: content: lookup via metadata archive instead of catalog
  docs: file formats: describe split pxar archive file layout
  docs: add section describing change detection mode
  test-suite: add detection mode change benchmark
  test-suite: Makefile: add debian package and related files

 Cargo.toml                                    |   1 +
 Makefile                                      |  18 +-
 debian/control                                |   7 +
 debian/proxmox-backup-client.bash-completion  |   1 +
 debian/proxmox-backup-test-suite.bc           |   8 +
 debian/proxmox-backup-test-suite.install      |   3 +
 docs/Makefile                                 |   2 +
 docs/backup-client.rst                        |  47 +
 docs/command-line-tools.rst                   |   5 +
 docs/command-syntax.rst                       |   4 +
 docs/conf.py                                  |   1 +
 docs/file-formats.rst                         |  46 +
 docs/meta-format-overview.dot                 |  50 +
 .../proxmox-backup-test-suite/description.rst |   2 +
 docs/proxmox-backup-test-suite/man1.rst       |  17 +
 docs/technical-overview.rst                   |   3 +
 examples/test_chunk_size.rs                   |   9 +-
 examples/test_chunk_speed.rs                  |   7 +-
 examples/test_chunk_speed2.rs                 |   2 +-
 pbs-client/src/backup_specification.rs        |  26 +
 pbs-client/src/backup_writer.rs               | 125 ++-
 pbs-client/src/chunk_stream.rs                | 238 ++++-
 pbs-client/src/inject_reused_chunks.rs        | 127 +++
 pbs-client/src/lib.rs                         |   3 +-
 pbs-client/src/pxar/create.rs                 | 908 +++++++++++++++++-
 pbs-client/src/pxar/extract.rs                |  28 +-
 pbs-client/src/pxar/look_ahead_cache.rs       | 162 ++++
 pbs-client/src/pxar/mod.rs                    |   5 +-
 pbs-client/src/pxar/tools.rs                  | 123 ++-
 pbs-client/src/pxar_backup_stream.rs          |  71 +-
 pbs-client/src/tools/mod.rs                   | 124 ++-
 pbs-datastore/src/chunker.rs                  | 267 ++++-
 pbs-datastore/src/dynamic_index.rs            |   9 +-
 pbs-datastore/src/lib.rs                      |   2 +-
 pbs-pxar-fuse/src/lib.rs                      |  14 +-
 proxmox-backup-client/src/catalog.rs          |  29 +-
 proxmox-backup-client/src/helper.rs           |  72 ++
 proxmox-backup-client/src/main.rs             | 293 +++++-
 proxmox-backup-client/src/mount.rs            |  33 +-
 proxmox-backup-test-suite/Cargo.toml          |  18 +
 .../src/detection_mode_bench.rs               | 294 ++++++
 proxmox-backup-test-suite/src/main.rs         |  17 +
 proxmox-file-restore/src/main.rs              |  74 +-
 .../src/proxmox_restore_daemon/api.rs         |  20 +-
 pxar-bin/Cargo.toml                           |   1 +
 pxar-bin/src/main.rs                          | 158 ++-
 pxar-bin/tests/pxar.rs                        | 135 +++
 src/api2/admin/datastore.rs                   | 178 +++-
 src/api2/tape/restore.rs                      |  22 +-
 src/bin/proxmox_backup_debug/diff.rs          |   2 +-
 src/tape/file_formats/snapshot_archive.rs     |   8 +-
 tests/catar.rs                                |   7 +-
 tests/pxar/backup-client-pxar-data.mpxar      | Bin 0 -> 15070 bytes
 tests/pxar/backup-client-pxar-data.ppxar.didx | Bin 0 -> 8096 bytes
 tests/pxar/backup-client-pxar-expected.mpxar  | Bin 0 -> 15086 bytes
 tests/pxar/backup-client-pxar-expected.ppxar  | Bin 0 -> 104859264 bytes
 www/datastore/Content.js                      |  37 +-
 zsh-completions/_proxmox-backup-test-suite    |  13 +
 58 files changed, 3526 insertions(+), 350 deletions(-)
 create mode 100644 debian/proxmox-backup-test-suite.bc
 create mode 100644 debian/proxmox-backup-test-suite.install
 create mode 100644 docs/meta-format-overview.dot
 create mode 100644 docs/proxmox-backup-test-suite/description.rst
 create mode 100644 docs/proxmox-backup-test-suite/man1.rst
 create mode 100644 pbs-client/src/inject_reused_chunks.rs
 create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
 create mode 100644 proxmox-backup-client/src/helper.rs
 create mode 100644 proxmox-backup-test-suite/Cargo.toml
 create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
 create mode 100644 proxmox-backup-test-suite/src/main.rs
 create mode 100644 tests/pxar/backup-client-pxar-data.mpxar
 create mode 100644 tests/pxar/backup-client-pxar-data.ppxar.didx
 create mode 100644 tests/pxar/backup-client-pxar-expected.mpxar
 create mode 100644 tests/pxar/backup-client-pxar-expected.ppxar
 create mode 100644 zsh-completions/_proxmox-backup-test-suite

-- 
2.39.2





Thread overview: 59+ messages
2024-06-05 10:53 Christian Ebner [this message]
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 01/58] client: pxar: switch to stack based encoder state Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 02/58] client: pxar: combine writers into struct Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 03/58] client: pxar: optionally split metadata and payload streams Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 04/58] client: helper: add helpers for creating reader instances Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 05/58] client: helper: add method for split archive name mapping Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 06/58] client: tools: helper to check pxar filename extensions Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 07/58] client: restore: read payload from dedicated index Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 08/58] client: tools: cover extension for split pxar archives Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 09/58] client: mount: make split pxar archives mountable Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 10/58] api: datastore: attach split archive payload chunk reader Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 11/58] catalog: shell: make split pxar archives accessible Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 12/58] www: cover metadata extension for pxar archives Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 13/58] file restore: cover extension for split " Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 14/58] file restore: factor out getting pxar reader Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 15/58] file restore: cover split metadata and payload archives Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 16/58] file restore: show more error context when extraction fails Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 17/58] pxar: bin: add optional payload input for archive restore Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 18/58] pxar: bin: cover listing for split archives Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 19/58] pxar: bin: add more context to extraction error Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 20/58] client: pxar: include payload offset in entry listing Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 21/58] client: pxar: helper for lookup of reusable dynamic entries Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 22/58] upload stream: implement reused chunk injector Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 23/58] client: chunk stream: add struct to hold injection state Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 24/58] chunker: add method to reset chunker state Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 25/58] client: streams: add channels for dynamic entry injection Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 26/58] specs: add backup detection mode specification Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 27/58] client: implement prepare reference method Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 28/58] client: pxar: add method for metadata comparison Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 29/58] pxar: caching: add look-ahead cache Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 30/58] client: pxar: refactor catalog encoding for directories Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 31/58] fix #3174: client: pxar: enable caching and meta comparison Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 32/58] client: backup writer: add injected chunk count to stats Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 33/58] pxar: create: keep track of reused chunks and files Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 34/58] pxar: create: show chunk injection stats info output Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 35/58] client: backup writer: make backup info output more concise Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 36/58] client: pxar: add helper to handle optional preludes Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 37/58] client: pxar: opt encode cli exclude patterns as Prelude Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 38/58] client: pxar: allow to restore prelude to optional path Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 39/58] pxar: bin: show padding in debug output on archive list Christian Ebner
2024-06-05 10:53 ` [pbs-devel] [PATCH v9 proxmox-backup 40/58] pxar: bin: ignore version and prelude entries in listing Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 42/58] pxar: bin: support creation of split pxar archives via cli Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 43/58] pxar: add optional payload input to mount archive Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 44/58] datastore: chunker: add Chunker trait Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 45/58] datastore: chunker: implement chunker for payload stream Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 46/58] chunker: tests: add regression tests for payload chunker Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 47/58] chunk stream: " Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 48/58] client: chunk stream: switch payload stream chunker Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 49/58] client: pxar: add archive creation with reference test Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 50/58] client: tools: add helper to raise nofile rlimit Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 51/58] client: pxar: set cache limit based on " Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 52/58] api: datastore: add endpoint to lookup entries via pxar archive Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 53/58] api: datastore: add optional archive-name to file-restore Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 54/58] www: content: lookup via metadata archive instead of catalog Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 55/58] docs: file formats: describe split pxar archive file layout Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 56/58] docs: add section describing change detection mode Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 57/58] test-suite: add detection mode change benchmark Christian Ebner
2024-06-05 10:54 ` [pbs-devel] [PATCH v9 proxmox-backup 58/58] test-suite: Makefile: add debian package and related files Christian Ebner
2024-06-06  6:47 ` [pbs-devel] partially-applied: [PATCH v9 proxmox-backup 00/58] fix #3174: improve file-level backup Fabian Grünbichler
