From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [PATCH-SERIES v5 pxar proxmox-backup proxmox-widget-toolkit 00/28] fix #3174: improve file-level backup
Date: Wed, 15 Nov 2023 16:47:45 +0100
Message-ID: <20231115154813.281564-1-c.ebner@proxmox.com>

Changes to the patch series since version 4 are based on the feedback
obtained via internal communication channels. Many thanks to Thomas,
Fabian, Wolfgang and Dominik for their continuous feedback up until
now.

This series of patches implements a metadata-based file change
detection mechanism to speed up pxar file-level backup creation for
unchanged files.

The chosen approach is to skip encoding of regular file payloads,
for which metadata (currently ctime and size) did not change as
compared to a previous backup run. Instead of re-encoding the files, a
reference to a newly introduced appendix section of the pxar archive
will be written. The appendix section will be created as concatenation
of indexed chunks from the previous backup run, thereby containing the
sequential file payload at a calculated offset with respect to the
starting point of the appendix section.
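The decision described above can be sketched roughly as follows. This is
only an illustration and not the code from this series: the names
`CatalogFileEntry`, `Action` and `decide` are hypothetical, and the
catalog entry is reduced to the fields mentioned in this cover letter.

```rust
/// Hypothetical, simplified view of what the previous backup's catalog
/// records for a regular file (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CatalogFileEntry {
    size: u64,
    ctime: i64,
    /// Offset of the file within the previous pxar archive byte stream.
    offset: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// Metadata changed or no previous entry: encode the payload as usual.
    Encode,
    /// Metadata unchanged: skip the payload and reference the previous data.
    Reuse { previous_offset: u64 },
}

/// Compare the current size and ctime against the previous catalog entry.
fn decide(previous: Option<&CatalogFileEntry>, size: u64, ctime: i64) -> Action {
    match previous {
        Some(prev) if prev.size == size && prev.ctime == ctime => Action::Reuse {
            previous_offset: prev.offset,
        },
        _ => Action::Encode,
    }
}
```

A file without a previous catalog entry, or with a differing size or
ctime, is encoded regularly; only an exact match on both values leads to
a reference.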

Metadata comparison and calculation of the chunks to be indexed for the
appendix section is performed using the catalog of a previous backup as
reference. In order to be able to calculate the offsets, an updated
catalog file format version 2 is introduced which extends the previous
version by including the file offset with respect to the pxar archive
byte stream, as well as the file's ctime. This allows finding the required
chunk indexes and the start padding within the concatenated chunks.
The catalog reader remains backwards compatible to the catalog file
format version 1.
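As a minimal sketch of the offset calculation mentioned above, assume the
index of the previous archive is available as a sorted list of chunk end
offsets. The function name `chunk_range` and this representation are
assumptions made for illustration, not the API added by the series.

```rust
/// Given the sorted end offsets of all chunks in the previous archive,
/// find the chunks covering `size` bytes starting at `offset`.
/// Returns (first chunk index, last chunk index, start padding), where the
/// start padding is the distance from the start of the first chunk to
/// `offset`. Returns None for empty ranges or ranges beyond the last chunk.
fn chunk_range(chunk_ends: &[u64], offset: u64, size: u64) -> Option<(usize, usize, u64)> {
    if size == 0 {
        return None;
    }
    let end = offset.checked_add(size)?;
    // First chunk whose end lies strictly behind `offset` contains that byte.
    let first = chunk_ends.partition_point(|&chunk_end| chunk_end <= offset);
    // First chunk whose end is at or behind `end` contains the last byte.
    let last = chunk_ends.partition_point(|&chunk_end| chunk_end < end);
    if last >= chunk_ends.len() {
        return None;
    }
    let chunk_start = if first == 0 { 0 } else { chunk_ends[first - 1] };
    Some((first, last, offset - chunk_start))
}
```

With chunk ends at 100, 250 and 400, a payload of 200 bytes at offset 120
spans chunks 1 and 2 and starts 20 bytes into chunk 1, so
`chunk_range(&[100, 250, 400], 120, 200)` yields `Some((1, 2, 20))`.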

During encoding, the chunks needed for the appendix section are injected
in the backup upload stream after forcing a chunk boundary when regular
pxar encoding is finished. Finally, pxar archives containing an
appendix section are marked as such by appending a final pxar goodbye
lookup table containing only the offset to the appendix section start and
the total size of that section, which are needed for random access, e.g. to
mount the archive via the fuse filesystem implementation.

The following lists the most notable changes included in this series since
version 4:

- fix an issue with premature injection queue chunk insertion on initialization
- fix an issue with the decoder state not being correctly set at the start of
  the appendix section, leading to decoding errors in special cases.
- avoid double injection of chunks in cases where the chunk list to insert
  starts with the first chunk of the list already being present, but not the
  subsequent ones
- refactoring and renaming of the Encoder's `bytes_len` to `encoded_size`
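
The double injection case from the list above can be pictured with the
following sketch. It assumes that chunks are identified by a 32 byte
digest and that a new chunk list may start with the chunk that was queued
last; `Digest` and `queue_for_injection` are hypothetical names, and the
actual check in the series may differ.

```rust
/// Hypothetical chunk identifier: a 32 byte digest.
type Digest = [u8; 32];

/// Append `chunks` to the injection `queue`, skipping the first chunk if it
/// is identical to the chunk queued last. Returns the number of chunks added.
fn queue_for_injection(queue: &mut Vec<Digest>, chunks: &[Digest]) -> usize {
    let skip = match (queue.last(), chunks.first()) {
        (Some(last), Some(first)) if last == first => 1,
        _ => 0,
    };
    // Only the chunks not yet present at the end of the queue are appended.
    queue.extend_from_slice(&chunks[skip..]);
    chunks.len() - skip
}
```

If the queue already ends with chunk `b` and the list `[b, c]` is to be
inserted, only `c` is appended, so `b` is injected once.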

The following lists the most notable changes included in this series since
version 3:

- count appendix chunks as reused chunks, as they are not re-encoded and
  re-uploaded.
- add a heuristic to reduce chunk fragmentation for appendix chunks for
  multiple consecutive backup runs with metadata based file change detection.
- refactor the appendix list generation during pxar encoding in the archiver.
- switch from Vec to BTreeMap for restoring the appendix entries so
  entries are inserted in sorted order based on their offset, making it
  unnecessary to sort afterwards.
- fix issue with chunk injection code which led to data corruption in some
  edge cases, add additional checks as fortification.
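
To illustrate the Vec to BTreeMap switch from the list above: a BTreeMap
keyed by the appendix offset yields its entries in ascending key order,
which removes the separate sort step. The function `restore_order` and
the use of plain path strings as values are assumptions for this sketch.

```rust
use std::collections::BTreeMap;

/// Collect (appendix offset, path) pairs and return the paths in the order
/// in which they would be restored sequentially from the appendix.
fn restore_order(entries: &[(u64, &str)]) -> Vec<String> {
    let mut by_offset = BTreeMap::new();
    for (offset, path) in entries {
        by_offset.insert(*offset, path.to_string());
    }
    // Iteration over a BTreeMap is ordered by key, here the offset.
    by_offset.into_values().collect()
}
```

Entries encountered in the order `(300, "c")`, `(0, "a")`, `(120, "b")`
come out as `a`, `b`, `c`.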

The following lists the most notable changes included in this series since
version 2:

- avoid re-indexing the same chunks multiple times in the appendix
  section by looking them up in the already present appendix chunks list
  and calculating the appendix reference offset accordingly. This now
  requires sorting entries by their appendix start offset for sequential
  restore.
- reduce appendix reference chunk fragmentation by increasing the file
  size threshold to 1k.
- Fix the WebUI's file browser and single file restore, broken in the
  previous patch series.
- fixes previous catalog and/or dynamic index downloads when either the
  backup group was empty or no archive with the same name was present in the
  backup

The following lists the most notable changes included in this series since
version 1:

- fixes and refactors the missing chunk issue by modifying the logic to
  avoid re-appending the same chunks multiple times if referenced by
  multiple consecutive files.
- fixes a performance issue with catalog lookup being the bottleneck
  for cases with directories with many entries, resulting in the
  metadata based file change detection performing worse than the regular
  mode.
- fixes the creation of multi archive backups. All of the archives use
  the same reference catalog.
- the catalog file format is pushed to version 2, including the needed
  archive offsets as well as ctime for file change detection.
- the catalog is fully backward compatible to catalog file format
  version 1, so both can be decoded by the reader. However, the new
  version of the catalog file format will be used for all new backups
- it is now possible to perform multiple consecutive runs of the backup
  with metadata based file change detection, there is no longer a need to
  perform a regular run beforehand.
- change from `incremental` flag to enum based `BackupDetectionMode`
  parameter for command invocations.
- includes a new `proxmox-backup-test-suite` binary to create and run
  benchmarks to compare the performance of the different detection modes.

An invocation of a backup run with these patches is now:
```bash
proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
```
During the first run, no reference catalog can be used, so the backup will
be a regular run. Subsequent backups will, however, use the catalog and
index files of the previous run to perform file change detection.

As benchmarks, the Linux source code as well as the COCO dataset for
computer vision and pattern recognition can be used.
The benchmarks can be performed by running:
```bash
proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
```

The above command invocations assume the default repository and credentials
to be set as environment variables; they may, however, be passed as
additional optional parameters instead.

Benchmark runs using these test data show a significant improvement in
the time needed for the backups:

For the linux source code backup:
    Completed benchmark with 5 runs for each tested mode.

    Completed regular backup with:
    Total runtime: 46.40 s
    Average: 9.28 ± 0.04 s
    Min: 9.25 s
    Max: 9.34 s

    Completed metadata detection mode backup with:
    Total runtime: 9.07 s
    Average: 1.81 ± 0.09 s
    Min: 1.70 s
    Max: 1.92 s

    Differences (metadata based - regular):
    Delta total runtime: -37.34 s (-80.46 %)
    Delta average: -7.47 ± 0.10 s (-80.46 %)
    Delta min: -7.55 s (-81.63 %)
    Delta max: -7.43 s (-79.49 %)

For the coco dataset backup:
    Completed benchmark with 5 runs for each tested mode.

    Completed regular backup with:
    Total runtime: 586.37 s
    Average: 117.27 ± 5.26 s
    Min: 108.51 s
    Max: 121.13 s

    Completed metadata detection mode backup with:
    Total runtime: 124.33 s
    Average: 24.87 ± 1.00 s
    Min: 23.71 s
    Max: 26.19 s

    Differences (metadata based - regular):
    Delta total runtime: -462.04 s (-78.80 %)
    Delta average: -92.41 ± 5.35 s (-78.80 %)
    Delta min: -84.79 s (-78.15 %)
    Delta max: -94.94 s (-78.38 %)

Above tests were performed inside a test VM backing up to a local
datastore.
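
For reference, the "Differences" lines above can be approximated from the
per-mode averages. The sketch below assumes that the deviations are
combined in quadrature and that percentages are relative to the regular
mode. This matches the reported ± values, but it is an inference from the
numbers and not taken from the benchmark code. The name `delta` is
hypothetical.

```rust
/// Difference of two "average ± deviation" pairs (metadata - regular).
/// Returns (delta, combined deviation, delta in percent of regular).
fn delta(regular: (f64, f64), metadata: (f64, f64)) -> (f64, f64, f64) {
    let diff = metadata.0 - regular.0;
    // Assumption: independent deviations are added in quadrature.
    let deviation = (regular.1.powi(2) + metadata.1.powi(2)).sqrt();
    (diff, deviation, 100.0 * diff / regular.0)
}
```

For the linux source code averages, `delta((9.28, 0.04), (1.81, 0.09))`
gives about -7.47 ± 0.10 s. The percentage computed from these rounded
averages is about -80.5 %, slightly off the reported -80.46 %, presumably
because the benchmark works on unrounded values.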

pxar:

Christian Ebner (8):
  fix #3174: decoder: factor out skip_bytes from skip_entry
  fix #3174: decoder: impl skip_bytes for sync dec
  fix #3174: encoder: calc filename + metadata byte size
  fix #3174: enc/dec: impl PXAR_APPENDIX_REF entrytype
  fix #3174: enc/dec: impl PXAR_APPENDIX entrytype
  fix #3174: encoder: helper to add to encoder position
  fix #3174: enc/dec: impl PXAR_APPENDIX_TAIL entrytype
  fix #3174: enc/dec: introduce pxar format version 2

 examples/mk-format-hashes.rs |  16 +++
 examples/pxarcmd.rs          |   4 +-
 src/accessor/mod.rs          | 132 +++++++++++++++++++++----
 src/decoder/mod.rs           |  83 ++++++++++++++--
 src/decoder/sync.rs          |   5 +
 src/encoder/aio.rs           |  60 ++++++++++--
 src/encoder/mod.rs           | 183 ++++++++++++++++++++++++++++++++++-
 src/encoder/sync.rs          |  54 +++++++++--
 src/format/mod.rs            |  29 +++++-
 src/lib.rs                   |  10 ++
 10 files changed, 526 insertions(+), 50 deletions(-)

proxmox-backup:

Christian Ebner (19):
  fix #3174: index: add fn index list from start/end-offsets
  fix #3174: index: add fn digest for DynamicEntry
  fix #3174: api: double catalog upload size
  fix #3174: catalog: introduce extended format v2
  fix #3174: archiver/extractor: impl appendix ref
  fix #3174: catalog: add specialized Archive entry
  fix #3174: extractor: impl seq restore from appendix
  fix #3174: archiver: store ref to previous backup
  fix #3174: upload stream: impl reused chunk injector
  fix #3174: chunker: add forced boundaries
  fix #3174: backup writer: inject queued chunk in upload steam
  fix #3174: archiver: reuse files with unchanged metadata
  fix #3174: specs: add backup detection mode specification
  fix #3174: client: Add detection mode to backup creation
  test-suite: add detection mode change benchmark
  test-suite: Add bin to deb, add shell completions
  catalog: fetch offset and size for files and refs
  pxar: add heuristic to reduce reused chunk fragmentation
  catalog: use format version 2 conditionally

 Cargo.toml                                    |   1 +
 Makefile                                      |  13 +-
 debian/proxmox-backup-client.bash-completion  |   1 +
 debian/proxmox-backup-client.install          |   2 +
 debian/proxmox-backup-test-suite.bc           |   8 +
 examples/test_chunk_speed2.rs                 |   9 +-
 pbs-client/src/backup_specification.rs        |  30 +
 pbs-client/src/backup_writer.rs               |  89 ++-
 pbs-client/src/catalog_shell.rs               |   3 +-
 pbs-client/src/chunk_stream.rs                |  42 +-
 pbs-client/src/inject_reused_chunks.rs        | 153 ++++
 pbs-client/src/lib.rs                         |   1 +
 pbs-client/src/pxar/create.rs                 | 401 +++++++++-
 pbs-client/src/pxar/extract.rs                | 149 +++-
 pbs-client/src/pxar/mod.rs                    |   2 +-
 pbs-client/src/pxar/tools.rs                  |   9 +
 pbs-client/src/pxar_backup_stream.rs          |   8 +-
 pbs-datastore/src/catalog.rs                  | 693 ++++++++++++++++--
 pbs-datastore/src/dynamic_index.rs            |  38 +
 pbs-datastore/src/file_formats.rs             |   3 +
 proxmox-backup-client/src/main.rs             | 190 ++++-
 proxmox-backup-test-suite/Cargo.toml          |  18 +
 .../src/detection_mode_bench.rs               | 291 ++++++++
 proxmox-backup-test-suite/src/main.rs         |  17 +
 .../src/proxmox_restore_daemon/api.rs         |  21 +-
 pxar-bin/src/main.rs                          |  23 +-
 src/api2/backup/upload_chunk.rs               |   4 +-
 src/tape/file_formats/snapshot_archive.rs     |   9 +-
 tests/catar.rs                                |   3 +
 zsh-completions/_proxmox-backup-test-suite    |  13 +
 30 files changed, 2059 insertions(+), 185 deletions(-)
 create mode 100644 debian/proxmox-backup-test-suite.bc
 create mode 100644 pbs-client/src/inject_reused_chunks.rs
 create mode 100644 proxmox-backup-test-suite/Cargo.toml
 create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
 create mode 100644 proxmox-backup-test-suite/src/main.rs
 create mode 100644 zsh-completions/_proxmox-backup-test-suite

proxmox-widget-toolkit:

Christian Ebner (1):
  file-browser: support pxar archive and fileref types

 src/Schema.js             |  2 ++
 src/window/FileBrowser.js | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

-- 
2.39.2




