public inbox for pbs-devel@lists.proxmox.com
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [RFC pxar proxmox-backup 00/36] fix #3174: improve file-level backup
Date: Wed, 13 Mar 2024 12:44:03 +0100
Message-ID: <1710329165.ekeozwbmqq.astroid@yuna.none>
In-Reply-To: <20240305092703.126906-1-c.ebner@proxmox.com>

On March 5, 2024 10:26 am, Christian Ebner wrote:
> Disclaimer: These patches are work in progress and not intended for
> production use just yet. The purpose is for initial testing and review.
> 
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
> 
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate archives and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
> 
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and can therefore be accessed rapidly, is used
> to look up and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and caching of the chunks for
> that index.
> 
> Changes to regular files are detected by comparing all of the file's
> metadata, including mtime, ACLs, etc. If no changes are detected,
> the previous payload index is used to look up chunks to possibly re-use
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> re-use or re-encode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If enough payload
> is referenced, the chunks are re-used and injected into the pxar payload
> upload stream, otherwise they are discarded and the files encoded
> regularly.

I like how this is shaping up!

some high-level feedback in addition to things noted at individual
patches:

I think the two archive types should also get a proper header that has
fields like an archive version and possibly other metadata. while this
means losing concat support, this is not something we use or need
anyway. it would make the next bump a lot less painful, since the old
client can print meaningful error messages like "encountered pxar
archive v3, unsupported, please upgrade" instead of opaque "invalid
entry type <magic blob>, abort" (which cannot be differentiated from a
corrupt archive!).
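
To illustrate what I mean, here is a rough sketch of such a version
check. Everything in it is hypothetical (the magic value, the field
layout, the names and the supported version number are made up for
illustration and are not part of the actual pxar format):

```rust
use std::fmt;

/// Hypothetical magic value, NOT an actual pxar format hash.
const HYPOTHETICAL_MAGIC: [u8; 8] = *b"PXARHDR\0";
/// Highest archive version this (hypothetical) client understands.
const MAX_SUPPORTED_VERSION: u32 = 2;

#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u32),
}

impl fmt::Display for HeaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HeaderError::TooShort => write!(f, "truncated archive header"),
            HeaderError::BadMagic => write!(f, "not a pxar archive (or corrupt header)"),
            HeaderError::UnsupportedVersion(v) => write!(
                f,
                "encountered pxar archive v{}, unsupported, please upgrade",
                v
            ),
        }
    }
}

/// Parse an assumed layout of 8 bytes magic followed by a
/// little-endian u32 version and return the version if supported.
fn parse_header(data: &[u8]) -> Result<u32, HeaderError> {
    if data.len() < 12 {
        return Err(HeaderError::TooShort);
    }
    if &data[..8] != &HYPOTHETICAL_MAGIC[..] {
        return Err(HeaderError::BadMagic);
    }
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&data[8..12]);
    let version = u32::from_le_bytes(raw);
    if version > MAX_SUPPORTED_VERSION {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    Ok(version)
}
```

The point is only that a too-new archive and a corrupt one end up in
different error variants, so the client can tell the user which of the
two happened.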

I think the pxar/create.rs code can be simplified/refactored to make it
easier to understand, although it's probably not the easiest task.

Some (at least debug) collection of the "wasted space" in the form of
padding (i.e., all the bytes of re-used chunks that are not referenced
by this snapshot) would be nice to have. Or at least an upper bound of
that (calculating an accurate amount might be expensive for
intra-archive dedup, and in the real world the actual waste depends
on other snapshots anyway). maybe we can also re-visit some sort of
heuristic for this, so that at least the final chunk of a file is not
re-used unless it or the next re-used file(s) make up > $threshold of
the chunk.
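
A minimal sketch of both ideas, assuming the archiver can record the
sizes of the chunks it injects and the sizes of the file payloads it
references within them. The function names and the way the threshold
is expressed (as a fraction of the chunk size) are assumptions for
illustration, not something taken from the series:

```rust
/// Upper bound on padding in the new payload stream: all bytes of
/// injected (re-used) chunks minus the payload bytes referenced in them.
/// This ignores dedup, so the space actually wasted on the datastore
/// can only be lower.
fn padding_upper_bound(injected_chunk_sizes: &[u64], referenced_payload_sizes: &[u64]) -> u64 {
    let injected: u64 = injected_chunk_sizes.iter().sum();
    let referenced: u64 = referenced_payload_sizes.iter().sum();
    injected.saturating_sub(referenced)
}

/// Hypothetical heuristic: only re-use the final chunk of a file if the
/// referenced bytes in it (from this file and the following re-used
/// files) exceed `threshold`, given as a fraction of the chunk size.
fn reuse_final_chunk(chunk_size: u64, referenced_in_chunk: u64, threshold: f64) -> bool {
    chunk_size > 0 && (referenced_in_chunk as f64) / (chunk_size as f64) > threshold
}
```

For two injected 4096 byte chunks with 1000 + 3000 + 500 referenced
bytes this reports 3692 bytes of padding, and with a threshold of 0.5 a
final chunk with only 1024 of 4096 bytes referenced would be re-encoded
rather than re-used.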

the benchmark tool is not that meaningful without some way of testing
*changing* input data in a systematic fashion ;)
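
One possible way to make this systematic (just a sketch, the helper
name and the percent-based interface are made up here): pick a fixed,
evenly spread subset of the files to modify between runs, so that a
given change ratio always touches the same files and results stay
comparable:

```rust
/// Return the indices of the files to modify so that `percent_changed`
/// percent of `file_count` files (rounded down) are selected, spread
/// evenly over the file list and fully deterministic.
fn files_to_modify(file_count: usize, percent_changed: usize) -> Vec<usize> {
    (0..file_count)
        // select index i whenever the running quota crosses an integer
        .filter(|&i| i * percent_changed / 100 != (i + 1) * percent_changed / 100)
        .collect()
}
```

With 10 files and 30 percent this selects the files at index 3, 6 and
9. The benchmark could then rewrite or touch those files before each
metadata detection run and repeat that for a few different ratios.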

I'll give this a more in-depth spin and see what else I notice/find!

> The following lists the most notable changes included in this series since
> the version 1:
> - also cache pxar exclude patterns passed via cli instead of encoding
>   them directly. This led to an inconsistent archive while caching.
> - Fix the flushing of entries and chunks to inject before finishing the
>   archiver. Previously these last entries have been re-encoded, now they
>   are re-used.
> - add a dedicated method and type in the decoder for decoding payload
>   references.
> 
> An invocation of a backup run with these patches is now:
> ```bash
> proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
> ```
> During the first run, no reference index is available; the pxar archive
> will nevertheless be split into the two parts.
> Following backups will then utilize the pxar archive accessor and
> index files of the previous run to perform file change detection.
> 
> As benchmarks, the Linux source code as well as the COCO dataset for
> computer vision and pattern recognition can be used.
> The benchmarks can be performed by running:
> ```bash
> proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
> proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
> proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
> ```
> 
> The above command invocations assume the default repository and
> credentials to be set as environment variables; they may, however, be
> passed as additional optional parameters instead.
> 
> Benchmark runs using these test data show a significant improvement in
> the time needed for the backups. Note that all of these results were
> obtained against a local PBS instance within a VM, thereby minimizing
> possible influences by the network.
> 
> For the linux source code backup:
>     Completed benchmark with 5 runs for each tested mode.
> 
>     Completed regular backup with:
>     Total runtime: 51.31 s
>     Average: 10.26 ± 0.12 s
>     Min: 10.16 s
>     Max: 10.46 s
> 
>     Completed metadata detection mode backup with:
>     Total runtime: 4.89 s
>     Average: 0.98 ± 0.02 s
>     Min: 0.95 s
>     Max: 1.00 s
> 
>     Differences (metadata based - regular):
>     Delta total runtime: -46.42 s (-90.47 %)
>     Delta average: -9.28 ± 0.12 s (-90.47 %)
>     Delta min: -9.21 s (-90.64 %)
>     Delta max: -9.46 s (-90.44 %)
> 
> For the coco dataset backup:
>     Completed benchmark with 5 runs for each tested mode.
> 
>     Completed regular backup with:
>     Total runtime: 520.72 s
>     Average: 104.14 ± 0.79 s
>     Min: 103.44 s
>     Max: 105.49 s
> 
>     Completed metadata detection mode backup with:
>     Total runtime: 6.95 s
>     Average: 1.39 ± 0.23 s
>     Min: 1.26 s
>     Max: 1.79 s
> 
>     Differences (metadata based - regular):
>     Delta total runtime: -513.76 s (-98.66 %)
>     Delta average: -102.75 ± 0.83 s (-98.66 %)
>     Delta min: -102.18 s (-98.78 %)
>     Delta max: -103.69 s (-98.30 %)
> 
> This series of patches implements an alternative, but more promising
> approach to the series presented previously [0], with the intention to
> solve the same issue with fewer changes required to the pxar format and
> to be more efficient.
> 
> [0] https://lists.proxmox.com/pipermail/pbs-devel/2024-January/007693.html
> 
> pxar:
> 
> Christian Ebner (10):
>   format/examples: add PXAR_PAYLOAD_REF entry header
>   encoder: add optional output writer for file payloads
>   format/decoder: add method to read payload references
>   decoder: add optional payload input stream
>   accessor: add optional payload input stream
>   encoder: move to stack based state tracking
>   encoder: add payload reference capability
>   encoder: add payload position capability
>   encoder: add payload advance capability
>   encoder/format: finish payload stream with marker
> 
>  examples/mk-format-hashes.rs |  10 +
>  examples/pxarcmd.rs          |   6 +-
>  src/accessor/aio.rs          |   7 +
>  src/accessor/mod.rs          |  85 ++++++++-
>  src/decoder/mod.rs           |  92 ++++++++-
>  src/decoder/sync.rs          |   7 +
>  src/encoder/aio.rs           |  52 +++--
>  src/encoder/mod.rs           | 357 +++++++++++++++++++++++++----------
>  src/encoder/sync.rs          |  45 ++++-
>  src/format/mod.rs            |  10 +
>  src/lib.rs                   |   3 +
>  11 files changed, 534 insertions(+), 140 deletions(-)
> 
> proxmox-backup:
> 
> Christian Ebner (26):
>   client: pxar: switch to stack based encoder state
>   client: backup: factor out extension from backup target
>   client: backup: early check for fixed index type
>   client: backup: split payload to dedicated stream
>   client: restore: read payload from dedicated index
>   tools: cover meta extension for pxar archives
>   restore: cover meta extension for pxar archives
>   client: mount: make split pxar archives mountable
>   api: datastore: refactor getting local chunk reader
>   api: datastore: attach optional payload chunk reader
>   catalog: shell: factor out pxar fuse reader instantiation
>   catalog: shell: redirect payload reader for split streams
>   www: cover meta extension for pxar archives
>   index: fetch chunk form index by start/end-offset
>   upload stream: impl reused chunk injector
>   client: chunk stream: add chunk injection queues
>   client: implement prepare reference method
>   client: pxar: implement store to insert chunks on caching
>   client: pxar: add previous reference to archiver
>   client: pxar: add method for metadata comparison
>   specs: add backup detection mode specification
>   pxar: caching: add look-ahead cache types
>   client: pxar: add look-ahead caching
>   fix #3174: client: pxar: enable caching and meta comparison
>   test-suite: add detection mode change benchmark
>   test-suite: Add bin to deb, add shell completions
> 
>  Cargo.toml                                    |   1 +
>  Makefile                                      |  13 +-
>  debian/proxmox-backup-client.bash-completion  |   1 +
>  debian/proxmox-backup-client.install          |   2 +
>  debian/proxmox-backup-test-suite.bc           |   8 +
>  examples/test_chunk_speed2.rs                 |  10 +-
>  pbs-client/src/backup_specification.rs        |  53 ++
>  pbs-client/src/backup_writer.rs               |  89 ++-
>  pbs-client/src/chunk_stream.rs                |  42 +-
>  pbs-client/src/inject_reused_chunks.rs        | 152 +++++
>  pbs-client/src/lib.rs                         |   1 +
>  pbs-client/src/pxar/create.rs                 | 620 +++++++++++++++++-
>  pbs-client/src/pxar/look_ahead_cache.rs       |  40 ++
>  pbs-client/src/pxar/mod.rs                    |   3 +-
>  pbs-client/src/pxar_backup_stream.rs          |  54 +-
>  pbs-client/src/tools/mod.rs                   |   2 +-
>  pbs-datastore/src/dynamic_index.rs            |  55 ++
>  proxmox-backup-client/src/catalog.rs          |  73 ++-
>  proxmox-backup-client/src/main.rs             | 280 +++++++-
>  proxmox-backup-client/src/mount.rs            |  56 +-
>  proxmox-backup-test-suite/Cargo.toml          |  18 +
>  .../src/detection_mode_bench.rs               | 294 +++++++++
>  proxmox-backup-test-suite/src/main.rs         |  17 +
>  proxmox-file-restore/src/main.rs              |  11 +-
>  .../src/proxmox_restore_daemon/api.rs         |  16 +-
>  pxar-bin/src/main.rs                          |   7 +-
>  src/api2/admin/datastore.rs                   |  45 +-
>  tests/catar.rs                                |   4 +
>  www/datastore/Content.js                      |   6 +-
>  zsh-completions/_proxmox-backup-test-suite    |  13 +
>  30 files changed, 1827 insertions(+), 159 deletions(-)
>  create mode 100644 debian/proxmox-backup-test-suite.bc
>  create mode 100644 pbs-client/src/inject_reused_chunks.rs
>  create mode 100644 pbs-client/src/pxar/look_ahead_cache.rs
>  create mode 100644 proxmox-backup-test-suite/Cargo.toml
>  create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
>  create mode 100644 proxmox-backup-test-suite/src/main.rs
>  create mode 100644 zsh-completions/_proxmox-backup-test-suite
> 