From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Date: Mon, 9 Oct 2023 13:51:16 +0200
Message-Id: <20231009115139.1417886-1-c.ebner@proxmox.com>
Subject: [pbs-devel] [RFC v2 pxar proxmox-backup 00/23] fix #3174: improve file-level backup

Disclaimer: Version 2 of the patch series remains work in progress and is
only intended for review/testing purposes.

Changes to the patch series since version 1 are based on the feedback
obtained via mailing list or other communication channels. Many thanks to
Fabian, Wolfgang and Thomas for testing and discussions.

This series of patches prototypes a possible approach to improve the pxar
file level backup creation speed.
The current approach is to skip encoding of regular file payloads, for
which metadata (currently ctime and size) did not change as compared to a
previous backup run. Instead of re-encoding the files, a reference to a
newly introduced appendix section of the pxar archive will be written. The
appendix section will be created as concatenation of indexed chunks from
the previous backup run, thereby containing the sequential file payload at
a calculated offset with respect to the starting point of the appendix
section.
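
To make the reuse decision and the offset calculation more concrete, here
is a minimal sketch under stated assumptions. It is not the code from this
series: `CatalogEntry`, `is_unchanged`, `covering_chunks` and `reuse_plan`
are hypothetical names, and the previous index is assumed to be available
as a sorted list of chunk end offsets within the previous pxar stream.

```rust
/// Hypothetical view of what a reference catalog stores for a regular
/// file; illustration only, not the on-disk layout used by the series.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CatalogEntry {
    ctime: i64,
    size: u64,
    /// Byte offset of the file payload in the previous pxar stream.
    offset: u64,
}

/// A file counts as unchanged when both ctime and size match.
fn is_unchanged(prev: &CatalogEntry, ctime: i64, size: u64) -> bool {
    prev.ctime == ctime && prev.size == size
}

/// Find the chunks covering the byte range [start, start + size).
///
/// `chunk_ends` holds the ascending end offsets of the previous chunks.
/// Returns (first chunk index, last chunk index, padding), where padding
/// is the distance from the start of the first chunk to the payload.
fn covering_chunks(chunk_ends: &[u64], start: u64, size: u64) -> Option<(usize, usize, u64)> {
    if size == 0 {
        return None;
    }
    let end = start.checked_add(size)?;
    // First chunk whose end offset lies beyond `start`.
    let first = chunk_ends.partition_point(|&e| e <= start);
    // First chunk whose end offset reaches `end`.
    let last = chunk_ends.partition_point(|&e| e < end);
    if last >= chunk_ends.len() {
        // The range is not fully covered by the known chunks.
        return None;
    }
    let chunk_start = if first == 0 { 0 } else { chunk_ends[first - 1] };
    Some((first, last, start - chunk_start))
}

/// Decide whether a file can be referenced instead of re-encoded and, if
/// so, which previous chunks would have to end up in the appendix.
fn reuse_plan(
    prev: &CatalogEntry,
    ctime: i64,
    size: u64,
    chunk_ends: &[u64],
) -> Option<(usize, usize, u64)> {
    if is_unchanged(prev, ctime, size) {
        covering_chunks(chunk_ends, prev.offset, prev.size)
    } else {
        None
    }
}
```

In this sketch, a `None` result means the file is encoded as usual, while
`Some((first, last, padding))` names the chunk range to append and the
padding to add to the position of the first chunk within the appendix.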

Metadata comparison and calculation of the chunks to be indexed for the
appendix section is performed using the catalog of a previous backup as
reference. In order to be able to calculate the offsets, an updated catalog
file format version 2 is introduced which extends the previous version by
including the file offset with respect to the pxar archive byte stream, as
well as the file's ctime. This allows finding the required chunk indexes
and the start padding within the concatenated chunks. The catalog reader
remains backwards compatible to the catalog file format version 1.

During encoding, the chunks needed for the appendix section are injected in
the backup upload stream after forcing a chunk boundary when regular pxar
encoding is finished. Finally, pxar archives containing an appendix section
are marked as such by appending a final pxar goodbye lookup table only
containing the offset to the appendix section start and total size of that
section, needed for random access, e.g. to mount the archive via the fuse
filesystem implementation.

The following lists the most notable changes included in this series since
the previous version 1:

- fixes and refactors the missing chunk issue by modifying the logic to
  avoid re-appending the same chunks multiple times if referenced by
  multiple consecutive files.
- fixes a performance issue with catalog lookup being the bottleneck for
  cases with directories with many entries, resulting in the metadata based
  file change detection performing worse than the regular mode.
- fixes the creation of multi archive backups. All of the archives use the
  same reference catalog.
- the catalog file format is pushed to version 2, including the needed
  archive offsets as well as ctime for file change detection.
- the catalog is fully backward compatible to catalog file format version 1,
  so both can be decoded by the reader.
  However, the new version of the catalog file format will be used for all
  new backups.
- it is now possible to perform multiple consecutive runs of the backup
  with metadata based file change detection; a regular run is no longer
  needed before each of them.
- change from `incremental` flag to enum based `BackupDetectionMode`
  parameter for command invocations.
- includes a new `proxmox-backup-test-suite` binary to create and run
  benchmarks to compare the performance of the different detection modes.

An invocation of a backup run with these patches now is:
```bash
proxmox-backup-client backup <label>.pxar:<source-path> --change-detection-mode=metadata
```
During the first run, no reference catalog can be used, so the backup will
be a regular run. Following backups will however utilize the catalog and
index files of the previous run to perform file change detection.

As benchmarks, the linux source code as well as the coco dataset for
computer vision and pattern recognition can be used.
The benchmarks can be performed by running:
```bash
proxmox-backup-test-suite detection-mode-bench prepare --target /<path-to-bench-source-target>
proxmox-backup-test-suite detection-mode-bench run linux.pxar:/<path-to-bench-source-target>/linux
proxmox-backup-test-suite detection-mode-bench run coco.pxar:/<path-to-bench-source-target>/coco
```

The above command invocations assume the default repository and credentials
to be set as environment variables; they might however be passed as
additional optional parameters instead.

Initial benchmark runs using these test data show a significant improvement
in the time needed for the backups:

For the linux source code backup:

Completed benchmark with 5 runs for each tested mode.

Completed regular backup with:
Total runtime: 50.41 s
Average: 10.08 ± 0.11 s
Min: 9.93 s
Max: 10.19 s

Completed metadata detection mode backup with:
Total runtime: 8.93 s
Average: 1.79 ± 0.06 s
Min: 1.74 s
Max: 1.86 s

Differences (metadata based - regular):
Delta total runtime: -41.48 s (-82.28 %)
Delta average: -8.30 ± 0.13 s (-82.28 %)
Delta min: -8.19 s (-82.52 %)
Delta max: -8.33 s (-81.77 %)

For the coco dataset backup:

Completed benchmark with 5 runs for each tested mode.

Completed regular backup with:
Total runtime: 587.54 s
Average: 117.51 ± 2.66 s
Min: 114.32 s
Max: 119.62 s

Completed metadata detection mode backup with:
Total runtime: 111.07 s
Average: 22.21 ± 0.19 s
Min: 22.05 s
Max: 22.54 s

Differences (metadata based - regular):
Delta total runtime: -476.47 s (-81.10 %)
Delta average: -95.29 ± 2.66 s (-81.10 %)
Delta min: -92.27 s (-80.71 %)
Delta max: -97.08 s (-81.16 %)

The above tests were performed inside a test VM backing up to a local
datastore.

pxar:

Christian Ebner (7):
  fix #3174: decoder: factor out skip_bytes from skip_entry
  fix #3174: decoder: impl skip_bytes for sync dec
  fix #3174: encoder: calc filename + metadata byte size
  fix #3174: enc/dec: impl PXAR_APPENDIX_REF entrytype
  fix #3174: enc/dec: impl PXAR_APPENDIX entrytype
  fix #3174: encoder: helper to add to encoder position
  fix #3174: enc/dec: impl PXAR_APPENDIX_TAIL entrytype

 examples/mk-format-hashes.rs |  11 +++
 examples/pxarcmd.rs          |   4 +-
 src/accessor/mod.rs          |  46 +++++++++++
 src/decoder/mod.rs           |  50 ++++++++++--
 src/decoder/sync.rs          |   5 ++
 src/encoder/aio.rs           |  32 +++++++-
 src/encoder/mod.rs           | 153 ++++++++++++++++++++++++++++++++++-
 src/encoder/sync.rs          |  39 ++++++++-
 src/format/mod.rs            |  16 ++++
 src/lib.rs                   |  10 +++
 10 files changed, 348 insertions(+), 18 deletions(-)

proxmox-backup:

Christian Ebner (16):
  fix #3174: index: add fn index list from start/end-offsets
  fix #3174: index: add fn digest for DynamicEntry
  fix #3174: api: double catalog upload size
  fix #3174: catalog: introduce extended format v2
  fix #3174: archiver/extractor: impl appendix ref
  fix #3174: catalog: add specialized Archive entry
  fix #3174: extractor: impl seq restore from appendix
  fix #3174: archiver: store ref to previous backup
  fix #3174: upload stream: impl reused chunk injector
  fix #3174: chunker: add forced boundaries
  fix #3174: backup writer: inject queued chunk in upload steam
  fix #3174: archiver: reuse files with unchanged metadata
  fix #3174: schema: add backup detection mode schema
  fix #3174: client: Add detection mode to backup creation
  test-suite: add detection mode change benchmark
  test-suite: Add bin to deb, add shell completions

 Cargo.toml                                   |   1 +
 Makefile                                     |  13 +-
 debian/proxmox-backup-client.bash-completion |   1 +
 debian/proxmox-backup-client.install         |   2 +
 debian/proxmox-backup-test-suite.bc          |   8 +
 examples/test_chunk_speed2.rs                |   9 +-
 pbs-api-types/src/datastore.rs               |  41 ++
 pbs-client/src/backup_writer.rs              |  88 +--
 pbs-client/src/catalog_shell.rs              |   3 +-
 pbs-client/src/chunk_stream.rs               |  41 +-
 pbs-client/src/inject_reused_chunks.rs       | 123 ++++
 pbs-client/src/lib.rs                        |   1 +
 pbs-client/src/pxar/create.rs                | 314 +++++++++-
 pbs-client/src/pxar/extract.rs               | 141 +++++
 pbs-client/src/pxar/mod.rs                   |   2 +-
 pbs-client/src/pxar/tools.rs                 |   9 +
 pbs-client/src/pxar_backup_stream.rs         |   8 +-
 pbs-datastore/src/catalog.rs                 | 568 +++++++++++++++---
 pbs-datastore/src/dynamic_index.rs           |  37 ++
 pbs-datastore/src/file_formats.rs            |   3 +
 proxmox-backup-client/src/main.rs            | 161 ++++-
 proxmox-backup-test-suite/Cargo.toml         |  18 +
 .../src/detection_mode_bench.rs              | 288 +++++++++
 proxmox-backup-test-suite/src/main.rs        |  17 +
 .../src/proxmox_restore_daemon/api.rs        |  21 +-
 pxar-bin/src/main.rs                         |  23 +-
 src/api2/backup/upload_chunk.rs              |   4 +-
 src/tape/file_formats/snapshot_archive.rs    |   2 +-
 tests/catar.rs                               |   3 +
 zsh-completions/_proxmox-backup-test-suite   |  13 +
 30 files changed, 1789 insertions(+), 174 deletions(-)
 create mode 100644 debian/proxmox-backup-test-suite.bc
 create mode 100644 pbs-client/src/inject_reused_chunks.rs
 create mode 100644 proxmox-backup-test-suite/Cargo.toml
 create mode 100644 proxmox-backup-test-suite/src/detection_mode_bench.rs
 create mode 100644 proxmox-backup-test-suite/src/main.rs
 create mode 100644 zsh-completions/_proxmox-backup-test-suite

-- 
2.39.2