From: Fabian Grünbichler
To: Proxmox Backup Server development discussion
Date: Wed, 13 Mar 2024 12:44:03 +0100
Subject: Re: [pbs-devel] [RFC pxar proxmox-backup 00/36] fix #3174: improve file-level backup
Message-Id: <1710329165.ekeozwbmqq.astroid@yuna.none>
In-Reply-To: <20240305092703.126906-1-c.ebner@proxmox.com>

On March 5, 2024 10:26 am, Christian Ebner wrote:
> Disclaimer: These patches are work in progress and not intended for
> production use just yet. The purpose is for initial testing and review.
>
> This series of patches implements a metadata-based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
>
> The chosen approach is to split pxar archives on creation via the
> proxmox-backup-client into two separate archives and upload streams,
> one exclusive for regular file payloads, the other one for the rest
> of the pxar archive, which is mostly metadata.
>
> On consecutive runs, the metadata archive of the previous backup run,
> which is limited in size and therefore rapidly accessed, is used to
> look up and compare the metadata for entries to encode.
> This assumes that the connection speed to the Proxmox Backup Server is
> sufficiently fast, allowing the download and caching of the chunks for
> that index.
>
> Changes to regular files are detected by comparing the file's entire
> metadata object, including mtime, acls, etc. If no changes are detected,
> the previous payload index is used to look up chunks to possibly re-use
> in the payload stream of the new archive.
> In order to reduce possible chunk fragmentation, the decision whether to
> re-use or re-encode a file payload is deferred until enough information
> is gathered by adding entries to a look-ahead cache. If enough payload
> is referenced, the chunks are re-used and injected into the pxar payload
> upload stream, otherwise they are discarded and the files encoded
> regularly.

I like how this is shaping up! some high-level feedback in addition to
things noted at individual patches:

I think the two archive types should also get a proper header that has
fields like an archive version and possibly other metadata. while this
means losing concat support, this is not something we use or need
anyway. it would make the next bump a lot less painful, since the old
client can print meaningful error messages like "encountered pxar
archive v3, unsupported, please upgrade" instead of opaque "invalid
entry type , abort" (which cannot be differentiated from a corrupt
archive!).

I think the pxar/create.rs code can be simplified/refactored to make it
easier to understand, although it's probably not the easiest task.

Some (at least debug) collection of the "wasted space" in the form of
padding (i.e., all the bytes of re-used chunks that are not referenced
by this snapshot) would be nice to have. Or at least an upper bound of
that (calculating an accurate amount might be expensive for
intra-archive dedup, and also, in the real world, the actual waste
depends on other snapshots anyway..).

maybe we can also re-visit some sort of heuristic for this, so that at
least the final chunk of a file is not re-used unless it or the next
re-used file(s) make up > $threshold of the chunk.

the benchmark tool is not that meaningful without some way of testing
*changing* input data in a systematic fashion ;)

I'll give this a more in-depth spin and see what else I notice/find!

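To make the padding accounting and the threshold idea above a bit more
concrete, here is a rough sketch in Rust. Everything in it is made up for
illustration: `ReusedChunk`, `padding_upper_bound` and `worth_reusing` are
hypothetical names, not types or functions from pxar or proxmox-backup, and
the threshold is assumed to be a plain percentage of the chunk size.

```rust
/// Hypothetical record for one re-used chunk (not a type from the pxar code).
#[derive(Debug, Clone, Copy)]
pub struct ReusedChunk {
    /// Total size of the chunk in bytes.
    pub size: u64,
    /// Bytes of the chunk that the current snapshot actually references.
    pub referenced: u64,
}

impl ReusedChunk {
    /// Bytes carried along as padding, i.e. not referenced by this snapshot.
    pub fn padding(&self) -> u64 {
        self.size.saturating_sub(self.referenced)
    }

    /// Assumed heuristic: re-use the chunk only if strictly more than
    /// `threshold_percent` of its bytes are referenced.
    pub fn worth_reusing(&self, threshold_percent: u64) -> bool {
        // Widen to u128 so the multiplications cannot overflow.
        (self.referenced as u128) * 100 > (self.size as u128) * (threshold_percent as u128)
    }
}

/// Upper bound on the padding of a snapshot: every re-used chunk occurrence
/// is counted on its own, so bytes that another entry of the same archive
/// references (intra-archive dedup) are still counted as waste here.
pub fn padding_upper_bound(chunks: &[ReusedChunk]) -> u64 {
    chunks.iter().map(ReusedChunk::padding).sum()
}
```

With a fully referenced 4096-byte chunk and one where only 1024 of 4096
bytes are referenced, `padding_upper_bound` reports 3072 bytes, and with a
threshold of 50 only the first chunk passes `worth_reusing`. Whether a
percentage is the right unit for $threshold is an open question; the sketch
only shows that the bookkeeping itself is cheap.
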
> The following lists the most notable changes included in this series since
> version 1:
> - also cache pxar exclude pattern passed via cli instead of encoding
>   them directly. This led to an inconsistent archive while caching.
> - Fix the flushing of entries and chunks to inject before finishing the
>   archiver. Previously these last entries were re-encoded, now they
>   are re-used.
> - add a dedicated method and type in the decoder for decoding payload
>   references.
>
> An invocation of a backup run with these patches now is:
> ```bash
> proxmox-backup-client backup