From: Fabian Grünbichler
To: Proxmox Backup Server development discussion
Date: Mon, 13 Nov 2023 15:23:21 +0100
Subject: Re: [pbs-devel] [PATCH-SERIES v4 pxar proxmox-backup proxmox-widget-toolkit 00/26] fix #3174: improve file-level backup
In-Reply-To: <20231109184614.1611127-1-c.ebner@proxmox.com>
Message-Id: <1699880752.fodcayz7zn.astroid@yuna.none>

On November 9, 2023 7:45 pm, Christian Ebner wrote:
> Changes to the patch series since version 3 are based on the feedback
> obtained via internal communication channels. Many thanks to Thomas,
> Fabian, Wolfgang and Dominik for their continuous feedback up until
> now.
>
> This series of patches implements an metadata based file change
> detection mechanism for improved pxar file level backup creation speed
> for unchanged files.
>
> The chosen approach is to skip encoding of regular file payloads,
> for which metadata (currently ctime and size) did not change as
> compared to a previous backup run. Instead of re-encoding the files, a
> reference to a newly introduced appendix section of the pxar archive
> will be written. The appendix section will be created as concatenation
> of indexed chunks from the previous backup run, thereby containing the
> sequential file payload at a calculated offset with respect to the
> starting point of the appendix section.
>
> Metadata comparison and calculation of the chunks to be indexed for the
> appendix section is performed using the catalog of a previous backup as
> reference. In order to be able to calculate the offsets, an updated
> catalog file format version 2 is introduced which extends the previous
> version by including the file offset with respect to the pxar archive
> byte stream, as well as the files ctime. This allows to find the required
> chunks indexes and the start padding within the concatenated chunks.
> The catalog reader remains backwards compatible to the catalog file
> format version 1.
>
> During encoding, the chunks needed for the appendix section are injected
> in the backup upload stream after forcing a chunk boundary when regular
> pxar encoding is finished. Finally, the pxar archive containing an
> appendix section are marked as such by appending a final pxar goodbye
> lookup table only containing the offset to the appendix section start and
> total size of that section, needed for random access as e.g. to mount
> the archive via the fuse filesystem implementation.

some (high-level) comments focused on compatibility:

the catalog v2 format is used unconditionally at the moment. IMHO it
should be guarded/opt-in via --change-detection-method, since old
clients cannot parse it. else, the following would happen if a client
system upgrades:

- pre-upgrade backup (readable by all clients)
- upgrade
- post-upgrade backup *with --c-d-m data* (readable by all clients, but
  everything catalog related only works with new clients)
- post-upgrade backup *with --c-d-m metadata* (readable by new clients
  only)

since the pxar format itself also changes (new entry types), it should
also be bumped (see below). if the new formats are then only used with
the new metadata mode, both new formats are effectively opt-in (until we
make that the default mode).

having the incompatibility between old and new clients encoded right in
the magic value in the header also means we don't spend time downloading
indices and chunks only to notice at some random point within the
restore that we actually don't know how to parse this particular pxar
archive.
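to illustrate the "fail early on the magic value" point, here is a
minimal sketch. it is not code from the series: the 8-byte magic
constants, the type names and check_magic() are all made up for
illustration, the real values live in the pxar/pbs sources. the idea is
just that the first bytes of the header already tell a client whether it
can parse the rest, and that a client can distinguish "known but
unsupported" from "unexpected":

```rust
use std::io::{self, Read};

// Hypothetical magic values, chosen only for this sketch.
pub const CATALOG_MAGIC_V1: [u8; 8] = *b"CATALOG1";
pub const CATALOG_MAGIC_V2: [u8; 8] = *b"CATALOG2";

#[derive(Debug, PartialEq, Eq)]
pub enum CatalogVersion {
    V1,
    V2,
}

#[derive(Debug)]
pub enum MagicError {
    /// Reading the header failed (e.g. the input is shorter than 8 bytes).
    Io(io::Error),
    /// The magic is a format this client knows about but cannot parse.
    KnownUnsupported([u8; 8]),
    /// The magic is not known to this client at all.
    Unexpected([u8; 8]),
}

/// Read the first 8 bytes and decide whether this client can continue.
/// `supports_v2` models an old (false) versus a new (true) client.
pub fn check_magic<R: Read>(
    mut reader: R,
    supports_v2: bool,
) -> Result<CatalogVersion, MagicError> {
    let mut magic = [0u8; 8];
    reader.read_exact(&mut magic).map_err(MagicError::Io)?;
    match magic {
        CATALOG_MAGIC_V1 => Ok(CatalogVersion::V1),
        CATALOG_MAGIC_V2 if supports_v2 => Ok(CatalogVersion::V2),
        CATALOG_MAGIC_V2 => Err(MagicError::KnownUnsupported(magic)),
        _ => Err(MagicError::Unexpected(magic)),
    }
}
```

an old client with the magic values backported would correspond to
calling check_magic(reader, false): it gets KnownUnsupported for a v2
catalog before touching any further data, instead of a generic error
somewhere in the middle of the restore.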

an additional bonus point - tools like pxar and proxmox-backup-debug
could also list the raw+parsed magic value, and in general, error
messages like:

Error: got unexpected magic number for catalog

are a lot easier to grasp than (pxar extract)

Error: encountered unexpected error during extraction

or (proxmox-backup-client restore)

Error: error extracting archive - encountered unexpected error during extraction

the magic values could also be backported to the oldstable client
version, to make the error messages even better ("known unsupported" vs
"unexpected").

in general, UX wise it might be nice to mark backups using the new mode,
although I am not sure how specifically (some variants - just the
version/mode, archives, archives+snapshots, ..?).

one more peculiarity I noted while testing - doing three backups in a
row without changing the input tree at all:

- old client
- new client, mode data
- new client, mode metadata

the last snapshot has a bigger "logical" size, e.g., when doing this for
my kernel clone (6.8G), the first two have a logical size of 7.736 GiB,
while the last one is 8.064 GiB. for smaller input dirs, the effect is
even more pronounced, a 56M dir with 10 dirs with one file each is
listed as 55M for the first two, and 97.989 MiB for the last one (almost
double the size!). the resulting pxar archives are actually this size, I
guess there is some optimization potential still left for this
particular case. the actual (deduplicated) difference is just two (small
test case) / eight (linux) very small chunks, so this issue is mostly
cosmetic I hope unless one really goes down the "download pxar file,
extract manually" route.

I hope to do some more in-depth testing and code review over the course
of the week!