Date: Thu, 28 Sep 2023 09:50:03 +0200 (CEST)
From: Christian Ebner
To: Wolfgang Bumiller
Cc: pbs-devel@lists.proxmox.com, m.carrara@proxmox.com
Message-ID: <1478379062.5245.1695887403647@webmail.proxmox.com>
References: <20230922071621.12670-1-c.ebner@proxmox.com>
 <20230922071621.12670-8-c.ebner@proxmox.com>
 <526287774.5120.1695817218589@webmail.proxmox.com>
Subject: Re: [pbs-devel] [RFC pxar 7/20] fix #3174: encoder: add helper to incr encoder pos
List-Id: Proxmox Backup Server development discussion

> On 28.09.2023 09:04 CEST Wolfgang Bumiller wrote:
>
> On Wed, Sep 27, 2023 at 02:20:18PM +0200, Christian Ebner wrote:
> >
> > > On 27.09.2023 14:07 CEST Wolfgang Bumiller wrote:
> > >
> > > 'incr' :S
> > >
> > > On Fri, Sep 22, 2023 at 09:16:08AM +0200, Christian Ebner wrote:
> > > > Adds a helper to allow to increase the encoder position by a given
> > > > size. This is used to increase the position when adding an appendix
> > > > section to the pxar stream, as these bytes are never encoded directly
> > > > but rather referenced by already existing chunks.
> > >
> > > Exposing this seems like a weird choice to me. Why exactly is this
> > > needed? Why don't we instead expose methods to actually write the
> > > appendix section instead?
> >
> > This is needed in order to increase the byte offset of the encoder itself.
> > The appendix section is a list of chunks which are injected into the chunk
> > stream on upload, but never actually consumed by the encoder and,
> > subsequently, the chunker itself. So there is no direct writing of the
> > appendix section to the stream.
> >
> > By adding the bytes, consistency with the rest of the pxar archive is
> > assured, as these chunks/bytes are present during decoding.
>
> Ah so we inject the *contents* of the old pxar archive by way of sending
> the chunks at a writing "layer" above. Initially I thought the archive
> would contain chunk ids, but this makes more sense. And is unfortunate
> for the API :-)

Yes, an initial approach was to store the chunk ids inline, but that is not
necessary and added unneeded storage overhead. As is, the chunks are appended
to a list to be injected after encoding the regular part of the archive,
while instead of the actual file payload a PXAR_APPENDIX_REF entry with the
payload size and the offset relative to the PXAR_APPENDIX entry is stored.
The appendix section then contains the concatenated referenced chunks,
allowing file payloads to be restored by sequentially skipping to the correct
offset and reading the payload from there.
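To make the mechanics a bit more concrete, here is a rough sketch; all
names, fields and signatures below are made up for illustration and are not
the actual pxar code:

    /// What a PXAR_APPENDIX_REF entry conceptually carries instead of
    /// the file payload (illustrative only).
    struct AppendixRef {
        /// Offset of the payload relative to the PXAR_APPENDIX entry.
        appendix_offset: u64,
        /// Size of the referenced file payload.
        file_size: u64,
    }

    /// Stand-in for the pxar encoder state (illustrative only).
    struct Encoder {
        /// Current byte offset in the archive being written.
        position: u64,
    }

    impl Encoder {
        /// The helper this patch is about: advance the position without
        /// writing anything, so offsets encoded afterwards account for
        /// the bytes injected out of band as the appendix.
        fn position_add(&mut self, size: u64) {
            self.position += size;
        }
    }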
> Maybe consider marking the position modification as `unsafe fn`, though?
> I mean it is a foot gun to break the resulting archive with, after all
> ;-)

You are right in that this is to be seen as an unsafe operation. Maybe,
instead of marking the function as unsafe, the interface could take the list
of chunks as input and shift the position accordingly, thereby consuming the
chunks and storing them for injection afterwards? That way the ownership of
the chunk list would be moved to the encoder rather than being part of the
archiver, as it is now. The chunk list might then be passed on from the
encoder to be injected into the backup upload stream, although I am not sure
if and how to bypass the chunker in that case.
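Something along these lines is what I have in mind; again just a sketch
building on the illustrative types from above, assuming the `Encoder` also
keeps an `appendix_chunks: Vec<ReusedChunk>` list:

    /// Stand-in for a chunk reused from the previous backup: its digest
    /// plus the size it contributes to the archive (illustrative only).
    struct ReusedChunk {
        digest: [u8; 32],
        size: u64,
    }

    impl Encoder {
        /// Alternative interface: consume the reused chunks, bump the
        /// position by their total size and keep them for later
        /// injection, instead of exposing a raw position modification.
        fn add_appendix_chunks(&mut self, chunks: Vec<ReusedChunk>) {
            let total: u64 = chunks.iter().map(|chunk| chunk.size).sum();
            self.position_add(total);
            self.appendix_chunks.extend(chunks);
        }
    }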
> But this means we don't have a direct way of creating incremental pxars
> without a PBS context, doesn't it?

This is correct. At the moment the only way to create an incremental pxar
archive is to use the PBS context. Both the index file and the catalog are
required, which could in principle also be provided via command line
parameters, but ultimately the actual chunk data is needed as well. That
data is currently only available during restore of the archive from the
backup.

> Would it make sense to have a method here which returns a Writer to
> the `EncoderOutput` where we could in theory also just "dump in"
> contents of another actual pxar file (where the byte counting happens
> implicitly), which also has an extra `unsafe fn add_out_of_band_bytes()`
> to do the raw byte count modification?

Yes, this might be possible, but for creating the backup I want to avoid
that completely. It would require downloading the chunk data just to inject
it for reuse, which is probably far more expensive and defeats the purpose
of reusing the chunks to begin with. If you intended this as an addition to
the current code, in order to create a pxar archive with an appendix
locally, without the PBS context, then yes. This might be possible by
passing the data in the form of a `MergedKnownChunk`, which contains either
the raw chunk data or the reused chunk's hash and size, allowing either the
data or the digest needed to index it to be passed.

> One advantage of having a "starting point" for this type of operation is
> that we'd also force a `flush()` before out-of-band data gets written.
>
> Otherwise, if we don't need/want this, we should probably just add a
> `flush()` to the encoder which we should call before adding any chunks
> out of band, given that Max already tried to sneak BufRead/Writers into
> the pxar crate for optimization purposes, IIRC ;-)

Good point, flushing is definitely required if writes are buffered, so the
byte stream does not get broken.
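Building on the illustrative sketch from above, the ordering for a buffered
writer would then roughly be (again purely hypothetical names):

    impl Encoder {
        /// Illustrative only: flush buffered writes before the reused
        /// chunks get injected out of band, then hand the collected
        /// chunks over for injection into the upload stream.
        fn finish_appendix<W: std::io::Write>(
            &mut self,
            output: &mut W,
        ) -> std::io::Result<Vec<ReusedChunk>> {
            // Make sure everything encoded so far has hit the output,
            // otherwise buffered bytes would end up after the injected
            // chunks and break the byte stream.
            output.flush()?;
            Ok(std::mem::take(&mut self.appendix_chunks))
        }
    }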