Date: Thu, 28 Sep 2023 10:32:56 +0200
From: Wolfgang Bumiller
To: Christian Ebner
Cc: pbs-devel@lists.proxmox.com
Subject: Re: [pbs-devel] [RFC pxar 7/20] fix #3174: encoder: add helper to incr encoder pos

On Thu, Sep 28, 2023 at 09:50:03AM +0200, Christian Ebner wrote:
>
> > On 28.09.2023 09:04 CEST Wolfgang Bumiller wrote:
> >
> >
> > On Wed, Sep 27, 2023 at 02:20:18PM +0200, Christian Ebner wrote:
> > >
> > > > On 27.09.2023 14:07 CEST Wolfgang Bumiller wrote:
> > > >
> > > >
> > > > 'incr' :S
> > > >
> > > > On Fri, Sep 22, 2023 at 09:16:08AM +0200, Christian Ebner wrote:
> > > > > Adds a helper to allow to increase the encoder position by a given
> > > > > size. This is used to increase the position when adding an appendix
> > > > > section to the pxar stream, as these bytes are never encoded directly
> > > > > but rather referenced by already existing chunks.
> > > >
> > > > Exposing this seems like a weird choice to me. Why exactly is this
> > > > needed? Why don't we instead expose methods to actually write the
> > > > appendix section instead?
> > >
> > > This is needed in order to increase the byte offset of the encoder itself.
> > > The appendix section is a list of chunks which are injected in the chunk
> > > stream on upload, but never really consumed by the encoder and subsequently
> > > the chunker itself. So there is no direct writing of the appendix section to
> > > the stream.
> > >
> > > By adding the bytes, consistency with the rest of the pxar archive is assured,
> > > as these chunks/bytes are present during decoding.
> >
> > Ah so we inject the *contents* of the old pxar archive by way of sending
> > the chunks a writing "layer" above. Initially I thought the archive
> > would contain chunk ids, but this makes more sense. And is unfortunate
> > for the API :-)
>
> Yes, an initial approach was to store the chunk ids inline, but that is not
> necessary and added unneeded storage overhead. As is, the chunks are appended
> to a list to be injected after encoding the regular part of the archive,
> while instead of the actual file payload the PXAR_APPENDIX_REF entry with
> payload size and offset relative to the PXAR_APPENDIX entry is stored.
>
> This section then contains the concatenated referenced chunks, allowing to
> restore file payloads by sequentially skipping to the correct offset and
> restoring the payload from there.
>
> >
> > Maybe consider marking the position modification as `unsafe fn`, though?
> > I mean it is a foot gun to break the resulting archive with, after all
> > ;-)
>
> You are right in that this is to be seen as an unsafe operation. Maybe instead
> of the function to be unsafe, the interface could take the list of chunks as
> input and shift the position accordingly?
> Thereby consuming the chunks and store them for injection afterwards.
>
> That way the ownership of the chunk list would be moved to the encoder rather than
> being part of the archiver, as is now. The chunk list might then be passed from the
> encoder to be injected to the backup upload stream, although I am not sure if and
> how to bypass the chunker in that case.
>
> >
> > But this means we don't have a direct way of creating incremental pxars
> > without a PBS context, doesn't it?
>
> This is correct. At the moment the only way to create an incremental pxar
> archive is to use the PBS context. Both, index file and catalog are required,
> which could in principle also be provided by a command line parameter, but
> finally also the actual chunk data is needed. That is currently only provided
> during restore of the archive from backup.
>
> > Would it make sense to have a method here which returns a Writer to
> > the `EncoderOutput` where we could in theory also just "dump in"
> > contents of another actual pxar file (where the byte counting happens
> > implicitly), which also has an extra `unsafe fn add_out_of_band_bytes()`
> > to do the raw byte count modification?
>
> Yes, this might be possible, but for creating the backup I completely want to
> avoid that. This would require to download the chunk data just to inject it
> for reuse, which is probably way more expensive and defies the purpose of
> reusing the chunks to begin with.

No, I meant that the `unsafe fn add_out_of_band_bytes()` was supposed to
bump just the counter, exactly as we do now, with its `Write` interface
being specifically only for *non*-PBS backup creation. But we don't need
to flesh out the non-PBS-related API right now at all. My main concern
was to make the pxar API more difficult to use wrongly, specifically the
flushing ;-)

But sure, the PBS part is so separated from the pxar code that there's
never anything preventing you from inserting bogus data into the stream
anyway, I suppose... but that's on the PBS code side and doesn't really
need to be taken into account from the pxar crate's API point of view.

>
> If you intended this to be an addition to the current code, in order to create
> a pxar archive with appendix locally, without the PBS context, then yes.
> This might be possible by passing the data in form of a `MergedKnownChunk`,
> which contains either the raw chunk data or the reused chunks hash and size,
> allowing to pass either the data or the digest needed to index it.
>
> >
> > One advantage of having a "starting point" for this type of operation is
> > that we'd also force a `flush()` before out-of-band data gets written.
> > Otherwise, if we don't need/want this, we should probably just add a
> > `flush()` to the encoder we should call before adding any chunks out of
> > band, given that Max already tried to sneak in a BufRead/Writers into
> > the pxar crate for optimization purposes, IIRC ;-)
>
> Good point, flushing is definitely required if writes will be buffered to
> not break the byte stream.
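
To make the flushing concern concrete, here is a rough sketch of what I have
in mind. It is not the pxar crate's actual code: `CountingEncoder`,
`write_bytes()`, `position()` and `inner()` are hypothetical names made up for
illustration, and only `add_out_of_band_bytes()` is the name proposed in this
thread. It assumes the output is buffered, which is exactly the case where
forgetting the flush would reorder bytes relative to the injected chunks.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical stand-in for the encoder's output state (NOT the real pxar API).
pub struct CountingEncoder<W: Write> {
    output: BufWriter<W>,
    position: u64,
}

impl<W: Write> CountingEncoder<W> {
    pub fn new(output: W) -> Self {
        Self { output: BufWriter::new(output), position: 0 }
    }

    /// Regular path: bytes go through the encoder and are counted implicitly.
    pub fn write_bytes(&mut self, data: &[u8]) -> io::Result<()> {
        self.output.write_all(data)?;
        self.position += data.len() as u64;
        Ok(())
    }

    /// Current byte offset in the resulting archive.
    pub fn position(&self) -> u64 {
        self.position
    }

    /// Peek at the underlying writer (only bytes already flushed are visible).
    pub fn inner(&self) -> &W {
        self.output.get_ref()
    }

    /// Account for `size` bytes that are injected into the stream by a layer
    /// above the encoder and never pass through it.
    ///
    /// # Safety
    ///
    /// This is not about memory safety. The caller must really inject exactly
    /// `size` bytes at this point of the stream, otherwise all later offsets
    /// in the archive are wrong.
    pub unsafe fn add_out_of_band_bytes(&mut self, size: u64) -> io::Result<()> {
        // Flush first, so buffered bytes cannot end up after the injected data.
        self.output.flush()?;
        self.position = self.position.checked_add(size).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "encoder position overflow")
        })?;
        Ok(())
    }
}
```

With a `Vec<u8>` as output, a small `write_bytes()` call stays in the buffer
until `add_out_of_band_bytes()` is called, at which point it is flushed and
only then the counter is bumped. Whether `unsafe` is the right marker for a
"can corrupt the archive" contract is of course debatable, since it is
normally reserved for memory safety.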