From: Fabian Grünbichler
To: Christian Ebner, pbs-devel@lists.proxmox.com
Date: Fri, 05 Apr 2024 09:52:01 +0200
Subject: Re: [pbs-devel] [PATCH v3 proxmox-backup 43/58] client: pxar: implement store to insert chunks on caching
Message-ID: <171230352147.1926770.11495722007301966318@yuna.proxmox.com>
In-Reply-To: <20240328123707.336951-44-c.ebner@proxmox.com>
References: <20240328123707.336951-1-c.ebner@proxmox.com> <20240328123707.336951-44-c.ebner@proxmox.com>

Quoting Christian Ebner (2024-03-28 13:36:52)
> In preparation for the look-ahead caching used to temporarily store
> entries before encoding them in the pxar archive, being able to
> decide whether to re-use or re-encode regular file entries.
>
> Allows to insert and store reused chunks in the archiver,
> deduplicating chunks upon insert when possible.
>
> Signed-off-by: Christian Ebner
> ---
> changes since version 2:
> - Strongly adapted and refactored: keep track also of paddings
>   introduced by reusing the chunks, making a suggestion whether to
>   re-use, re-encode or check next entry based on threshold
> - completely removed code which allowed to calculate offsets based on
>   chunks found in the middle, they must either be a continuation of the
>   end or be added after, otherwise offsets are not monotonically
>   increasing, which is required for sequential restore
>
>  pbs-client/src/pxar/create.rs | 126 +++++++++++++++++++++++++++++++++-
>  1 file changed, 125 insertions(+), 1 deletion(-)
>
> diff --git a/pbs-client/src/pxar/create.rs b/pbs-client/src/pxar/create.rs
> index 335e3556f..95a91a59b 100644
> --- a/pbs-client/src/pxar/create.rs
> +++ b/pbs-client/src/pxar/create.rs
> @@ -20,7 +20,7 @@ use pathpatterns::{MatchEntry, MatchFlag, MatchList, MatchType, PatternFlag};
>  use pbs_datastore::index::IndexFile;
>  use proxmox_sys::error::SysError;
>  use pxar::accessor::aio::Accessor;
> -use pxar::encoder::{LinkOffset, SeqWrite};
> +use pxar::encoder::{LinkOffset, PayloadOffset, SeqWrite};
>  use pxar::Metadata;
>
>  use proxmox_io::vec;
> @@ -36,6 +36,128 @@ use crate::pxar::metadata::errno_is_unsupported;
>  use crate::pxar::tools::assert_single_path_component;
>  use crate::pxar::Flags;
>
> +const CHUNK_PADDING_THRESHOLD: f64 = 0.1;
> +
> +#[derive(Default)]
> +struct ReusedChunks {
> +    start_boundary: PayloadOffset,
> +    total: PayloadOffset,
> +    padding: u64,
> +    chunks: Vec<(u64, ReusableDynamicEntry)>,
> +    must_flush_first: bool,
> +    suggestion: Suggested,
> +}
> +
> +#[derive(Copy, Clone, Default)]
> +enum Suggested {
> +    #[default]
> +    CheckNext,
> +    Reuse,
> +    Reencode,
> +}

this is a bit of a misnomer - it's not a suggestion, it's what is going to
happen ;) maybe `action` or even `result` or something similar might be a
better fit? or something going in the direction of "decision"?

> +
> +impl ReusedChunks {
> +    fn new() -> Self {
> +        Self::default()
> +    }

this is not needed, can just use default()

> +
> +    fn start_boundary(&self) -> PayloadOffset {
> +        self.start_boundary
> +    }

is this needed? access is only local anyway, and we don't have a similar
helper for self.chunks ;)

> +
> +    fn is_empty(&self) -> bool {
> +        self.chunks.is_empty()
> +    }
> +
> +    fn suggested(&self) -> Suggested {
> +        self.suggestion

same question here..

> +    }
> +
> +    fn insert(
> +        &mut self,
> +        indices: Vec<ReusableDynamicEntry>,
> +        boundary: PayloadOffset,
> +        start_padding: u64,
> +        end_padding: u64,
> +    ) -> PayloadOffset {

another fn that definitely would benefit from some comments describing the
thought processes behind the logic :)

> +        if self.is_empty() {
> +            self.start_boundary = boundary;
> +        }

okay, so this is actually just set here

> +
> +        if let Some(offset) = self.last_digest_matched(&indices) {
> +            if let Some((padding, last)) = self.chunks.last_mut() {

we already know a last chunk must exist (else last_digest_matched wouldn't
have returned). that means last_digest_matched could just return the last
chunk?

> +                // Existing chunk, update padding based on pre-existing one
> +                // Start padding is expected to be larger than previous padding

should we validate this expectation? or is it already validated somewhere
else? and also, isn't start_padding by definition always smaller than
last.size()?
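e.g. something like the following right before the adjustment - untested,
just to illustrate the kind of check I mean:

    // hypothetical sanity check (sketch only): a chunk can only be continued
    // at an offset inside of it, so start_padding must not exceed its size
    debug_assert!(
        start_padding <= last.size(),
        "start padding {} exceeds size {} of the continued chunk",
        start_padding,
        last.size(),
    );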
> +                *padding += start_padding - last.size();
> +                self.padding += start_padding - last.size();

(see below) here we correct the per-chunk padding, but only for a partial
chunk that is continued..

if I understand this part correctly, here we basically want to adjust for a
potential end_padding of the last chunk, if it was:

A|A|A|P|P|P|P

and we now have

P|P|P|P|P|B|B

we want to end up with just 2 'P's in the middle? isn't last.size() -
start_padding the part of that chunk covered by the new payload? so what we
actually want is

*padding -= (last.size() - start_padding)

? IMHO that makes the intention much more readable, especially if you factor
out

let payload_size = last.size() - start_padding;
*padding -= payload_size;
self.padding -= payload_size;

if we want to be extra careful, we could even add the three checks/assertions
here:
- start_padding must be smaller than the chunk size
- both chunk and total padding must be bigger than the payload size

> +            }
> +
> +            for chunk in indices.into_iter().skip(1) {
> +                self.total = self.total.add(chunk.size());
> +                self.chunks.push((0, chunk));

here we push the second and later chunks with 0 padding

> +            }
> +
> +            if let Some((padding, _last)) = self.chunks.last_mut() {
> +                *padding += end_padding;
> +                self.padding += end_padding;

and here we account for the end_padding of the last chunk, which might
actually be the same as the first chunk, but that works out..

> +            }
> +
> +            let padding_ratio = self.padding as f64 / self.total.raw() as f64;
> +            if self.chunks.len() > 1 && padding_ratio < CHUNK_PADDING_THRESHOLD {
> +                self.suggestion = Suggested::Reuse;
> +            }

see below

> +
> +            self.start_boundary.add(offset + start_padding)
> +        } else {
> +            let offset = self.total.raw();
> +
> +            if let Some(first) = indices.first() {
> +                self.total = self.total.add(first.size());
> +                self.chunks.push((start_padding, first.clone()));
> +                // New chunk, all start padding counts
> +                self.padding += start_padding;
> +            }
> +
> +            for chunk in indices.into_iter().skip(1) {
> +                self.total = self.total.add(chunk.size());
> +                self.chunks.push((chunk.size(), chunk));

so here we count the full chunk size as padding (for the per-chunk stats), but
don't count it for the total padding? I think this should be 0 just like
above?
this and the handling of the first chunk could actually be combined:

for (idx, chunk) in indices.into_iter().enumerate() {
    self.total = self.total.add(chunk.size());
    let chunk_padding = if idx == 0 {
        self.padding += start_padding;
        start_padding
    } else {
        0
    };
    self.chunks.push((chunk_padding, chunk));
}

or we could make start_padding mut, and do

self.padding += start_padding;
for chunk in indices.into_iter() {
    self.total = self.total.add(chunk.size());
    self.chunks.push((start_padding, chunk));
    start_padding = 0; // only first chunk counts
}

> +            }
> +
> +            if let Some((padding, _last)) = self.chunks.last_mut() {
> +                *padding += end_padding;
> +                self.padding += end_padding;
> +            }
> +
> +            if self.chunks.len() > 2 {

so if we insert more than two chunks without a continuation, all of them are
accounted for as consisting entirely of padding, but they are still re-used
if start and end padding are below the threshold ;)

> +                let padding_ratio = self.padding as f64 / self.total.raw() as f64;
> +                if padding_ratio < CHUNK_PADDING_THRESHOLD {
> +                    self.suggestion = Suggested::Reuse;
> +                } else {
> +                    self.suggestion = Suggested::Reencode;
> +                }

we could just return the suggestion instead of storing it - it's only ever
needed right after `insert` anyway?

this calculation above seems to not handle some corner cases though. if I
have the following sequence

|P|A|A|A|A|A|B|P|P|P|P|P|
|    C1     |    C2     |

where P represents padding parts, A and B are file contents of two files, and
C1 and C2 are two chunks. let's say both A and B are re-usable files. first A
is resolved via the index and inserted, but since it's the first chunk, there
is no "suggestion" yet (CheckNext). now B is resolved and inserted - it
doesn't continue a partial previous chunk, so we take the else branch. but
now the (total) padding ratio is skewed because of B, even though A on its
own would have been perfectly fine to be re-used.. (let's say we then insert
another chunk different from C1 and C2; depending on the exact ratios, that
one might lead to the whole sequence being re-used or not, even though C2
should not be re-used, C1 should, and the hypothetical C3 might or might
not!)

basically, as soon as we have a break in the chunk sequence (file A followed
by file B, with no chunk overlap) we can treat each part on its own?

> +            }
> +
> +            self.start_boundary.add(offset + start_padding)
> +        }
> +    }
> +
> +    fn last_digest_matched(&self, indices: &[ReusableDynamicEntry]) -> Option<u64> {
> +        let digest = if let Some(first) = indices.first() {
> +            first.digest()
> +        } else {
> +            return None;
> +        };
> +
> +        if let Some(last) = self.chunks.last() {
> +            if last.1.digest() == digest {
> +                return Some(self.total.raw() - last.1.size());
> +            }
> +        }
> +
> +        None
> +    }
> +}
> +
>  /// Pxar options for creating a pxar archive/stream
>  #[derive(Default, Clone)]
>  pub struct PxarCreateOptions {
> @@ -147,6 +269,7 @@ struct Archiver {
>      hardlinks: HashMap,
>      file_copy_buffer: Vec<u8>,
>      skip_e2big_xattr: bool,
> +    reused_chunks: ReusedChunks,
>      forced_boundaries: Option>>>,
>  }
>
> @@ -239,6 +362,7 @@ where
>              hardlinks: HashMap::new(),
>              file_copy_buffer: vec::undefined(4 * 1024 * 1024),
>              skip_e2big_xattr: options.skip_e2big_xattr,
> +            reused_chunks: ReusedChunks::new(),
>              forced_boundaries,
>          };
>
> --
> 2.39.2
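PS: regarding the last two points (returning the decision from `insert` and
treating each contiguous chunk sequence on its own), a rough sketch of the
direction I mean - untested, `Decision` is just the renamed `Suggested` from
above, and the per-segment padding/total/chunk counters would be reset
whenever a new, non-continuing sequence starts:

    #[derive(Copy, Clone, Default)]
    enum Decision {
        #[default]
        CheckNext,
        Reuse,
        Reencode,
    }

    // decide based on the stats of the current contiguous chunk sequence
    // only, so one padding-heavy file cannot skew the ratio for an
    // unrelated earlier one
    fn decide(padding: u64, total: u64, chunks: usize) -> Decision {
        // a single chunk might still be continued by the next entry
        if chunks < 2 {
            return Decision::CheckNext;
        }

        let padding_ratio = padding as f64 / total as f64;
        if padding_ratio < CHUNK_PADDING_THRESHOLD {
            Decision::Reuse
        } else {
            Decision::Reencode
        }
    }

`insert` could then return this together with the payload offset, instead of
keeping the suggestion in the struct.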