From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pbs-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 3FDB01FF16E
	for <inbox@lore.proxmox.com>; Mon, 17 Mar 2025 15:56:31 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id A9CAB6A77;
	Mon, 17 Mar 2025 15:56:21 +0100 (CET)
Date: Mon, 17 Mar 2025 15:55:44 +0100
From: Fabian =?iso-8859-1?q?Gr=FCnbichler?= <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion <pbs-devel@lists.proxmox.com>
References: <20250310111634.162156-1-c.ebner@proxmox.com>
 <20250310111634.162156-4-c.ebner@proxmox.com>
In-Reply-To: <20250310111634.162156-4-c.ebner@proxmox.com>
MIME-Version: 1.0
User-Agent: astroid/0.16.0 (https://github.com/astroidmail/astroid)
Message-Id: <1742220221.ebt1keyfny.astroid@yuna.none>
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.043 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 RCVD_IN_VALIDITY_RPBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 RCVD_IN_VALIDITY_SAFE_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to
 Validity was blocked. See
 https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more
 information.
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [proxmox.com, datastore.rs]
Subject: Re: [pbs-devel] [PATCH v2 proxmox-backup 3/4] garbage collection:
 allow to keep track of already touched chunks
X-BeenThere: pbs-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Backup Server development discussion
 <pbs-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pbs-devel>, 
 <mailto:pbs-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pbs-devel/>
List-Post: <mailto:pbs-devel@lists.proxmox.com>
List-Help: <mailto:pbs-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel>, 
 <mailto:pbs-devel-request@lists.proxmox.com?subject=subscribe>
Reply-To: Proxmox Backup Server development discussion
 <pbs-devel@lists.proxmox.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: pbs-devel-bounces@lists.proxmox.com
Sender: "pbs-devel" <pbs-devel-bounces@lists.proxmox.com>

On March 10, 2025 12:16 pm, Christian Ebner wrote:
> Implements the `TouchedChunks` struct and methods to keep track of
> chunks already touched during garbage collection phase 1, avoiding
> repeated computationally and I/O intensive atime updates via syscalls.
> 
> By inserting a digest, the chunk is considered touched and can be
> ignored on subsequent encounters. To limit memory usage, the
> structure allows resetting the chunk status, flagging entries as seen
> prior to the reset. A subsequent insert then flags them as seen
> after the reset. Chunks not seen since a reset are cleared from the
> structure by the next reset call, removing them from memory.
> 
> This allows resetting the tracking state after each processed image
> index file, mimicking the incremental backup behaviour for known
> chunks and limiting the memory footprint.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
> changes since version 1:
> - no changes
> 
>  pbs-datastore/src/datastore.rs | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index 72bc9f77f..fdbb33a98 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -1585,3 +1585,32 @@ impl DataStore {
>          Ok(())
>      }
>  }
> +
> +struct TouchedChunks {
> +    list: HashMap<[u8; 32], bool>,
> +}
> +
> +impl TouchedChunks {
> +    fn new() -> Self {
> +        Self {
> +            list: HashMap::new(),
> +        }
> +    }
> +
> +    // Clear untouched chunks and reset the touched marker for others.
> +    fn reset(&mut self) {
> +        let mut new_list = HashMap::new();
> +        for (digest, touched) in self.list.drain() {
> +            if touched {
> +                new_list.insert(digest, false);
> +            }
> +        }
> +        self.list = new_list;

this could avoid the memory allocation (and, for bigger
indices/snapshots, probably multiple reallocations via the inserts) by
switching to `retain`, which gets a `&mut V` and can thus flip the value
of `touched` while filtering in a single pass.

despite the performance warning about `retain` visiting empty buckets as
well, it seems to be (slightly) faster for my test datastore when
benchmarking with a warmed-up cache (would be interesting to see whether
your test cases agree?).

I used the following on top of your series (the `shrink_to` is there to
reduce memory usage in case of outliers after they've been processed):

```
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index a80343d9b..d3c3f831f 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1650,13 +1650,12 @@ impl TouchedChunks {
 
     // Clear untouched chunks and reset the touched marker for others.
     fn reset(&mut self) {
-        let mut new_list = HashMap::new();
-        for (digest, touched) in self.list.drain() {
-            if touched {
-                new_list.insert(digest, false);
-            }
-        }
-        self.list = new_list;
+        self.list.retain(|_digest, touched| {
+            *touched = !*touched;
+            !*touched
+        });
+        let max_capacity = self.list.len().saturating_add(self.list.len() / 3);
+        self.list.shrink_to(max_capacity);
     }
 
     // Insert the digest in the list of touched chunks.
```

if the above is slower for your test inputs, then at least initializing
the second `HashMap` with some initial capacity (the `len` of the
previous list?) is probably sensible.

alternatively, we could also explore some sort of LRU-based approach
(maybe as a follow-up, to immediately realize the performance gains we
already know we get from the current approach)?

that might give us the option of:
- dropping the reset call/fn while still having bounded (even
  configurable?) memory overhead
- extending the skipping of touches across a broader range of
  snapshots/indices for additional performance gains (if the capacity
  is high enough)
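to make that concrete, here's a rough sketch of what I mean (type and
names are hypothetical, not proposing this exact shape): a bounded set
of touched digests that evicts the least recently inserted entry once a
fixed capacity is reached, so memory stays capped without any reset()
between index files. it uses lazy deletion — the order queue may hold
stale entries for re-inserted digests, which get skipped during
eviction:

```
use std::collections::{HashMap, VecDeque};

struct BoundedTouchedChunks {
    capacity: usize,
    /// digest -> generation of its most recent insert
    entries: HashMap<[u8; 32], u64>,
    /// insertion order; may contain stale (digest, generation) pairs
    order: VecDeque<([u8; 32], u64)>,
    generation: u64,
}

impl BoundedTouchedChunks {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            entries: HashMap::with_capacity(capacity),
            order: VecDeque::new(),
            generation: 0,
        }
    }

    /// Returns true if the digest was already tracked (touch can be
    /// skipped), false if it is new and the atime should be updated.
    fn insert(&mut self, digest: [u8; 32]) -> bool {
        self.generation += 1;
        let was_present = self.entries.insert(digest, self.generation).is_some();
        // record the refreshed position; older queue entries for this
        // digest become stale and are skipped lazily below
        self.order.push_back((digest, self.generation));
        while self.entries.len() > self.capacity {
            if let Some((old_digest, gen)) = self.order.pop_front() {
                // only evict if this queue entry is still current
                if self.entries.get(&old_digest) == Some(&gen) {
                    self.entries.remove(&old_digest);
                }
            }
        }
        was_present
    }
}
```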

> +    }
> +
> +    // Insert the digest in the list of touched chunks.
> +    // Returns true if the chunk was already present, false otherwise.
> +    fn insert(&mut self, digest: [u8; 32]) -> bool {
> +        self.list.insert(digest, true).is_some()
> +    }
> +}
> -- 
> 2.39.5


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel