public inbox for pbs-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Christian Ebner <c.ebner@proxmox.com>
To: "Proxmox Backup Server development discussion"
	<pbs-devel@lists.proxmox.com>,
	"Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Subject: Re: [pbs-devel] [PATCH v2 proxmox-backup 3/4] garbage collection: allow to keep track of already touched chunks
Date: Mon, 17 Mar 2025 16:39:34 +0100	[thread overview]
Message-ID: <28ba79b3-3d84-4f95-ac73-3e700766ef15@proxmox.com> (raw)
In-Reply-To: <1742220221.ebt1keyfny.astroid@yuna.none>

On 3/17/25 15:55, Fabian Grünbichler wrote:
> On March 10, 2025 12:16 pm, Christian Ebner wrote:
>> Implements the `TouchedChunks` struct and methods to keep track of
>> already touched chunks during garbage collection phase 1, to avoid
>> multiple computational and I/O intensive atime updates via a syscall.
>>
>> By inserting a digest, the chunk will be considered as touched and
>> can be ignored for subsequent encounters. To limit memory usage, the
>> structure allows to reset the chunk status, flagging them as seen
>> previous to the reset. A subsequent insert will then flag it as seen
>> after the reset. Chunks not seen after a reset, will be cleared from
>> the structure by the next reset call, eliminating them from memory.
>>
>> This allows to reset the tracking stat after each processes image
>> index file, to mimic the incremental backup behaviour of known chunks
>> and limit memory footprint.
>>
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> changes since version 1:
>> - no changes
>>
>>   pbs-datastore/src/datastore.rs | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index 72bc9f77f..fdbb33a98 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -1585,3 +1585,32 @@ impl DataStore {
>>           Ok(())
>>       }
>>   }
>> +
>> +struct TouchedChunks {
>> +    list: HashMap<[u8; 32], bool>,
>> +}
>> +
>> +impl TouchedChunks {
>> +    fn new() -> Self {
>> +        Self {
>> +            list: HashMap::new(),
>> +        }
>> +    }
>> +
>> +    // Clear untouched chunks and reset the touched marker for others.
>> +    fn reset(&mut self) {
>> +        let mut new_list = HashMap::new();
>> +        for (digest, touched) in self.list.drain() {
>> +            if touched {
>> +                new_list.insert(digest, false);
>> +            }
>> +        }
>> +        self.list = new_list;
> 
> this could avoid the memory allocation (and for bigger
> indices/snapshots, probably multiple reallocations via the inserts) by
> switching to `retain` (which gets a `&mut V` and can thus flip the value
> of touched while filtering in a single pass).
> 
> despite the performance warning about it visiting empty buckets as well,
> this seems to be (slightly) faster for my test datastore (would be
> interesting if your test cases agree?) when benchmarking with a warmed
> up cache.
> 
> I used the following on-top of your series (the shrink_to is to reduce
> memory usage in case of outliers after they've been processed):
> 
> ```
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index a80343d9b..d3c3f831f 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -1650,13 +1650,12 @@ impl TouchedChunks {
>   
>       // Clear untouched chunks and reset the touched marker for others.
>       fn reset(&mut self) {
> -        let mut new_list = HashMap::new();
> -        for (digest, touched) in self.list.drain() {
> -            if touched {
> -                new_list.insert(digest, false);
> -            }
> -        }
> -        self.list = new_list;
> +        self.list.retain(|_digest, touched| {
> +            *touched = !*touched;
> +            !*touched
> +        });
> +        let max_capacity = self.list.len().saturating_add(self.list.len() / 3);
> +        self.list.shrink_to(max_capacity);
>       }
>   
>       // Insert the digest in the list of touched chunks.
> ```
> 
> if the above is slower for your test inputs, then at least initializing
> the second HashMap with some initial capacity (the len of the previous
> list?) is probably sensible..

Okay, will check and incorporate these suggestions, thanks!

> 
> alternatively, we could also explore (maybe as a follow-up to
> immediately realize the performance gains we already know we get from
> the current aproach?) some sort of LRU-based approach?

Yeah, had something like that in mind as well because of the recent work 
on that. This would allow to better control the overall memory 
requirements for the garbage collection task as well, as it then does 
not depend on the index files. Keeping the cache capacity large is of 
course a pre-condition for that.

> 
> that might give us the option of:
> - dropping the reset call/fn while still having bounded (even
>    configurable?) memory overhead
> - extend skipping touches across a broader range of snapshots/indices
>    for additional performance gains (if the capacity is high enough)
> 
>> +    }
>> +
>> +    // Insert the digest in the list of touched chunks.
>> +    // Returns true if the chunk was already present, false otherwise.
>> +    fn insert(&mut self, digest: [u8; 32]) -> bool {
>> +        self.list.insert(digest, true).is_some()
>> +    }
>> +}
>> -- 
>> 2.39.5
>>
>>
>>
>> _______________________________________________
>> pbs-devel mailing list
>> pbs-devel@lists.proxmox.com
>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
>>
>>
>>
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 



_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel

  reply	other threads:[~2025-03-17 15:39 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10 11:16 [pbs-devel] [PATCH v2 proxmox-backup 0/4] GC: avoid multiple atime updates Christian Ebner
2025-03-10 11:16 ` [pbs-devel] [PATCH v2 proxmox-backup 1/4] datastore: restrict datastores list_images method scope to module Christian Ebner
2025-03-17 15:00   ` [pbs-devel] applied: " Fabian Grünbichler
2025-03-10 11:16 ` [pbs-devel] [PATCH v2 proxmox-backup 2/4] datastore: add helper method to open index reader from path Christian Ebner
2025-03-17 14:59   ` Fabian Grünbichler
2025-03-17 15:41     ` Christian Ebner
2025-03-10 11:16 ` [pbs-devel] [PATCH v2 proxmox-backup 3/4] garbage collection: allow to keep track of already touched chunks Christian Ebner
2025-03-17 14:55   ` Fabian Grünbichler
2025-03-17 15:39     ` Christian Ebner [this message]
2025-03-10 11:16 ` [pbs-devel] [PATCH v2 proxmox-backup 4/4] fix #5331: garbage collection: avoid multiple chunk atime updates Christian Ebner
2025-03-10 11:40   ` Christian Ebner
2025-03-17 14:55   ` Fabian Grünbichler
2025-03-17 15:43     ` Christian Ebner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=28ba79b3-3d84-4f95-ac73-3e700766ef15@proxmox.com \
    --to=c.ebner@proxmox.com \
    --cc=f.gruenbichler@proxmox.com \
    --cc=pbs-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal