all lists on lists.proxmox.com
 help / color / mirror / Atom feed
From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 3/5] garbage collection: add structure for optimized image iteration
Date: Wed, 05 Mar 2025 14:47:36 +0100	[thread overview]
Message-ID: <1741177336.gx3p7pr4h5.astroid@yuna.none> (raw)
In-Reply-To: <20250221140110.377328-4-c.ebner@proxmox.com>

On February 21, 2025 3:01 pm, Christian Ebner wrote:
> Implements the GroupedImageList struct and methods, which groups
> index files (image) paths by a hierarchy for optimized iteration
> during phase 1 of garbage collection.
> 
> Currently, phase 1 of garbage collection iterates over all folders in
> the datastore, without considering any logical organization. This is
> to avoid missing image indices which might have unexpected paths,
> thereby deleting chunks which are still in use by these indices in GC
> phase 2.
> 
> The new structure helps to iterate over the index files in a more
> logical way, without missing strange paths. The hierarchical
> organization helps to avoid touching shared chunks of incremental
> snapshot backups in a backup group multiple times, by allowing
> tracking of these without excessive memory requirements.
> 
> Since deduplication happens on a per image basis for subsequent
> snapshots, the hierarchy is chosen as follows:
> - ns/group
> - image filename
> - snapshot timestamp
> 
> This allows to iterate over consecutive snapshots for the same images
> in the same backup namespace and group.
> 
> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
> ---
>  pbs-datastore/src/datastore.rs | 63 ++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index eda78193d..520f54548 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -1,4 +1,5 @@
>  use std::collections::{HashMap, HashSet};
> +use std::ffi::OsString;
>  use std::io::{self, Write};
>  use std::os::unix::ffi::OsStrExt;
>  use std::os::unix::io::AsRawFd;
> @@ -1573,3 +1574,65 @@ impl DataStore {
>          Ok(())
>      }
>  }
> +
> +struct GroupedImageList {
> +    groups: HashMap<String, HashMap<OsString, Vec<(i64, PathBuf)>>>,

this seems very complicated, couldn't we make it a lot simpler by doing

key: NS + Group (as tuple, using the actual types)
value: (snapshot => index paths)

or even a nested mapping of

NS -> Group -> Snapshot -> Images

and then simply reset when proceeding from one snapshot to the next? the
scope of "in-memory chunk digests" would still be bounded (in a way that
is in line with how we do it in many other parts, like when doing backup
+ restore), and the structure of this helper entity and the way it is
iterated over would feel much more natural.

we could go one step further, but that might eat some of our performance
gains here

- list all images like we did before and store the result in a
  collection that allows fast removal
- iterate normally over the datastore in a structured fashion using the
  existing helpers
-- when proceeding from one snapshot to the next, use the new "don't
retouch chunks" logic
-- remove each visited image path from the list of images
- if any images are left at the end in that list, mark those manually
  (strange or vanished paths, should hopefully be empty or irrelevant)

the main downside is that we'd have to iterate twice (well, not quite
twice, since the hierarchical iterators skip parts outside of the
"known" hierarchy), but we would save all this custom parse-back logic
here. if we keep the parse-back logic, then I think we want to have a
logical structure as well that follows how we normally do things.

I think that the list_images part of GC is by far the least expensive
though (for my test datastore where the GC runtime goes down from 52 to
17s with your patch series, listing images is 12ms of that ;)), so
effectively doing it twice might not hurt that much in practice..

> +    strange_path_images: Vec<PathBuf>,
> +}
> +
> +impl GroupedImageList {
> +    fn new() -> Self {
> +        Self {
> +            groups: HashMap::new(),
> +            strange_path_images: Vec::new(),
> +        }
> +    }
> +
> +    fn insert(&mut self, img: &Path, base_path: &Path) -> Result<(), Error> {
> +        let img = img.to_path_buf();
> +
> +        if let Some(backup_dir_path) = img.parent() {
> +            let backup_dir_path = backup_dir_path.strip_prefix(base_path)?;
> +
> +            if let Some(backup_dir_str) = backup_dir_path.to_str() {
> +                if let Ok((namespace, backup_dir)) =
> +                    pbs_api_types::parse_ns_and_snapshot(backup_dir_str)
> +                {
> +                    if let Some(filename) = img.file_name() {
> +                        let filename = filename.to_os_string();
> +                        let group_key = format!("{namespace}/{group}", group = backup_dir.group);
> +
> +                        if let Some(images) = self.groups.get_mut(&group_key) {
> +                            if let Some(snapshots) = images.get_mut(&filename) {
> +                                snapshots.push((backup_dir.time, img));
> +                            } else {
> +                                let snapshots = vec![(backup_dir.time, img)];
> +                                images.insert(filename, snapshots);
> +                            }
> +                        } else {
> +                            // ns/group not present, insert new
> +                            let snapshots = vec![(backup_dir.time, img)];
> +                            let mut images = HashMap::new();
> +                            images.insert(filename, snapshots);
> +                            self.groups.insert(group_key, images);
> +                        }

this whole part could then be simplified using the Entry API, e.g. if 

    groups: HashMap<(BackupNamespace, String), HashMap<i64, Vec<PathBuf>>>,

then we can just do

                    let group_key = (namespace, backup_dir.group.to_string());

                    let snapshot_map = self.groups.entry(group_key).or_default();
                    let images = snapshot_map.entry(backup_dir.time).or_default();
                    images.push(img);

or similar if we split NS and group..

> +                        return Ok(());
> +                    }
> +                }
> +            }
> +        }
> +
> +        self.strange_path_images.push(img);
> +        Ok(())
> +    }
> +
> +    fn len(&self) -> usize {
> +        let mut count = self.strange_path_images.len();
> +        for (_group, images) in self.groups.iter() {
> +            for (_image, snapshots) in images.iter() {
> +                count += snapshots.len();
> +            }
> +        }
> +        count
> +    }
> +}
> -- 
> 2.39.5
> 
> 
> 
> _______________________________________________
> pbs-devel mailing list
> pbs-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
> 
> 
> 


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


  reply	other threads:[~2025-03-05 13:48 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-21 14:01 [pbs-devel] [PATCH proxmox-backup 0/5] GC: avoid multiple atime updates Christian Ebner
2025-02-21 14:01 ` [pbs-devel] [PATCH proxmox-backup 1/5] datastore: restrict datastores list_images method scope to module Christian Ebner
2025-02-21 14:01 ` [pbs-devel] [PATCH proxmox-backup 2/5] garbage collection: refactor archive type based chunk marking logic Christian Ebner
2025-02-21 14:01 ` [pbs-devel] [PATCH proxmox-backup 3/5] garbage collection: add structure for optimized image iteration Christian Ebner
2025-03-05 13:47   ` Fabian Grünbichler [this message]
2025-03-07  8:24     ` Christian Ebner
2025-03-07  8:53       ` Fabian Grünbichler
2025-03-07  8:59         ` Christian Ebner
2025-02-21 14:01 ` [pbs-devel] [PATCH proxmox-backup 4/5] garbage collection: allow to keep track of already touched chunks Christian Ebner
2025-02-21 14:01 ` [pbs-devel] [PATCH proxmox-backup 5/5] fix #5331: garbage collection: avoid multiple chunk atime updates Christian Ebner
2025-02-21 15:35 ` [pbs-devel] [PATCH proxmox-backup 0/5] GC: avoid multiple " Roland
2025-02-21 15:49   ` Christian Ebner
2025-02-22 17:50     ` Roland
2025-03-10 11:18 ` Christian Ebner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1741177336.gx3p7pr4h5.astroid@yuna.none \
    --to=f.gruenbichler@proxmox.com \
    --cc=pbs-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal