From: Christian Ebner <c.ebner@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [pbs-devel] [RFC proxmox-backup 0/4] fix #5799: Gather per-namespace/group/snapshot storage usage stats
Date: Mon, 19 Jan 2026 14:27:03 +0100
Message-ID: <20260119132707.686523-1-c.ebner@proxmox.com>

Disclaimer: These patches are still in a development state and are
sent as an RFC to discuss implementation details, especially with
respect to the acceptable memory footprint and performance limitations.
As is, they are not intended for production use yet.

Issue #5799 requested gathering and caching information about the raw
storage size of uniquely referenced chunks and the deduplication
factor for backup groups, with the intent to provide better
introspection for storage optimization by allowing specific backup
groups/snapshots to be pruned based on this additional information.

These patches draft an approach to generate such statistics during
garbage collection, by collecting chunk to namespace/group/snapshot
relations and providing an in-memory reverse mapping from chunk
digests to the namespaces/groups/snapshots referencing a given chunk.
This reverse mapping would further allow, for example, marking
snapshots as invalid if referenced chunks are missing.

During phase 1, snapshots referencing chunk digests are stored in a
lookup table. The actual namespace, group and snapshot data is stored
in dedicated indexes, only referenced by the respective key in the
lookup table, with the intent to keep the slot size predictable and
small for better allocation.
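
To illustrate the layout described above, here is a minimal sketch of a
reverse lookup table whose slots hold only small integer keys, with the
snapshot paths stored once in a separate index. All names
(`SnapshotIndex`, `ReverseDigestMap`, `intern`, `referencing`) and the
`u32` key width are assumptions made for this example, not the types
introduced by the patches.

```rust
use std::collections::HashMap;

/// Compact key into the snapshot index (assumed width, for illustration).
type SnapshotKey = u32;

/// Hypothetical index storing each snapshot path once and handing out keys.
#[derive(Default)]
struct SnapshotIndex {
    paths: Vec<String>,
    keys: HashMap<String, SnapshotKey>,
}

impl SnapshotIndex {
    /// Return the key for `path`, inserting the path on first use.
    fn intern(&mut self, path: &str) -> SnapshotKey {
        if let Some(&key) = self.keys.get(path) {
            return key;
        }
        let key = self.paths.len() as SnapshotKey;
        self.paths.push(path.to_string());
        self.keys.insert(path.to_string(), key);
        key
    }

    /// Map a key back to the snapshot path it was created for.
    fn resolve(&self, key: SnapshotKey) -> Option<&str> {
        self.paths.get(key as usize).map(String::as_str)
    }
}

/// Reverse mapping from chunk digest to the keys of referencing snapshots.
#[derive(Default)]
struct ReverseDigestMap {
    snapshots: SnapshotIndex,
    digests: HashMap<[u8; 32], Vec<SnapshotKey>>,
}

impl ReverseDigestMap {
    /// Record that `snapshot` references `digest` (phase 1).
    fn insert(&mut self, digest: [u8; 32], snapshot: &str) {
        let key = self.snapshots.intern(snapshot);
        let refs = self.digests.entry(digest).or_default();
        // Store each snapshot at most once per digest.
        if !refs.contains(&key) {
            refs.push(key);
        }
    }

    /// List the snapshots referencing `digest`, e.g. to flag them when the
    /// chunk turns out to be missing.
    fn referencing(&self, digest: &[u8; 32]) -> Vec<&str> {
        self.digests
            .get(digest)
            .into_iter()
            .flatten()
            .filter_map(|&key| self.snapshots.resolve(key))
            .collect()
    }
}
```

Keeping only fixed-size keys in the per-digest slots means the size of
a slot does not depend on the length of namespace or snapshot names.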

During phase 2, the raw chunk size is collected while iterating over
chunk files.

Finally, the statistics are gathered by accumulating the counts of
each chunk digest for each namespace/group/snapshot, taking
advantage of the lookup map.
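
The accumulation step could look roughly like the following sketch. It
assumes two inputs that are not spelled out in this cover letter: a map
from digest to referencing snapshot keys (phase 1) and a map from
digest to raw on-disk size (phase 2). The `UsageStats` fields and the
`accumulate` function are hypothetical and only show the principle.

```rust
use std::collections::HashMap;

/// Per-snapshot totals (illustrative fields, not the patch's data model).
#[derive(Debug, Default, PartialEq)]
struct UsageStats {
    /// Raw size of all chunks the snapshot references.
    referenced_bytes: u64,
    /// Raw size of chunks referenced by this snapshot only.
    unique_bytes: u64,
}

/// Combine phase 1 references (`refs`) with phase 2 chunk sizes (`sizes`).
/// Assumes each snapshot key appears at most once per digest.
fn accumulate(
    refs: &HashMap<[u8; 32], Vec<u32>>,
    sizes: &HashMap<[u8; 32], u64>,
) -> HashMap<u32, UsageStats> {
    let mut stats: HashMap<u32, UsageStats> = HashMap::new();
    for (digest, snapshots) in refs {
        // A digest without a size was not seen on disk in phase 2; skip it.
        let Some(&size) = sizes.get(digest) else { continue };
        for &key in snapshots {
            let entry = stats.entry(key).or_default();
            entry.referenced_bytes += size;
            // A chunk counts as unique if exactly one snapshot references it.
            if snapshots.len() == 1 {
                entry.unique_bytes += size;
            }
        }
    }
    stats
}
```

The same loop works for groups or namespaces if the keys identify those
instead of snapshots.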

Currently, the information is being gathered unconditionally and
logged to the garbage collection task log, but it is planned to make
this opt-in and store gathered data on the namespace/group/snapshot
level, to e.g. be shown on the datastore content listings or a
dedicated content listing.

The following differences in RSS max values were observed via
`watch -n 1 "ps -p $(pidof proxmox-backup-proxy) -o rss | tail -n 1 | tee -a ps-rss.out"`
and compared to the initial RSS values after a service restart (with
the GC LRU cache disabled by setting it to 0) on two datastores:

| Delta RSS (MiB) | index files | chunk count | deduplication factor |
----------------------------------------------------------------------
|         412.355 |        1125 |      982236 |                14.69 |
|         168.414 |         213 |      598312 |                 5.93 |
----------------------------------------------------------------------
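
As a rough aid for judging these numbers, the RSS delta can be
normalized by chunk count. The helper below is a hypothetical
convenience for this discussion only. The result is an average that
also includes the namespace/group/snapshot indexes and allocator
overhead, so it is not the size of a single lookup table slot.

```rust
/// Average additional resident memory per chunk, in bytes.
fn bytes_per_chunk(delta_rss_mib: f64, chunk_count: u64) -> f64 {
    delta_rss_mib * 1024.0 * 1024.0 / chunk_count as f64
}
```

For the two datastores above this gives roughly 440 bytes per chunk
(`bytes_per_chunk(412.355, 982236)`) and roughly 295 bytes per chunk
(`bytes_per_chunk(168.414, 598312)`).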

Open questions and ideas to discuss:
- Is the observed memory requirement acceptable if provided as an
  opt-in feature? Are there other ideas to further reduce the memory
  footprint? I was pondering an indirection mapping that groups
  digests by common prefix and only stores the individual suffixes.
  This, however, only scales better when there is no need to store it
  as a hashmap, so it is not really suitable due to diminished lookup
  performance.
- Conditionally replace the GC LRU cache with the lookup map if this
  feature is enabled. The digests need to be stored anyway, so it
  would make sense to use the map to avoid multiple chunk atime
  updates instead.
- Add a dedicated tab to show the contents independent from the
  current datastore contents? This would reduce the risk of
  misinterpretation as this is no real-time data.
- Add this as a dedicated task instead of combining it with garbage
  collection? This would allow gathering information on specific
  sub-namespaces, groups or selected snapshots only.
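
For the prefix grouping idea from the first question, one possible
reading is sketched below: digests are bucketed by their first two
bytes and only the 30-byte suffixes are stored, sorted, per bucket.
The prefix length, the `PrefixGroupedDigests` type and its methods are
assumptions for this example and are not part of the patches.

```rust
/// Number of leading digest bytes used as bucket index (assumed value).
const PREFIX_LEN: usize = 2;
type Suffix = [u8; 32 - PREFIX_LEN];

/// Digest set bucketed by prefix, storing only sorted suffixes per bucket.
struct PrefixGroupedDigests {
    buckets: Vec<Vec<Suffix>>,
}

impl PrefixGroupedDigests {
    fn new() -> Self {
        // One bucket per possible prefix value: 2^16 for a 2-byte prefix.
        Self { buckets: vec![Vec::new(); 1 << (8 * PREFIX_LEN)] }
    }

    /// Split a digest into its bucket index and the remaining suffix.
    fn split(digest: &[u8; 32]) -> (usize, Suffix) {
        let bucket = usize::from(u16::from_be_bytes([digest[0], digest[1]]));
        let mut suffix = [0u8; 32 - PREFIX_LEN];
        suffix.copy_from_slice(&digest[PREFIX_LEN..]);
        (bucket, suffix)
    }

    /// Insert a digest; returns false if it was already present.
    fn insert(&mut self, digest: &[u8; 32]) -> bool {
        let (bucket, suffix) = Self::split(digest);
        let group = &mut self.buckets[bucket];
        match group.binary_search(&suffix) {
            Ok(_) => false,
            Err(pos) => {
                // Insert at the sorted position so binary search keeps working.
                group.insert(pos, suffix);
                true
            }
        }
    }

    /// Check for a digest with a binary search inside its bucket.
    fn contains(&self, digest: &[u8; 32]) -> bool {
        let (bucket, suffix) = Self::split(digest);
        self.buckets[bucket].binary_search(&suffix).is_ok()
    }
}
```

This saves only the prefix bytes per digest, adds a fixed cost for the
bucket vector, and replaces the hash lookup with a binary search plus
shifting on insert, which is the lookup performance trade-off mentioned
above.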

Link to the bugtracker issue:
https://bugzilla.proxmox.com/show_bug.cgi?id=5799

proxmox-backup:

Christian Ebner (4):
  chunk store: restrict chunk sweep helper method to module parent
  datastore: add namespace/group/snapshot indices for reverse lookups
  datastore: introduce reverse chunk digest lookup table
  fix #5799: GC: track chunk digests and accumulate statistics

 pbs-datastore/src/chunk_store.rs        |  11 +-
 pbs-datastore/src/datastore.rs          |  46 +++-
 pbs-datastore/src/lib.rs                |   1 +
 pbs-datastore/src/reverse_digest_map.rs | 349 ++++++++++++++++++++++++
 4 files changed, 404 insertions(+), 3 deletions(-)
 create mode 100644 pbs-datastore/src/reverse_digest_map.rs


Summary over all repositories:
  4 files changed, 404 insertions(+), 3 deletions(-)

-- 
Generated by git-murpp 0.8.1


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


Thread overview: 5+ messages
2026-01-19 13:27 Christian Ebner [this message]
2026-01-19 13:27 ` [pbs-devel] [PATCH proxmox-backup 1/4] chunk store: restrict chunk sweep helper method to module parent Christian Ebner
2026-01-19 13:27 ` [pbs-devel] [PATCH proxmox-backup 2/4] datastore: add namespace/group/snapshot indices for reverse lookups Christian Ebner
2026-01-19 13:27 ` [pbs-devel] [PATCH proxmox-backup 3/4] datastore: introduce reverse chunk digest lookup table Christian Ebner
2026-01-19 13:27 ` [pbs-devel] [PATCH proxmox-backup 4/4] fix #5799: GC: track chunk digests and accumulate statistics Christian Ebner
