Re: [pbs-devel] [PATCH v3 proxmox-backup 1/5] fix #3847: datastore: support writing fidx files of unknown size

From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH v3 proxmox-backup 1/5] fix #3847: datastore: support writing fidx files of unknown size
Date: Fri, 16 Jan 2026 15:28:13 +0100	[thread overview]
Message-ID: <176857369362.137827.9885953475732630098@yuna.proxmox.com> (raw)
In-Reply-To: <845335b1-78b7-4710-805a-d2ff464ceba0@proxmox.com>

Quoting Robert Obkircher (2026-01-16 13:42:36)
> 
> On 1/14/26 14:13, Fabian Grünbichler wrote:
> > On January 9, 2026 6:35 pm, Robert Obkircher wrote:
> >> Use mremap and ftruncate to support growable FixedIndexWriters. Grow
> >> exponentially from a small initial index size for efficiency. Truncate
> >> excessive capacity after encountering a non-full block or on close.
> >>
> >> Signed-off-by: Robert Obkircher<r.obkircher@proxmox.com>
> >> ---
> >>   pbs-datastore/src/datastore.rs   |   2 +-
> >>   pbs-datastore/src/fixed_index.rs | 120 +++++++++++++++++++++++++++++--
> >>   2 files changed, 117 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> >> index 9c57aaac..af712726 100644
> >> --- a/pbs-datastore/src/datastore.rs
> >> +++ b/pbs-datastore/src/datastore.rs
> >> @@ -591,7 +591,7 @@ impl DataStore {
> >>       pub fn create_fixed_writer<P: AsRef<Path>>(
> >>           &self,
> >>           filename: P,
> >> -        size: usize,
> >> +        size: Option<usize>,
> >>           chunk_size: usize,
> >>       ) -> Result<FixedIndexWriter, Error> {
> >>           let index = FixedIndexWriter::create(
> >> diff --git a/pbs-datastore/src/fixed_index.rs b/pbs-datastore/src/fixed_index.rs
> >> index 6c3be2d4..8036a519 100644
> >> --- a/pbs-datastore/src/fixed_index.rs
> >> +++ b/pbs-datastore/src/fixed_index.rs
> >> @@ -219,9 +219,12 @@ pub struct FixedIndexWriter {
> >>       chunk_size: usize,
> >>       size: usize,
> >>       index_length: usize,
> >> +    index_capacity: usize,
> > chunk_size and size effectively already give you index_length (as in,
> > how many "slots" are allowed to be used), so index length and index
> > capacity can be a single field. if you want to rename it to signify that
> > it might now be bigger than ceil(size/chunk_size), that's fine, but I
> > don't think we need to duplicate it and complicate the code below..
> By that logic (chunk_size, size, growable_size) also give me the capacity,
> so both fields could be removed.

that's true :)

> I kept them as separate fields because that is what the original code and
> the Reader do, supposedly to avoid the division for the bounds check.

we effectively do that bounds check twice though - once for size (in
check_chunk_alignment) and again for index_length (which is derived 1:1 from
size and chunk_size) in add_digest.. and the backup env calls
check_chunk_alignment right before calling add_digest. when copying from a
reader directly to the writer that does not happen, but in that case we already
checked that the index and chunk sizes match before we start adding the first
digest, so the check in add_digest is redundant there as well..

I now wonder we could combine check_chunk_alignment and grow_to_size, since
they handle quite similar aspects and both only have a single call site outside
of tests, which is right next to eachother.. or (see below)

> 
> This also makes it much more explicit what is going on and since the
> index_length has to be computed anyway while resizing, I don't see how
> storing that value for later is more complicated than recomputing it.
> 
> >>       index: *mut u8,
> >>       pub uuid: [u8; 16],
> >>       pub ctime: i64,
> >> +    growable_size: bool,
> >> +    write_size_on_close: bool,
> >>   }
> >>   
> >>   // `index` is mmap()ed which cannot be thread-local so should be sendable
> >> @@ -237,12 +240,18 @@ impl Drop for FixedIndexWriter {
> >>   }
> >>   
> >>   impl FixedIndexWriter {
> >> +    /// The initial capacity, if the total size is unknown.
> >> +    ///
> >> +    /// This capacity takes up the same amount of space as the header
> >> +    /// and can refer to 128 Blocks * 4 MiB/Block = 512 MiB of content.
> >> +    const INITIAL_CAPACITY: usize = 4096 / 32;
> > might make sense to make this more explicit - we can only map using page
> > granularity, the header is one page, so the first part needs to be a
> > page as well, and if then always double it when resizing we stay aligned
> > to page boundaries.
> 
> There is actually no strict requirement for the mapped length to be
> a multiple of the page size, so I didn't want to guarantee too much.
> However, the offset into the file must be aligned, so a page size
> larger than the header would break anyway.

you are right, sorry!

> I think it would make sense to just map the file from the start, because
> that would simplify my other fix [1] and the size/checksum update in
> close().

given that we already had a report about PBS not working on 16k page kernels,
and there are other platforms with 64k pages, it might be nice to fix this up
if it doesn't cause too much annoyance.

IIRC besides the index code, only our shared memory helper (for config caching
and rate limiting) use mmap. in those cases the mapped data is padded to a
single 4k page atm, and when mapping we round up to a page as well, although
the helper for that has a FIXME calling for it to actually query the page size
instead of assuming it's 4k ;)

so yeah, those all sound like they could be fixed up without too much effort
and without much additional cost (for the index access, it boils down to one
extra pointer addition for the header-to-payload-offset?).

> 
> [1] 
> https://lore.proxmox.com/pbs-devel/20260109175945.309913-3-r.obkircher@proxmox.com/
> 
> >> +
> >>       #[allow(clippy::cast_ptr_alignment)]
> >>       // Requires obtaining a shared chunk store lock beforehand
> >>       pub fn create(
> >>           store: Arc<ChunkStore>,
> >>           path: &Path,
> >> -        size: usize,
> >> +        known_size: Option<usize>,
> >>           chunk_size: usize,
> >>       ) -> Result<Self, Error> {
> >>           let full_path = store.relative_path(path);
> >> @@ -264,6 +273,7 @@ impl FixedIndexWriter {
> >>           }
> >>   
> >>           let ctime = proxmox_time::epoch_i64();
> >> +        let size = known_size.unwrap_or(0);
> >>   
> >>           let uuid = Uuid::generate();
> >>   
> >> @@ -280,8 +290,12 @@ impl FixedIndexWriter {
> >>   
> >>           file.write_all(&buffer)?;
> >>   
> >> -        let index_length = size.div_ceil(chunk_size);
> >> -        let index_size = index_length * 32;
> >> +        let (index_length, index_capacity) = known_size
> >> +            .map(|s| s.div_ceil(chunk_size))
> >> +            .map(|len| (len, len))
> >> +            .unwrap_or((0, Self::INITIAL_CAPACITY));
> >> +
> >> +        let index_size = index_capacity * 32;
> >>           nix::unistd::ftruncate(&file, (header_size + index_size) as i64)?;
> >>   
> >>           let data = unsafe {
> >> @@ -305,12 +319,87 @@ impl FixedIndexWriter {
> >>               chunk_size,
> >>               size,
> >>               index_length,
> >> +            index_capacity,
> >>               index: data,
> >>               ctime,
> >>               uuid: *uuid.as_bytes(),
> >> +            growable_size: known_size.is_none(),
> >> +            write_size_on_close: known_size.is_none(),
> >> +        })
> >> +    }
> >> +
> >> +    /// If this returns an error, the sizes may be out of sync,
> >> +    /// which is especially bad if the capacity was reduced.
> >> +    fn set_index_capacity(&mut self, new_capacity: usize) -> Result<(), Error> {
> >> +        if new_capacity == self.index_capacity {
> >> +            return Ok(());
> >> +        }
> >> +        let old_index_size = self.index_capacity * 32;
> >> +        let new_index_size = new_capacity * 32;
> >> +        let new_file_size = (size_of::<FixedIndexHeader>() + new_index_size) as i64;
> >> +
> >> +        let index_addr = NonNull::new(self.index as *mut std::ffi::c_void).ok_or_else(|| {
> >> +            format_err!("Can't resize FixedIndexWriter index because the index pointer is null.")
> >> +        })?;
> >> +
> >> +        nix::unistd::ftruncate(&self.file, new_file_size)?;
> >> +
> >> +        let new_index = unsafe {
> >> +            nix::sys::mman::mremap(
> >> +                index_addr,
> >> +                old_index_size,
> >> +                new_index_size,
> >> +                nix::sys::mman::MRemapFlags::MREMAP_MAYMOVE,
> >> +                None,
> >> +            )
> >> +        }?;
> >> +
> >> +        self.index = new_index.as_ptr().cast::<u8>();
> >> +        self.index_capacity = new_capacity;
> >> +        Ok(())
> >> +    }
> >> +
> >> +    /// Unmapping ensures future add and close operations fail.
> >> +    fn set_index_capacity_or_unmap(&mut self, new_capacity: usize) -> Result<(), Error> {
> >> +        self.set_index_capacity(new_capacity).map_err(|e| {
> >> +            let unmap_result = self.unmap();
> >> +            let message = format!(
> >> +                "failed to resize index capacity from {} to {new_capacity} with backing file: {:?}",
> >> +                self.index_capacity, self.tmp_filename
> >> +            );
> >> +            assert!(self.index.is_null(), "{message} {unmap_result:?}");
> >> +            e.context(message)
> >>           })
> >>       }
> >>   
> >> +    /// Increase the content size to be at least `requested_size` and
> >> +    /// ensure there is enough capacity.
> >> +    ///
> >> +    /// Only writers that were created without a known size can grow.
> >> +    /// The size also becomes fixed as soon as it is no longer divisible
> >> +    /// by the block size, to ensure that only the last block can be
> >> +    /// smaller.
> >> +    pub fn grow_to_size(&mut self, requested_size: usize) -> Result<(), Error> {
> >> +        if self.size < requested_size {
> >> +            if !self.growable_size {
> >> +                bail!("refusing to resize from {} to {requested_size}", self.size);
> >> +            }
> >> +            let new_len = requested_size.div_ceil(self.chunk_size);
> >> +            if new_len * self.chunk_size != requested_size {
> >> +                // not a full chunk, so this must be the last one
> >> +                self.growable_size = false;
> >> +                self.set_index_capacity_or_unmap(new_len)?;
> >> +            } else if new_len > self.index_capacity {
> >> +                self.set_index_capacity_or_unmap(new_len.next_power_of_two())?;
> >> +            };
> >> +            assert!(new_len <= self.index_capacity);
> >> +            self.index_length = new_len;
> >> +            self.size = requested_size;
> >> +        }
> > should we handle the else part here? i.e., error out if shrinking is
> > requested?
> 
> The else part is expected to succeed. I considered renaming the method to
> something like grow_if_smaller or ensure_capacity to make this clear, but
> I'm not quite happy with those names either.

we could
- call it resize, move the size check to the (only) call site, and allow
  shrinking
- call it reserve, make it only handle capacity, and move the size update to a
  (renamed) check_chunk_alignment
- combine grow_to_size and check_chunk_alignment and find a fitting name for
  what it does :-P
- keep it as is, but find a better name

_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel