all lists on lists.proxmox.com
 help / color / mirror / Atom feed
From: Hannes Laimer <h.laimer@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v8 05/13] datastore: add move journal for coordinating with gc phase 1
Date: Wed, 22 Apr 2026 15:39:43 +0200	[thread overview]
Message-ID: <20260422133951.192862-6-h.laimer@proxmox.com> (raw)
In-Reply-To: <20260422133951.192862-1-h.laimer@proxmox.com>

Close a data-loss race between move_group/move_namespace and a
concurrent gc phase-1 mark: if a snapshot is moved between
list_index_files() and the hierarchy iteration, and the target namespace
happens to be iterated before the source, neither pass sees the moved
index and phase 2 could sweep its chunks.

Add a per-datastore write-ahead journal at
/run/proxmox-backup/locks/<datastore>/move-journal that GC drains at the
end of phase-1 mark. No caller yet writes to it, the following commits
will use this.

See the `move_journal` module doc for the protocol and the invariant
that makes it correct.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
 pbs-datastore/src/datastore.rs    |  18 ++++
 pbs-datastore/src/lib.rs          |   1 +
 pbs-datastore/src/move_journal.rs | 149 ++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 pbs-datastore/src/move_journal.rs

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index bce0d846..319b4cba 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1800,6 +1800,24 @@ impl DataStore {
             warn!("Found {strange_paths_count} index files outside of expected directory scheme");
         }
 
+        // Drain the move journal under an exclusive flock. Any index file whose path
+        // was recorded before its rename is processed here even if the namespace
+        // iteration missed it at both its old and new locations. See `move_journal`
+        // for details.
+        crate::move_journal::drain_move_journal(self.name(), |path| {
+            let Some(index) = self.open_index_reader(path)? else {
+                return Ok(());
+            };
+            self.index_mark_used_chunks(
+                index,
+                path,
+                &mut chunk_lru_cache,
+                status,
+                worker,
+                s3_client.as_ref().cloned(),
+            )
+        })?;
+
         Ok(())
     }
 
diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 4d8ac505..6647ee2b 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -192,6 +192,7 @@ pub mod data_blob_reader;
 pub mod file_formats;
 pub mod index;
 pub mod manifest;
+pub mod move_journal;
 pub mod paperkey;
 pub mod prune;
 pub mod read_chunk;
diff --git a/pbs-datastore/src/move_journal.rs b/pbs-datastore/src/move_journal.rs
new file mode 100644
index 00000000..891644d7
--- /dev/null
+++ b/pbs-datastore/src/move_journal.rs
@@ -0,0 +1,149 @@
+//! Per-datastore journal used to coordinate snapshot renames with a concurrent garbage collection
+//! phase-1 mark.
+//!
+//! # Race fixed
+//!
+//! GC phase 1 first calls `list_index_files()` to snapshot the set of absolute index-file paths in
+//! the datastore, then iterates namespaces live and touches the atime of every referenced chunk.
+//! If a `move_group`/`move_namespace` relocates a snapshot between those two steps, and the target
+//! namespace is visited by GC before the source (`readdir(2)` order, not deterministic), the moved
+//! index is at neither location when GC looks: missing from the target (already iterated) and
+//! missing from the source (iterated after the rename). Its old path lands in the leftover
+//! `unprocessed_index_list` and is discarded as a vanished file. Chunks referenced only by that
+//! index never get their atime bumped and phase 2 sweeps them.
+//!
+//! # Protocol
+//!
+//! Write-ahead journal for renames:
+//!
+//! - **Before** renaming a snapshot, the move records the new path of each index file it is about
+//!   to create, under a brief exclusive flock.
+//! - At the end of phase-1 mark, GC acquires the same exclusive flock, reads every recorded path,
+//!   runs the normal `index_mark_used_chunks` on each, truncates, and releases before entering phase 2.
+//!
+//! Why write-before-rename rather than write-after-rename with a long-held shared lock by each
+//! mover: the invariant is "if the new path exists, a journal entry for it exists too". So the
+//! drain - which runs only after iteration finishes - is guaranteed to catch anything iteration
+//! missed:
+//!
+//! - If the source-ns iteration found the index at the old path, its chunks are already marked.
+//!   The journal entry is then either a redundant re-mark (rename completed before the drain, LRU
+//!   dedups it) or a no-op skip (rename not yet, `open_index_reader` returns `None`) - harmless
+//!   either way.
+//! - If the source-ns iteration missed it, then the rename already happened by the time iteration
+//!   reached source, which is before the drain, so `open_index_reader(new_path)` at drain time
+//!   succeeds and marks the chunks.
+//!
+//! A move that crashes between the journal write and the rename leaves a "ghost" entry. The
+//! drain's `open_index_reader` returns `None` and skips, and the truncate step clears it. This is
+//! handled by the existing vanished-file logic in the caller.
+//!
+//! The file lives under `/run/proxmox-backup/locks/<datastore>/move-journal`. Tmpfs is correct
+//! here: a reboot aborts any in-progress GC, and the next GC rebuilds state from a fresh
+//! `list_index_files()` against the post-move filesystem - there is nothing worth persisting.
+
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::time::Duration;
+
+use anyhow::{bail, format_err, Context, Error};
+use nix::sys::stat::Mode;
+
+use proxmox_sys::fs::{open_file_locked, CreateOptions};
+
+use pbs_config::backup_user;
+
+use crate::backup_info::DATASTORE_LOCKS_DIR;
+
+const JOURNAL_FILENAME: &str = "move-journal";
+const APPEND_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+// Long enough to cover any in-flight append, if it takes longer than this something is very wrong
+// and we'd rather fail GC than hang forever.
+const DRAIN_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+
+fn journal_path(datastore_name: &str) -> PathBuf {
+    Path::new(DATASTORE_LOCKS_DIR)
+        .join(datastore_name)
+        .join(JOURNAL_FILENAME)
+}
+
+fn ensure_parent(path: &Path) -> Result<(), Error> {
+    if let Some(parent) = path.parent() {
+        std::fs::create_dir_all(parent)
+            .with_context(|| format!("failed to create move-journal parent dir {parent:?}"))?;
+    }
+    Ok(())
+}
+
+fn open_locked_exclusive(path: &Path, timeout: Duration) -> Result<File, Error> {
+    ensure_parent(path)?;
+    let user = backup_user()?;
+    let options = CreateOptions::new()
+        .perm(Mode::from_bits_truncate(0o660))
+        .owner(user.uid)
+        .group(user.gid);
+    open_file_locked(path, timeout, true, options)
+        .with_context(|| format!("failed to acquire exclusive move-journal lock at {path:?}"))
+}
+
+/// Append one or more absolute index-file paths to the journal under a brief exclusive flock. The
+/// caller passes the *post-rename* paths, the rename must happen after this returns.
+pub fn append_moved_indices(datastore_name: &str, paths: &[PathBuf]) -> Result<(), Error> {
+    if paths.is_empty() {
+        return Ok(());
+    }
+
+    let mut buf = Vec::new();
+    for path in paths {
+        if !path.is_absolute() {
+            bail!("move-journal: refusing to record non-absolute path {path:?}");
+        }
+        let s = path
+            .to_str()
+            .ok_or_else(|| format_err!("move-journal: non-UTF-8 path {path:?}"))?;
+        if s.as_bytes().contains(&b'\n') {
+            bail!("move-journal: path contains newline {path:?}");
+        }
+        buf.extend_from_slice(s.as_bytes());
+        buf.push(b'\n');
+    }
+
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, APPEND_LOCK_TIMEOUT)?;
+    file.write_all(&buf)
+        .context("failed to append to move journal")?;
+    Ok(())
+}
+
+/// Drain the journal under an exclusive lock, calling `f` for each recorded path. Blocks only for
+/// the brief window of a concurrent append. After the callback runs on every entry, the journal is
+/// truncated under the same lock.
+///
+/// On a processing error the entry is left in the journal (no truncate) so the next GC will retry.
+pub fn drain_move_journal<F>(datastore_name: &str, mut f: F) -> Result<(), Error>
+where
+    F: FnMut(&Path) -> Result<(), Error>,
+{
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, DRAIN_LOCK_TIMEOUT)?;
+
+    file.seek(SeekFrom::Start(0))
+        .context("failed to rewind move journal for draining")?;
+    let mut contents = String::new();
+    file.read_to_string(&mut contents)
+        .context("failed to read move journal")?;
+
+    for line in contents.lines() {
+        let entry = line.trim();
+        if entry.is_empty() {
+            continue;
+        }
+        f(Path::new(entry))
+            .with_context(|| format!("move-journal: processing '{entry}' failed"))?;
+    }
+
+    file.set_len(0)
+        .context("failed to truncate move journal after drain")?;
+    Ok(())
+}
-- 
2.47.3





  parent reply	other threads:[~2026-04-22 13:40 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-22 13:39 [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 01/13] ui: show empty groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 02/13] datastore: lift check_namespace_depth_limit to pbs-datastore Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 03/13] datastore: have BackupGroup::destroy consume the group lock Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 04/13] datastore: split remove_namespace into flat and recursive variants Hannes Laimer
2026-04-22 13:39 ` Hannes Laimer [this message]
2026-04-22 13:39 ` [PATCH proxmox-backup v8 06/13] datastore: add move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 07/13] datastore: add move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 08/13] docs: add section on moving namespaces and groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 09/13] api: add POST endpoint for move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 10/13] api: add POST endpoint for move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 11/13] ui: add move group action Hannes Laimer
2026-04-23 13:35   ` Michael Köppl
2026-04-23 13:47     ` Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 12/13] ui: add move namespace action Hannes Laimer
2026-04-23 14:49   ` Michael Köppl
2026-04-22 13:39 ` [PATCH proxmox-backup v8 13/13] cli: add move-namespace and move-group commands Hannes Laimer
2026-04-23 16:29 ` [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Michael Köppl
2026-04-23 22:38 ` applied: " Thomas Lamprecht
2026-04-24  8:31   ` Fabian Grünbichler
2026-04-24  8:43     ` Hannes Laimer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260422133951.192862-6-h.laimer@proxmox.com \
    --to=h.laimer@proxmox.com \
    --cc=pbs-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal