public inbox for pbs-devel@lists.proxmox.com
From: Hannes Laimer <h.laimer@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v8 05/13] datastore: add move journal for coordinating with gc phase 1
Date: Wed, 22 Apr 2026 15:39:43 +0200
Message-ID: <20260422133951.192862-6-h.laimer@proxmox.com>
In-Reply-To: <20260422133951.192862-1-h.laimer@proxmox.com>

Close a data-loss race between move_group/move_namespace and a
concurrent gc phase-1 mark: if a snapshot is moved between
list_index_files() and the hierarchy iteration, and the target namespace
happens to be iterated before the source, neither pass sees the moved
index and phase 2 could sweep its chunks.

Add a per-datastore write-ahead journal at
/run/proxmox-backup/locks/<datastore>/move-journal that GC drains at the
end of phase-1 mark. No caller writes to it yet; the following commits
will use it.

See the `move_journal` module doc for the protocol and the invariant
that makes it correct.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
 pbs-datastore/src/datastore.rs    |  18 ++++
 pbs-datastore/src/lib.rs          |   1 +
 pbs-datastore/src/move_journal.rs | 149 ++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 pbs-datastore/src/move_journal.rs

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index bce0d846..319b4cba 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1800,6 +1800,24 @@ impl DataStore {
             warn!("Found {strange_paths_count} index files outside of expected directory scheme");
         }
 
+        // Drain the move journal under an exclusive flock. Any index file whose path
+        // was recorded before its rename is processed here even if the namespace
+        // iteration missed it at both its old and new locations. See `move_journal`
+        // for details.
+        crate::move_journal::drain_move_journal(self.name(), |path| {
+            let Some(index) = self.open_index_reader(path)? else {
+                return Ok(());
+            };
+            self.index_mark_used_chunks(
+                index,
+                path,
+                &mut chunk_lru_cache,
+                status,
+                worker,
+                s3_client.as_ref().cloned(),
+            )
+        })?;
+
         Ok(())
     }
 
diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 4d8ac505..6647ee2b 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -192,6 +192,7 @@ pub mod data_blob_reader;
 pub mod file_formats;
 pub mod index;
 pub mod manifest;
+pub mod move_journal;
 pub mod paperkey;
 pub mod prune;
 pub mod read_chunk;
diff --git a/pbs-datastore/src/move_journal.rs b/pbs-datastore/src/move_journal.rs
new file mode 100644
index 00000000..891644d7
--- /dev/null
+++ b/pbs-datastore/src/move_journal.rs
@@ -0,0 +1,149 @@
+//! Per-datastore journal used to coordinate snapshot renames with a concurrent garbage collection
+//! phase-1 mark.
+//!
+//! # Race fixed
+//!
+//! GC phase 1 first calls `list_index_files()` to snapshot the set of absolute index-file paths in
+//! the datastore, then iterates namespaces live and touches the atime of every referenced chunk.
+//! If a `move_group`/`move_namespace` relocates a snapshot between those two steps, and the target
+//! namespace is visited by GC before the source (`readdir(2)` order, not deterministic), the moved
+//! index is at neither location when GC looks: missing from the target (already iterated) and
+//! missing from the source (iterated after the rename). Its old path lands in the leftover
+//! `unprocessed_index_list` and is discarded as a vanished file. Chunks referenced only by that
+//! index never get their atime bumped and phase 2 sweeps them.
+//!
+//! # Protocol
+//!
+//! Write-ahead journal for renames:
+//!
+//! - **Before** renaming a snapshot, the move records the new path of each index file it is about
+//!   to create, under a brief exclusive flock.
+//! - At the end of phase-1 mark, GC acquires the same exclusive flock, reads every recorded path,
+//!   runs the normal `index_mark_used_chunks` on each, truncates, and releases before entering phase 2.
+//!
+//! Why write-before-rename rather than write-after-rename with a long-held shared lock by each
+//! mover: the invariant is "if the new path exists, a journal entry for it exists too". So the
+//! drain - which runs only after iteration finishes - is guaranteed to catch anything iteration
+//! missed:
+//!
+//! - If the source-ns iteration found the index at the old path, its chunks are already marked.
+//!   The journal entry is then either a redundant re-mark (rename completed before the drain, LRU
+//!   dedups it) or a no-op skip (rename not done yet, `open_index_reader` returns `None`) - harmless
+//!   either way.
+//! - If the source-ns iteration missed it, then the rename already happened by the time iteration
+//!   reached source, which is before the drain, so `open_index_reader(new_path)` at drain time
+//!   succeeds and marks the chunks.
+//!
+//! A move that crashes between the journal write and the rename leaves a "ghost" entry. The
+//! drain's `open_index_reader` returns `None` and skips, and the truncate step clears it. This is
+//! handled by the existing vanished-file logic in the caller.
+//!
+//! The file lives under `/run/proxmox-backup/locks/<datastore>/move-journal`. Tmpfs is correct
+//! here: a reboot aborts any in-progress GC, and the next GC rebuilds state from a fresh
+//! `list_index_files()` against the post-move filesystem - there is nothing worth persisting.
+
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::time::Duration;
+
+use anyhow::{bail, format_err, Context, Error};
+use nix::sys::stat::Mode;
+
+use proxmox_sys::fs::{open_file_locked, CreateOptions};
+
+use pbs_config::backup_user;
+
+use crate::backup_info::DATASTORE_LOCKS_DIR;
+
+const JOURNAL_FILENAME: &str = "move-journal";
+const APPEND_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+// Long enough to cover any in-flight append; if it takes longer than this, something is very
+// wrong and we'd rather fail GC than hang forever.
+const DRAIN_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+
+fn journal_path(datastore_name: &str) -> PathBuf {
+    Path::new(DATASTORE_LOCKS_DIR)
+        .join(datastore_name)
+        .join(JOURNAL_FILENAME)
+}
+
+fn ensure_parent(path: &Path) -> Result<(), Error> {
+    if let Some(parent) = path.parent() {
+        std::fs::create_dir_all(parent)
+            .with_context(|| format!("failed to create move-journal parent dir {parent:?}"))?;
+    }
+    Ok(())
+}
+
+fn open_locked_exclusive(path: &Path, timeout: Duration) -> Result<File, Error> {
+    ensure_parent(path)?;
+    let user = backup_user()?;
+    let options = CreateOptions::new()
+        .perm(Mode::from_bits_truncate(0o660))
+        .owner(user.uid)
+        .group(user.gid);
+    open_file_locked(path, timeout, true, options)
+        .with_context(|| format!("failed to acquire exclusive move-journal lock at {path:?}"))
+}
+
+/// Append one or more absolute index-file paths to the journal under a brief exclusive flock. The
+/// caller passes the *post-rename* paths; the rename must happen only after this returns.
+pub fn append_moved_indices(datastore_name: &str, paths: &[PathBuf]) -> Result<(), Error> {
+    if paths.is_empty() {
+        return Ok(());
+    }
+
+    let mut buf = Vec::new();
+    for path in paths {
+        if !path.is_absolute() {
+            bail!("move-journal: refusing to record non-absolute path {path:?}");
+        }
+        let s = path
+            .to_str()
+            .ok_or_else(|| format_err!("move-journal: non-UTF-8 path {path:?}"))?;
+        if s.as_bytes().contains(&b'\n') {
+            bail!("move-journal: path contains newline {path:?}");
+        }
+        buf.extend_from_slice(s.as_bytes());
+        buf.push(b'\n');
+    }
+
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, APPEND_LOCK_TIMEOUT)?;
+    file.seek(SeekFrom::End(0))?; // the lock helper does not open with O_APPEND
+    file.write_all(&buf).context("failed to append to move journal")?;
+    Ok(())
+}
+
+/// Drain the journal under an exclusive lock, calling `f` for each recorded path. Blocks only for
+/// the brief window of a concurrent append. After the callback runs on every entry, the journal is
+/// truncated under the same lock.
+///
+/// On a processing error the entry is left in the journal (no truncate) so the next GC will retry.
+pub fn drain_move_journal<F>(datastore_name: &str, mut f: F) -> Result<(), Error>
+where
+    F: FnMut(&Path) -> Result<(), Error>,
+{
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, DRAIN_LOCK_TIMEOUT)?;
+
+    file.seek(SeekFrom::Start(0))
+        .context("failed to rewind move journal for draining")?;
+    let mut contents = String::new();
+    file.read_to_string(&mut contents)
+        .context("failed to read move journal")?;
+
+    for line in contents.lines() {
+        let entry = line.trim();
+        if entry.is_empty() {
+            continue;
+        }
+        f(Path::new(entry))
+            .with_context(|| format!("move-journal: processing '{entry}' failed"))?;
+    }
+
+    file.set_len(0)
+        .context("failed to truncate move journal after drain")?;
+    Ok(())
+}
-- 
2.47.3






Thread overview: 21+ messages
2026-04-22 13:39 [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 01/13] ui: show empty groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 02/13] datastore: lift check_namespace_depth_limit to pbs-datastore Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 03/13] datastore: have BackupGroup::destroy consume the group lock Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 04/13] datastore: split remove_namespace into flat and recursive variants Hannes Laimer
2026-04-22 13:39 ` Hannes Laimer [this message]
2026-04-22 13:39 ` [PATCH proxmox-backup v8 06/13] datastore: add move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 07/13] datastore: add move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 08/13] docs: add section on moving namespaces and groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 09/13] api: add POST endpoint for move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 10/13] api: add POST endpoint for move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 11/13] ui: add move group action Hannes Laimer
2026-04-23 13:35   ` Michael Köppl
2026-04-23 13:47     ` Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 12/13] ui: add move namespace action Hannes Laimer
2026-04-23 14:49   ` Michael Köppl
2026-04-22 13:39 ` [PATCH proxmox-backup v8 13/13] cli: add move-namespace and move-group commands Hannes Laimer
2026-04-23 16:29 ` [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Michael Köppl
2026-04-23 22:38 ` applied: " Thomas Lamprecht
2026-04-24  8:31   ` Fabian Grünbichler
2026-04-24  8:43     ` Hannes Laimer
