From: Hannes Laimer <h.laimer@proxmox.com>
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v8 05/13] datastore: add move journal for coordinating with gc phase 1
Date: Wed, 22 Apr 2026 15:39:43 +0200 [thread overview]
Message-ID: <20260422133951.192862-6-h.laimer@proxmox.com> (raw)
In-Reply-To: <20260422133951.192862-1-h.laimer@proxmox.com>
Close a data-loss race between move_group/move_namespace and a
concurrent gc phase-1 mark: if a snapshot is moved between
list_index_files() and the hierarchy iteration, and the target namespace
happens to be iterated before the source, neither pass sees the moved
index and phase 2 could sweep its chunks.
Add a per-datastore write-ahead journal at
/run/proxmox-backup/locks/<datastore>/move-journal that GC drains at the
end of phase-1 mark. No caller writes to it yet; the following commits
will use it.
See the `move_journal` module doc for the protocol and the invariant
that makes it correct.
Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
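Notes: the mover-side ordering the series relies on (journal entry first, rename
second) can be sketched with std only. `record_move` and `move_index` below are
hypothetical stand-ins for `append_moved_indices` and the mover, without the
flock and path validation the real code does:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Hypothetical stand-in for `append_moved_indices`: one newline-terminated
/// entry per post-rename path, appended before the rename happens.
fn record_move(journal: &Path, new_path: &Path) -> std::io::Result<()> {
    let mut f = fs::OpenOptions::new()
        .create(true)
        .append(true)
        .open(journal)?;
    writeln!(f, "{}", new_path.display())
}

/// Move one index file while upholding the invariant "if the new path exists,
/// a journal entry for it exists too": the journal write strictly precedes
/// the rename, so a crash in between leaves only a harmless ghost entry.
fn move_index(journal: &Path, old: &Path, new: &Path) -> std::io::Result<()> {
    record_move(journal, new)?; // write-ahead: entry first ...
    fs::rename(old, new) // ... rename second
}
```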
pbs-datastore/src/datastore.rs | 18 ++++
pbs-datastore/src/lib.rs | 1 +
pbs-datastore/src/move_journal.rs | 149 ++++++++++++++++++++++++++++++
3 files changed, 168 insertions(+)
create mode 100644 pbs-datastore/src/move_journal.rs
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index bce0d846..319b4cba 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1800,6 +1800,24 @@ impl DataStore {
warn!("Found {strange_paths_count} index files outside of expected directory scheme");
}
+ // Drain the move journal under an exclusive flock. Any index file whose path
+ // was recorded before its rename is processed here even if the namespace
+ // iteration missed it at both its old and new locations. See `move_journal`
+ // for details.
+ crate::move_journal::drain_move_journal(self.name(), |path| {
+ let Some(index) = self.open_index_reader(path)? else {
+ return Ok(());
+ };
+ self.index_mark_used_chunks(
+ index,
+ path,
+ &mut chunk_lru_cache,
+ status,
+ worker,
+ s3_client.as_ref().cloned(),
+ )
+ })?;
+
Ok(())
}
diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 4d8ac505..6647ee2b 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -192,6 +192,7 @@ pub mod data_blob_reader;
pub mod file_formats;
pub mod index;
pub mod manifest;
+pub mod move_journal;
pub mod paperkey;
pub mod prune;
pub mod read_chunk;
diff --git a/pbs-datastore/src/move_journal.rs b/pbs-datastore/src/move_journal.rs
new file mode 100644
index 00000000..891644d7
--- /dev/null
+++ b/pbs-datastore/src/move_journal.rs
@@ -0,0 +1,149 @@
+//! Per-datastore journal used to coordinate snapshot renames with a concurrent garbage collection
+//! phase-1 mark.
+//!
+//! # Race fixed
+//!
+//! GC phase 1 first calls `list_index_files()` to snapshot the set of absolute index-file paths in
+//! the datastore, then iterates namespaces live and touches the atime of every referenced chunk.
+//! If a `move_group`/`move_namespace` relocates a snapshot between those two steps, and the target
+//! namespace is visited by GC before the source (`readdir(2)` order, not deterministic), the moved
+//! index is at neither location when GC looks: missing from the target (already iterated) and
+//! missing from the source (iterated after the rename). Its old path lands in the leftover
+//! `unprocessed_index_list` and is discarded as a vanished file. Chunks referenced only by that
+//! index never get their atime bumped and phase 2 sweeps them.
+//!
+//! # Protocol
+//!
+//! Write-ahead journal for renames:
+//!
+//! - **Before** renaming a snapshot, the move records the new path of each index file it is about
+//! to create, under a brief exclusive flock.
+//! - At the end of phase-1 mark, GC acquires the same exclusive flock, reads every recorded path,
+//! runs the normal `index_mark_used_chunks` on each, truncates, and releases before entering phase 2.
+//!
+//! Why write-before-rename rather than write-after-rename with a long-held shared lock by each
+//! mover: the invariant is "if the new path exists, a journal entry for it exists too". So the
+//! drain - which runs only after iteration finishes - is guaranteed to catch anything iteration
+//! missed:
+//!
+//! - If the source-ns iteration found the index at the old path, its chunks are already marked.
+//! The journal entry is then either a redundant re-mark (rename completed before the drain; the
+//! LRU cache dedups it) or a no-op skip (rename not yet done, `open_index_reader` returns `None`)
+//! - harmless either way.
+//! - If the source-ns iteration missed it, then the rename already happened by the time iteration
+//! reached source, which is before the drain, so `open_index_reader(new_path)` at drain time
+//! succeeds and marks the chunks.
+//!
+//! A move that crashes between the journal write and the rename leaves a "ghost" entry. The
+//! drain's `open_index_reader` returns `None` and skips, and the truncate step clears it. This is
+//! handled by the existing vanished-file logic in the caller.
+//!
+//! The file lives under `/run/proxmox-backup/locks/<datastore>/move-journal`. Tmpfs is correct
+//! here: a reboot aborts any in-progress GC, and the next GC rebuilds state from a fresh
+//! `list_index_files()` against the post-move filesystem - there is nothing worth persisting.
+
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::time::Duration;
+
+use anyhow::{bail, format_err, Context, Error};
+use nix::sys::stat::Mode;
+
+use proxmox_sys::fs::{open_file_locked, CreateOptions};
+
+use pbs_config::backup_user;
+
+use crate::backup_info::DATASTORE_LOCKS_DIR;
+
+const JOURNAL_FILENAME: &str = "move-journal";
+const APPEND_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+// Long enough to cover any in-flight append, if it takes longer than this something is very wrong
+// and we'd rather fail GC than hang forever.
+const DRAIN_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+
+fn journal_path(datastore_name: &str) -> PathBuf {
+ Path::new(DATASTORE_LOCKS_DIR)
+ .join(datastore_name)
+ .join(JOURNAL_FILENAME)
+}
+
+fn ensure_parent(path: &Path) -> Result<(), Error> {
+ if let Some(parent) = path.parent() {
+ std::fs::create_dir_all(parent)
+ .with_context(|| format!("failed to create move-journal parent dir {parent:?}"))?;
+ }
+ Ok(())
+}
+
+fn open_locked_exclusive(path: &Path, timeout: Duration) -> Result<File, Error> {
+ ensure_parent(path)?;
+ let user = backup_user()?;
+ let options = CreateOptions::new()
+ .perm(Mode::from_bits_truncate(0o660))
+ .owner(user.uid)
+ .group(user.gid);
+ open_file_locked(path, timeout, true, options)
+ .with_context(|| format!("failed to acquire exclusive move-journal lock at {path:?}"))
+}
+
+/// Append one or more absolute index-file paths to the journal under a brief exclusive flock. The
+/// caller passes the *post-rename* paths; the rename must happen after this returns.
+pub fn append_moved_indices(datastore_name: &str, paths: &[PathBuf]) -> Result<(), Error> {
+ if paths.is_empty() {
+ return Ok(());
+ }
+
+ let mut buf = Vec::new();
+ for path in paths {
+ if !path.is_absolute() {
+ bail!("move-journal: refusing to record non-absolute path {path:?}");
+ }
+ let s = path
+ .to_str()
+ .ok_or_else(|| format_err!("move-journal: non-UTF-8 path {path:?}"))?;
+ if s.as_bytes().contains(&b'\n') {
+ bail!("move-journal: path contains newline {path:?}");
+ }
+ buf.extend_from_slice(s.as_bytes());
+ buf.push(b'\n');
+ }
+
+ let path = journal_path(datastore_name);
+ let mut file = open_locked_exclusive(&path, APPEND_LOCK_TIMEOUT)?;
+ file.write_all(&buf)
+ .context("failed to append to move journal")?;
+ Ok(())
+}
+
+/// Drain the journal under an exclusive lock, calling `f` for each recorded path. Blocks only for
+/// the brief window of a concurrent append. After the callback runs on every entry, the journal is
+/// truncated under the same lock.
+///
+/// On a processing error the entry is left in the journal (no truncate) so the next GC will retry.
+pub fn drain_move_journal<F>(datastore_name: &str, mut f: F) -> Result<(), Error>
+where
+ F: FnMut(&Path) -> Result<(), Error>,
+{
+ let path = journal_path(datastore_name);
+ let mut file = open_locked_exclusive(&path, DRAIN_LOCK_TIMEOUT)?;
+
+ file.seek(SeekFrom::Start(0))
+ .context("failed to rewind move journal for draining")?;
+ let mut contents = String::new();
+ file.read_to_string(&mut contents)
+ .context("failed to read move journal")?;
+
+ for line in contents.lines() {
+ let entry = line.trim();
+ if entry.is_empty() {
+ continue;
+ }
+ f(Path::new(entry))
+ .with_context(|| format!("move-journal: processing '{entry}' failed"))?;
+ }
+
+ file.set_len(0)
+ .context("failed to truncate move journal after drain")?;
+ Ok(())
+}
--
2.47.3
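The drain side can likewise be sketched with std only. `drain` below is a
hypothetical counterpart of `drain_move_journal`, omitting the flock and the
`open_index_reader` call; it keeps the patch's key property that the truncate
happens only after every entry was processed successfully:

```rust
use std::fs::File;
use std::io::Read;
use std::path::{Path, PathBuf};

/// Hypothetical std-only counterpart of `drain_move_journal`: run the callback
/// on every non-empty recorded line, then truncate. On a callback error the
/// journal is left untouched, so the next drain retries the entry.
fn drain(journal: &Path, mut f: impl FnMut(&Path) -> std::io::Result<()>) -> std::io::Result<()> {
    let mut contents = String::new();
    File::open(journal)?.read_to_string(&mut contents)?;
    for line in contents.lines().map(str::trim).filter(|l| !l.is_empty()) {
        f(Path::new(line))?; // error => no truncate, entry survives for retry
    }
    File::create(journal).map(drop) // truncate only after full success
}
```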
2026-04-22 13:39 [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 01/13] ui: show empty groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 02/13] datastore: lift check_namespace_depth_limit to pbs-datastore Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 03/13] datastore: have BackupGroup::destroy consume the group lock Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 04/13] datastore: split remove_namespace into flat and recursive variants Hannes Laimer
2026-04-22 13:39 ` Hannes Laimer [this message]
2026-04-22 13:39 ` [PATCH proxmox-backup v8 06/13] datastore: add move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 07/13] datastore: add move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 08/13] docs: add section on moving namespaces and groups Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 09/13] api: add POST endpoint for move-group Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 10/13] api: add POST endpoint for move-namespace Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 11/13] ui: add move group action Hannes Laimer
2026-04-23 13:35 ` Michael Köppl
2026-04-23 13:47 ` Hannes Laimer
2026-04-22 13:39 ` [PATCH proxmox-backup v8 12/13] ui: add move namespace action Hannes Laimer
2026-04-23 14:49 ` Michael Köppl
2026-04-22 13:39 ` [PATCH proxmox-backup v8 13/13] cli: add move-namespace and move-group commands Hannes Laimer
2026-04-23 16:29 ` [PATCH proxmox-backup v8 00/13] fixes #6195: add support for moving groups and namespaces Michael Köppl
2026-04-23 22:38 ` applied: " Thomas Lamprecht
2026-04-24 8:31 ` Fabian Grünbichler
2026-04-24 8:43 ` Hannes Laimer