From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id E9C671FF13B
	for ; Wed, 22 Apr 2026 15:40:43 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 99D7820ABF;
	Wed, 22 Apr 2026 15:40:43 +0200 (CEST)
From: Hannes Laimer
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v8 05/13] datastore: add move journal for
 coordinating with gc phase 1
Date: Wed, 22 Apr 2026 15:39:43 +0200
Message-ID: <20260422133951.192862-6-h.laimer@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260422133951.192862-1-h.laimer@proxmox.com>
References: <20260422133951.192862-1-h.laimer@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
List-Id: Proxmox Backup Server development discussion

Close a data-loss race between move_group/move_namespace and a
concurrent gc phase-1 mark: if a snapshot is moved between
list_index_files() and the hierarchy iteration, and the target namespace
happens to be iterated before the source, neither pass sees the moved
index and phase 2 could sweep its chunks.

Add a per-datastore write-ahead journal at
/run/proxmox-backup/locks/<datastore>/move-journal that GC drains at the
end of phase-1 mark. No caller writes to it yet; the following commits
will use it. See the `move_journal` module doc for the protocol and the
invariant that makes it correct.

Signed-off-by: Hannes Laimer
---
 pbs-datastore/src/datastore.rs    |  18 ++++
 pbs-datastore/src/lib.rs          |   1 +
 pbs-datastore/src/move_journal.rs | 149 ++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 pbs-datastore/src/move_journal.rs

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index bce0d846..319b4cba 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1800,6 +1800,24 @@ impl DataStore {
             warn!("Found {strange_paths_count} index files outside of expected directory scheme");
         }
 
+        // Drain the move journal under an exclusive flock. Any index file whose path
+        // was recorded before its rename is processed here even if the namespace
+        // iteration missed it at both its old and new locations. See `move_journal`
+        // for details.
+        crate::move_journal::drain_move_journal(self.name(), |path| {
+            let Some(index) = self.open_index_reader(path)?
+            else {
+                return Ok(());
+            };
+            self.index_mark_used_chunks(
+                index,
+                path,
+                &mut chunk_lru_cache,
+                status,
+                worker,
+                s3_client.as_ref().cloned(),
+            )
+        })?;
+
         Ok(())
     }

diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 4d8ac505..6647ee2b 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -192,6 +192,7 @@
 pub mod data_blob_reader;
 pub mod file_formats;
 pub mod index;
 pub mod manifest;
+pub mod move_journal;
 pub mod paperkey;
 pub mod prune;
 pub mod read_chunk;

diff --git a/pbs-datastore/src/move_journal.rs b/pbs-datastore/src/move_journal.rs
new file mode 100644
index 00000000..891644d7
--- /dev/null
+++ b/pbs-datastore/src/move_journal.rs
@@ -0,0 +1,149 @@
+//! Per-datastore journal used to coordinate snapshot renames with a concurrent garbage collection
+//! phase-1 mark.
+//!
+//! # Race fixed
+//!
+//! GC phase 1 first calls `list_index_files()` to snapshot the set of absolute index-file paths in
+//! the datastore, then iterates namespaces live and touches the atime of every referenced chunk.
+//! If a `move_group`/`move_namespace` relocates a snapshot between those two steps, and the target
+//! namespace is visited by GC before the source (`readdir(2)` order, not deterministic), the moved
+//! index is at neither location when GC looks: missing from the target (already iterated) and
+//! missing from the source (iterated after the rename). Its old path lands in the leftover
+//! `unprocessed_index_list` and is discarded as a vanished file. Chunks referenced only by that
+//! index never get their atime bumped and phase 2 sweeps them.
+//!
+//! # Protocol
+//!
+//! Write-ahead journal for renames:
+//!
+//! - **Before** renaming a snapshot, the move records the new path of each index file it is about
+//!   to create, under a brief exclusive flock.
+//! - At the end of phase-1 mark, GC acquires the same exclusive flock, reads every recorded path,
+//!   runs the normal `index_mark_used_chunks` on each, truncates, and releases before entering
+//!   phase 2.
+//!
+//! Why write-before-rename rather than write-after-rename with a long-held shared lock by each
+//! mover: the invariant is "if the new path exists, a journal entry for it exists too". So the
+//! drain - which runs only after iteration finishes - is guaranteed to catch anything iteration
+//! missed:
+//!
+//! - If the source-ns iteration found the index at the old path, its chunks are already marked.
+//!   The journal entry is then either a redundant re-mark (rename completed before the drain, the
+//!   LRU cache dedups it) or a no-op skip (rename not yet done, `open_index_reader` returns
+//!   `None`) - harmless either way.
+//! - If the source-ns iteration missed it, then the rename already happened by the time iteration
+//!   reached the source, which is before the drain, so `open_index_reader(new_path)` at drain
+//!   time succeeds and marks the chunks.
+//!
+//! A move that crashes between the journal write and the rename leaves a "ghost" entry. The
+//! drain's `open_index_reader` returns `None` and skips it, and the truncate step clears it. This
+//! is handled by the existing vanished-file logic in the caller.
+//!
+//! The file lives under `/run/proxmox-backup/locks/<datastore>/move-journal`. Tmpfs is correct
+//! here: a reboot aborts any in-progress GC, and the next GC rebuilds state from a fresh
+//! `list_index_files()` against the post-move filesystem - there is nothing worth persisting.
+
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::time::Duration;
+
+use anyhow::{bail, format_err, Context, Error};
+use nix::sys::stat::Mode;
+
+use proxmox_sys::fs::{open_file_locked, CreateOptions};
+
+use pbs_config::backup_user;
+
+use crate::backup_info::DATASTORE_LOCKS_DIR;
+
+const JOURNAL_FILENAME: &str = "move-journal";
+const APPEND_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+// Long enough to cover any in-flight append; if it takes longer than this something is very wrong
+// and we'd rather fail GC than hang forever.
+const DRAIN_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+
+fn journal_path(datastore_name: &str) -> PathBuf {
+    Path::new(DATASTORE_LOCKS_DIR)
+        .join(datastore_name)
+        .join(JOURNAL_FILENAME)
+}
+
+fn ensure_parent(path: &Path) -> Result<(), Error> {
+    if let Some(parent) = path.parent() {
+        std::fs::create_dir_all(parent)
+            .with_context(|| format!("failed to create move-journal parent dir {parent:?}"))?;
+    }
+    Ok(())
+}
+
+fn open_locked_exclusive(path: &Path, timeout: Duration) -> Result<File, Error> {
+    ensure_parent(path)?;
+    let user = backup_user()?;
+    let options = CreateOptions::new()
+        .perm(Mode::from_bits_truncate(0o660))
+        .owner(user.uid)
+        .group(user.gid);
+    open_file_locked(path, timeout, true, options)
+        .with_context(|| format!("failed to acquire exclusive move-journal lock at {path:?}"))
+}
+
+/// Append one or more absolute index-file paths to the journal under a brief exclusive flock. The
+/// caller passes the *post-rename* paths; the rename must happen after this returns.
+pub fn append_moved_indices(datastore_name: &str, paths: &[PathBuf]) -> Result<(), Error> {
+    if paths.is_empty() {
+        return Ok(());
+    }
+
+    let mut buf = Vec::new();
+    for path in paths {
+        if !path.is_absolute() {
+            bail!("move-journal: refusing to record non-absolute path {path:?}");
+        }
+        let s = path
+            .to_str()
+            .ok_or_else(|| format_err!("move-journal: non-UTF-8 path {path:?}"))?;
+        if s.as_bytes().contains(&b'\n') {
+            bail!("move-journal: path contains newline {path:?}");
+        }
+        buf.extend_from_slice(s.as_bytes());
+        buf.push(b'\n');
+    }
+
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, APPEND_LOCK_TIMEOUT)?;
+    file.write_all(&buf)
+        .context("failed to append to move journal")?;
+    Ok(())
+}
+
+/// Drain the journal under an exclusive lock, calling `f` for each recorded path. Blocks only for
+/// the brief window of a concurrent append. After the callback runs on every entry, the journal
+/// is truncated under the same lock.
+///
+/// On a processing error the entry is left in the journal (no truncate) so the next GC will retry.
+pub fn drain_move_journal<F>(datastore_name: &str, mut f: F) -> Result<(), Error>
+where
+    F: FnMut(&Path) -> Result<(), Error>,
+{
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, DRAIN_LOCK_TIMEOUT)?;
+
+    file.seek(SeekFrom::Start(0))
+        .context("failed to rewind move journal for draining")?;
+    let mut contents = String::new();
+    file.read_to_string(&mut contents)
+        .context("failed to read move journal")?;
+
+    for line in contents.lines() {
+        let entry = line.trim();
+        if entry.is_empty() {
+            continue;
+        }
+        f(Path::new(entry))
+            .with_context(|| format!("move-journal: processing '{entry}' failed"))?;
+    }
+
+    file.set_len(0)
+        .context("failed to truncate move journal after drain")?;
+    Ok(())
+}
-- 
2.47.3