From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id E9C671FF13B
	for ; Wed, 22 Apr 2026 15:40:43 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 99D7820ABF;
	Wed, 22 Apr 2026 15:40:43 +0200 (CEST)
From: Hannes Laimer
To: pbs-devel@lists.proxmox.com
Subject: [PATCH proxmox-backup v8 05/13] datastore: add move journal for
 coordinating with gc phase 1
Date: Wed, 22 Apr 2026 15:39:43 +0200
Message-ID: <20260422133951.192862-6-h.laimer@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260422133951.192862-1-h.laimer@proxmox.com>
References: <20260422133951.192862-1-h.laimer@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
List-Id: Proxmox Backup Server development discussion

Close a data-loss race between move_group/move_namespace and a
concurrent gc phase-1 mark: if a snapshot is moved between
list_index_files() and the hierarchy iteration, and the target namespace
happens to be iterated before the source, neither pass sees the moved
index and phase 2 could sweep its chunks.

Add a per-datastore write-ahead journal at
/run/proxmox-backup/locks/<datastore>/move-journal that GC drains at the
end of phase-1 mark. No caller writes to it yet; the following commits
will use it. See the `move_journal` module doc for the protocol and the
invariant that makes it correct.

Signed-off-by: Hannes Laimer
---
 pbs-datastore/src/datastore.rs    |  18 ++++
 pbs-datastore/src/lib.rs          |   1 +
 pbs-datastore/src/move_journal.rs | 149 ++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 pbs-datastore/src/move_journal.rs

diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index bce0d846..319b4cba 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -1800,6 +1800,24 @@ impl DataStore {
             warn!("Found {strange_paths_count} index files outside of expected directory scheme");
         }
 
+        // Drain the move journal under an exclusive flock. Any index file whose path
+        // was recorded before its rename is processed here even if the namespace
+        // iteration missed it at both its old and new locations. See `move_journal`
+        // for details.
+        crate::move_journal::drain_move_journal(self.name(), |path| {
+            let Some(index) = self.open_index_reader(path)?
+            else {
+                return Ok(());
+            };
+            self.index_mark_used_chunks(
+                index,
+                path,
+                &mut chunk_lru_cache,
+                status,
+                worker,
+                s3_client.as_ref().cloned(),
+            )
+        })?;
+
         Ok(())
     }

diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
index 4d8ac505..6647ee2b 100644
--- a/pbs-datastore/src/lib.rs
+++ b/pbs-datastore/src/lib.rs
@@ -192,6 +192,7 @@
 pub mod data_blob_reader;
 pub mod file_formats;
 pub mod index;
 pub mod manifest;
+pub mod move_journal;
 pub mod paperkey;
 pub mod prune;
 pub mod read_chunk;

diff --git a/pbs-datastore/src/move_journal.rs b/pbs-datastore/src/move_journal.rs
new file mode 100644
index 00000000..891644d7
--- /dev/null
+++ b/pbs-datastore/src/move_journal.rs
@@ -0,0 +1,149 @@
+//! Per-datastore journal used to coordinate snapshot renames with a concurrent garbage collection
+//! phase-1 mark.
+//!
+//! # Race fixed
+//!
+//! GC phase 1 first calls `list_index_files()` to snapshot the set of absolute index-file paths in
+//! the datastore, then iterates namespaces live and touches the atime of every referenced chunk.
+//! If a `move_group`/`move_namespace` relocates a snapshot between those two steps, and the target
+//! namespace is visited by GC before the source (`readdir(2)` order, not deterministic), the moved
+//! index is at neither location when GC looks: missing from the target (already iterated) and
+//! missing from the source (iterated after the rename). Its old path lands in the leftover
+//! `unprocessed_index_list` and is discarded as a vanished file. Chunks referenced only by that
+//! index never get their atime bumped and phase 2 sweeps them.
+//!
+//! # Protocol
+//!
+//! Write-ahead journal for renames:
+//!
+//! - **Before** renaming a snapshot, the move records the new path of each index file it is about
+//!   to create, under a brief exclusive flock.
+//! - At the end of phase-1 mark, GC acquires the same exclusive flock, reads every recorded path,
+//!   runs the normal `index_mark_used_chunks` on each, truncates, and releases before entering
+//!   phase 2.
+//!
+//! Why write-before-rename rather than write-after-rename with a long-held shared lock by each
+//! mover: the invariant is "if the new path exists, a journal entry for it exists too". So the
+//! drain - which runs only after iteration finishes - is guaranteed to catch anything iteration
+//! missed:
+//!
+//! - If the source-ns iteration found the index at the old path, its chunks are already marked.
+//!   The journal entry is then either a redundant re-mark (rename completed before the drain, the
+//!   LRU cache dedups it) or a no-op skip (rename not yet done, `open_index_reader` returns
+//!   `None`) - harmless either way.
+//! - If the source-ns iteration missed it, then the rename already happened by the time iteration
+//!   reached the source, which is before the drain, so `open_index_reader(new_path)` at drain
+//!   time succeeds and marks the chunks.
+//!
+//! A move that crashes between the journal write and the rename leaves a "ghost" entry. The
+//! drain's `open_index_reader` returns `None` and skips it, and the truncate step clears it. This
+//! is handled by the existing vanished-file logic in the caller.
+//!
+//! The file lives under `/run/proxmox-backup/locks/<datastore>/move-journal`. Tmpfs is correct
+//! here: a reboot aborts any in-progress GC, and the next GC rebuilds state from a fresh
+//! `list_index_files()` against the post-move filesystem - there is nothing worth persisting.
+
+use std::fs::File;
+use std::io::{Read, Seek, SeekFrom, Write};
+use std::path::{Path, PathBuf};
+use std::time::Duration;
+
+use anyhow::{bail, format_err, Context, Error};
+use nix::sys::stat::Mode;
+
+use proxmox_sys::fs::{open_file_locked, CreateOptions};
+
+use pbs_config::backup_user;
+
+use crate::backup_info::DATASTORE_LOCKS_DIR;
+
+const JOURNAL_FILENAME: &str = "move-journal";
+const APPEND_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+// Long enough to cover any in-flight append; if it takes longer than this something is very wrong
+// and we'd rather fail GC than hang forever.
+const DRAIN_LOCK_TIMEOUT: Duration = Duration::from_secs(10);
+
+fn journal_path(datastore_name: &str) -> PathBuf {
+    Path::new(DATASTORE_LOCKS_DIR)
+        .join(datastore_name)
+        .join(JOURNAL_FILENAME)
+}
+
+fn ensure_parent(path: &Path) -> Result<(), Error> {
+    if let Some(parent) = path.parent() {
+        std::fs::create_dir_all(parent)
+            .with_context(|| format!("failed to create move-journal parent dir {parent:?}"))?;
+    }
+    Ok(())
+}
+
+fn open_locked_exclusive(path: &Path, timeout: Duration) -> Result<File, Error> {
+    ensure_parent(path)?;
+    let user = backup_user()?;
+    let options = CreateOptions::new()
+        .perm(Mode::from_bits_truncate(0o660))
+        .owner(user.uid)
+        .group(user.gid);
+    open_file_locked(path, timeout, true, options)
+        .with_context(|| format!("failed to acquire exclusive move-journal lock at {path:?}"))
+}
+
+/// Append one or more absolute index-file paths to the journal under a brief exclusive flock. The
+/// caller passes the *post-rename* paths; the rename must happen after this returns.
+pub fn append_moved_indices(datastore_name: &str, paths: &[PathBuf]) -> Result<(), Error> {
+    if paths.is_empty() {
+        return Ok(());
+    }
+
+    let mut buf = Vec::new();
+    for path in paths {
+        if !path.is_absolute() {
+            bail!("move-journal: refusing to record non-absolute path {path:?}");
+        }
+        let s = path
+            .to_str()
+            .ok_or_else(|| format_err!("move-journal: non-UTF-8 path {path:?}"))?;
+        if s.as_bytes().contains(&b'\n') {
+            bail!("move-journal: path contains newline {path:?}");
+        }
+        buf.extend_from_slice(s.as_bytes());
+        buf.push(b'\n');
+    }
+
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, APPEND_LOCK_TIMEOUT)?;
+    file.write_all(&buf)
+        .context("failed to append to move journal")?;
+    Ok(())
+}
+
+/// Drain the journal under an exclusive lock, calling `f` for each recorded path. Blocks only for
+/// the brief window of a concurrent append. After the callback runs on every entry, the journal
+/// is truncated under the same lock.
+///
+/// On a processing error the entry is left in the journal (no truncate) so the next GC will retry.
+pub fn drain_move_journal<F>(datastore_name: &str, mut f: F) -> Result<(), Error>
+where
+    F: FnMut(&Path) -> Result<(), Error>,
+{
+    let path = journal_path(datastore_name);
+    let mut file = open_locked_exclusive(&path, DRAIN_LOCK_TIMEOUT)?;
+
+    file.seek(SeekFrom::Start(0))
+        .context("failed to rewind move journal for draining")?;
+    let mut contents = String::new();
+    file.read_to_string(&mut contents)
+        .context("failed to read move journal")?;
+
+    for line in contents.lines() {
+        let entry = line.trim();
+        if entry.is_empty() {
+            continue;
+        }
+        f(Path::new(entry))
+            .with_context(|| format!("move-journal: processing '{entry}' failed"))?;
+    }
+
+    file.set_len(0)
+        .context("failed to truncate move journal after drain")?;
+    Ok(())
+}
-- 
2.47.3