* [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs
@ 2025-03-24 12:51 Shannon Sterz
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers Shannon Sterz
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Shannon Sterz @ 2025-03-24 12:51 UTC (permalink / raw)
To: pbs-devel
The goal of this series is to make it safer to remove backup groups &
snapshots by separating the corresponding directories from their lock
files. By moving the lock files to the tmpfs-backed '/run' directory,
we also make sure that the lock files get cleaned up when the system
reboots.
This series refactors the locking mechanism inside the `DataStore`,
`BackupDir` and `BackupGroup` types. In a first step, locking methods
are added and the existing code is refactored to use them. Commit two
derives a lock file name under '/run' for each group/snapshot. It also
adds double stat'ing. To avoid issues when upgrading, the file
`/run/proxmox-backup/old-locking` is created through a post-install
hook which is used to determine whether the system has been rebooted
and we can safely use the new locking mechanism.
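The reboot detection just described can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the series' actual code (which checks the marker via `std::fs::exists` behind a `LazyLock`; `Path::try_exists` is used here as the older equivalent):

```rust
use std::path::Path;

// The postinst hook touches a marker file under the tmpfs-backed /run,
// so the marker disappears on reboot. While it exists, the daemon must
// keep using the old directory-based locks.
fn use_old_locking(marker: &Path) -> bool {
    // treat errors (e.g. permission problems) as "marker absent" for
    // the purposes of this sketch
    marker.try_exists().unwrap_or(false)
}

fn main() {
    let dir = std::env::temp_dir().join("old-locking-demo");
    std::fs::create_dir_all(&dir).unwrap();
    let marker = dir.join("old-locking");

    std::fs::write(&marker, b"").unwrap();
    assert!(use_old_locking(&marker)); // marker present: stay on old locking

    std::fs::remove_file(&marker).unwrap(); // a reboot clears tmpfs
    assert!(!use_old_locking(&marker)); // safe to switch to new locking

    println!("ok");
}
```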
The third commit refactors locking for manifests and brings it in line
with the group/snapshot locks. Finally, the last commit fixes a race
condition when changing the owner of a datastore.
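The lock file naming scheme introduced in commit two can be sketched roughly like this. This is a hypothetical simplification: the real code uses `proxmox_systemd::escape_unit` for the encoding and appends both the last 80 characters and a sha256 hash in the over-long case; here only the shape of the scheme is shown:

```rust
use std::path::PathBuf;

// Derive a lock file name under /run from an object's relative path:
// namespaces are joined with ':', and over-long encoded names fall back
// to a truncated form inside a "hashed" subdirectory.
fn lock_file_name(namespaces: &[&str], rel: &str) -> PathBuf {
    let ns_dir = namespaces.join(":");
    // stand-in for escape_unit(): escape '/' so the whole relative path
    // becomes a single file name component
    let enc: String = rel
        .chars()
        .map(|c| if c == '/' { "\\x2f".to_string() } else { c.to_string() })
        .collect();
    if enc.len() < 255 {
        PathBuf::from(ns_dir).join(enc)
    } else {
        // real code: first 80 chars + "..." + last 80 chars + sha256 of path
        PathBuf::from(ns_dir)
            .join("hashed")
            .join(format!("{}...", &enc[..80]))
    }
}

fn main() {
    let p = lock_file_name(&["ns1", "ns2"], "vm/100/2022-05-02T08:11:33Z");
    assert_eq!(
        p,
        PathBuf::from("ns1:ns2/vm\\x2f100\\x2f2022-05-02T08:11:33Z")
    );
    println!("{}", p.display());
}
```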
----
changes from v7 (thanks @ Christian Ebner):
* use anyhow's `Context` to provide more context at the call site of a
locking helper call
* rebase on top of current master to apply cleanly again
changes from v6:
* add old-locking safeguards to avoid different versions of the locking
mechanism being used at the same time (see discussion here [1]).
[1]: https://lore.proxmox.com/pbs-devel/20250306120810.361035-1-m.sandoval@proxmox.com/T/#u
changes from v5:
* re-phrase commit messages to make it clear which commit actually
fixes the issue and what the commit implies in terms of semantic
changes for error messages (thanks @ Thomas Lamprecht)
* make it so the series applies cleanly again and clean up a newly
added usage of `lock_dir_noblock`
changes from v4 (thanks @ Wolfgang Bumiller):
* stop using `to_string_lossy()`
* switch function signature of `create_locked_backup_group()` and
`create_locked_backup_dir` to use the `Arc` version of a datastore.
* smaller clippy fixes
* rebased on current master
changes from v3:
* moved patch 2 to the front so it can be applied separately more
easily
* rebased on current master
changes from v2:
* different encoding scheme for lock file names
* refactored locking methods to be used by the new BackupDir and
BackupGroup types
* adapted lock file names to include namespaces
changes from v1 (thanks @ Wolfgang Bumiller & Thomas Lamprecht):
* split adding locking helpers and moving locks to '/run' into two commits
* instead of stat'ing the path of the lock file twice, use the file
descriptor for one of the stat'ing procedures
* several improvements to helper functions and documentation
Shannon Sterz (4):
datastore/api/backup: prepare for fix of #3935 by adding lock helpers
fix #3935: datastore/api/backup: move datastore locking to '/run'
fix #3935: datastore: move manifest locking to new locking method
fix: api: avoid race condition in set_backup_owner
Cargo.toml | 2 +-
debian/postinst | 5 +
pbs-config/src/lib.rs | 32 +++-
pbs-datastore/Cargo.toml | 1 +
pbs-datastore/src/backup_info.rs | 236 ++++++++++++++++++++++++---
pbs-datastore/src/datastore.rs | 86 +++++-----
pbs-datastore/src/snapshot_reader.rs | 20 ++-
src/api2/admin/datastore.rs | 13 +-
src/api2/backup/environment.rs | 21 +--
src/api2/backup/mod.rs | 13 +-
src/api2/reader/mod.rs | 11 +-
src/backup/verify.rs | 12 +-
src/server/sync.rs | 13 +-
13 files changed, 342 insertions(+), 123 deletions(-)
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 17+ messages in thread
* [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
@ 2025-03-24 12:51 ` Shannon Sterz
2025-03-25 9:35 ` Christian Ebner
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
` (3 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Shannon Sterz @ 2025-03-24 12:51 UTC (permalink / raw)
To: pbs-devel
to avoid duplicate code, add helpers for locking groups and snapshots
to the BackupGroup and BackupDir types respectively and refactor
existing code to use them.
this also adapts error handling by adding relevant context to each
locking helper call site. otherwise, we might lose valuable
information useful for debugging. note, however, that code that
relied on specific error messages will break.
Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
---
pbs-datastore/src/backup_info.rs | 45 ++++++++++++---
pbs-datastore/src/datastore.rs | 82 +++++++++++++---------------
pbs-datastore/src/snapshot_reader.rs | 9 ++-
src/api2/backup/environment.rs | 16 +++---
src/api2/backup/mod.rs | 13 ++---
src/api2/reader/mod.rs | 11 ++--
src/backup/verify.rs | 12 ++--
src/server/sync.rs | 13 ++---
8 files changed, 106 insertions(+), 95 deletions(-)
diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index ef2b982c..c7f7ed53 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -3,9 +3,11 @@ use std::os::unix::io::RawFd;
use std::path::PathBuf;
use std::sync::Arc;
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
-use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
+use proxmox_sys::fs::{
+ lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
+};
use pbs_api_types::{
Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
@@ -199,9 +201,10 @@ impl BackupGroup {
/// Returns `BackupGroupDeleteStats`, containing the number of deleted snapshots
/// and number of protected snaphsots, which therefore were not removed.
pub fn destroy(&self) -> Result<BackupGroupDeleteStats, Error> {
+ let _guard = self
+ .lock()
+ .with_context(|| format!("while destroying group '{self:?}'"))?;
let path = self.full_group_path();
- let _guard =
- proxmox_sys::fs::lock_dir_noblock(&path, "backup group", "possible running backup")?;
log::info!("removing backup group {:?}", path);
let mut delete_stats = BackupGroupDeleteStats::default();
@@ -237,6 +240,15 @@ impl BackupGroup {
self.store
.set_owner(&self.ns, self.as_ref(), auth_id, force)
}
+
+ /// Lock a group exclusively
+ pub fn lock(&self) -> Result<DirLockGuard, Error> {
+ lock_dir_noblock(
+ &self.full_group_path(),
+ "backup group",
+ "possible running backup",
+ )
+ }
}
impl AsRef<pbs_api_types::BackupNamespace> for BackupGroup {
@@ -442,15 +454,33 @@ impl BackupDir {
.map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
}
+ /// Lock this snapshot exclusively
+ pub fn lock(&self) -> Result<DirLockGuard, Error> {
+ lock_dir_noblock(
+ &self.full_path(),
+ "snapshot",
+ "backup is running or snapshot is in use",
+ )
+ }
+
+ /// Acquire a shared lock on this snapshot
+ pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
+ lock_dir_noblock_shared(
+ &self.full_path(),
+ "snapshot",
+ "backup is running or snapshot is in use, could not acquire shared lock",
+ )
+ }
+
/// Destroy the whole snapshot, bails if it's protected
///
/// Setting `force` to true skips locking and thus ignores if the backup is currently in use.
pub fn destroy(&self, force: bool) -> Result<(), Error> {
- let full_path = self.full_path();
-
let (_guard, _manifest_guard);
if !force {
- _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
+ _guard = self
+ .lock()
+ .with_context(|| format!("while destroying snapshot '{self:?}'"))?;
_manifest_guard = self.lock_manifest()?;
}
@@ -458,6 +488,7 @@ impl BackupDir {
bail!("cannot remove protected snapshot"); // use special error type?
}
+ let full_path = self.full_path();
log::info!("removing backup snapshot {:?}", full_path);
std::fs::remove_dir_all(&full_path).map_err(|err| {
format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index a6a91ca7..1cbd0038 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -5,7 +5,7 @@ use std::os::unix::io::AsRawFd;
use std::path::{Path, PathBuf};
use std::sync::{Arc, LazyLock, Mutex};
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
use nix::unistd::{unlinkat, UnlinkatFlags};
use tracing::{info, warn};
@@ -13,8 +13,8 @@ use proxmox_human_byte::HumanByte;
use proxmox_schema::ApiType;
use proxmox_sys::error::SysError;
+use proxmox_sys::fs::DirLockGuard;
use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
-use proxmox_sys::fs::{lock_dir_noblock, DirLockGuard};
use proxmox_sys::linux::procfs::MountInfo;
use proxmox_sys::process_locker::ProcessLockSharedGuard;
use proxmox_worker_task::WorkerTaskContext;
@@ -774,41 +774,35 @@ impl DataStore {
///
/// This also acquires an exclusive lock on the directory and returns the lock guard.
pub fn create_locked_backup_group(
- &self,
+ self: &Arc<Self>,
ns: &BackupNamespace,
backup_group: &pbs_api_types::BackupGroup,
auth_id: &Authid,
) -> Result<(Authid, DirLockGuard), Error> {
- // create intermediate path first:
- let mut full_path = self.base_path();
- for ns in ns.components() {
- full_path.push("ns");
- full_path.push(ns);
- }
- full_path.push(backup_group.ty.as_str());
- std::fs::create_dir_all(&full_path)?;
+ let backup_group = self.backup_group(ns.clone(), backup_group.clone());
- full_path.push(&backup_group.id);
+ // create intermediate path first
+ let full_path = backup_group.full_group_path();
- // create the last component now
+ std::fs::create_dir_all(full_path.parent().ok_or_else(|| {
+ format_err!("could not construct parent path for group {backup_group:?}")
+ })?)?;
+
+ // now create the group, this allows us to check whether it existed before
match std::fs::create_dir(&full_path) {
Ok(_) => {
- let guard = lock_dir_noblock(
- &full_path,
- "backup group",
- "another backup is already running",
- )?;
- self.set_owner(ns, backup_group, auth_id, false)?;
- let owner = self.get_owner(ns, backup_group)?; // just to be sure
+ let guard = backup_group.lock().with_context(|| {
+ format!("while creating new locked backup group '{backup_group:?}'")
+ })?;
+ self.set_owner(ns, backup_group.group(), auth_id, false)?;
+ let owner = self.get_owner(ns, backup_group.group())?; // just to be sure
Ok((owner, guard))
}
Err(ref err) if err.kind() == io::ErrorKind::AlreadyExists => {
- let guard = lock_dir_noblock(
- &full_path,
- "backup group",
- "another backup is already running",
- )?;
- let owner = self.get_owner(ns, backup_group)?; // just to be sure
+ let guard = backup_group.lock().with_context(|| {
+ format!("while creating locked backup group '{backup_group:?}'")
+ })?;
+ let owner = self.get_owner(ns, backup_group.group())?; // just to be sure
Ok((owner, guard))
}
Err(err) => bail!("unable to create backup group {:?} - {}", full_path, err),
@@ -819,29 +813,25 @@ impl DataStore {
///
/// The BackupGroup directory needs to exist.
pub fn create_locked_backup_dir(
- &self,
+ self: &Arc<Self>,
ns: &BackupNamespace,
backup_dir: &pbs_api_types::BackupDir,
) -> Result<(PathBuf, bool, DirLockGuard), Error> {
- let full_path = self.snapshot_path(ns, backup_dir);
- let relative_path = full_path.strip_prefix(self.base_path()).map_err(|err| {
- format_err!(
- "failed to produce correct path for backup {backup_dir} in namespace {ns}: {err}"
- )
- })?;
+ let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
+ let relative_path = backup_dir.relative_path();
- let lock = || {
- lock_dir_noblock(
- &full_path,
- "snapshot",
- "internal error - tried creating snapshot that's already in use",
- )
- };
-
- match std::fs::create_dir(&full_path) {
- Ok(_) => Ok((relative_path.to_owned(), true, lock()?)),
+ match std::fs::create_dir(backup_dir.full_path()) {
+ Ok(_) => {
+ let guard = backup_dir.lock().with_context(|| {
+ format!("while creating new locked snapshot '{backup_dir:?}'")
+ })?;
+ Ok((relative_path, true, guard))
+ }
Err(ref e) if e.kind() == io::ErrorKind::AlreadyExists => {
- Ok((relative_path.to_owned(), false, lock()?))
+ let guard = backup_dir
+ .lock()
+ .with_context(|| format!("while creating locked snapshot '{backup_dir:?}'"))?;
+ Ok((relative_path, false, guard))
}
Err(e) => Err(e.into()),
}
@@ -1305,7 +1295,9 @@ impl DataStore {
bail!("snapshot {} does not exist!", backup_dir.dir());
}
- let _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
+ let _guard = backup_dir.lock().with_context(|| {
+ format!("while updating the protection status of snapshot '{backup_dir:?}'")
+ })?;
let protected_path = backup_dir.protected_file();
if protection {
diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
index 99ae1180..edff19ef 100644
--- a/pbs-datastore/src/snapshot_reader.rs
+++ b/pbs-datastore/src/snapshot_reader.rs
@@ -4,11 +4,9 @@ use std::path::Path;
use std::rc::Rc;
use std::sync::Arc;
-use anyhow::{bail, Error};
+use anyhow::{bail, Context, Error};
use nix::dir::Dir;
-use proxmox_sys::fs::lock_dir_noblock_shared;
-
use pbs_api_types::{
print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
MANIFEST_BLOB_NAME,
@@ -48,8 +46,9 @@ impl SnapshotReader {
bail!("snapshot {} does not exist!", snapshot.dir());
}
- let locked_dir =
- lock_dir_noblock_shared(&snapshot_path, "snapshot", "locked by another operation")?;
+ let locked_dir = snapshot
+ .lock_shared()
+ .with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
let datastore_name = datastore.name().to_string();
let manifest = match snapshot.load_manifest() {
diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index 99d885e2..7943b576 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -1,4 +1,4 @@
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
use nix::dir::Dir;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
@@ -8,7 +8,7 @@ use ::serde::Serialize;
use serde_json::{json, Value};
use proxmox_router::{RpcEnvironment, RpcEnvironmentType};
-use proxmox_sys::fs::{lock_dir_noblock_shared, replace_file, CreateOptions};
+use proxmox_sys::fs::{replace_file, CreateOptions};
use pbs_api_types::Authid;
use pbs_datastore::backup_info::{BackupDir, BackupInfo};
@@ -645,12 +645,12 @@ impl BackupEnvironment {
// Downgrade to shared lock, the backup itself is finished
drop(excl_snap_lock);
- let snap_lock = lock_dir_noblock_shared(
- &self.backup_dir.full_path(),
- "snapshot",
- "snapshot is already locked by another operation",
- )?;
-
+ let snap_lock = self.backup_dir.lock_shared().with_context(|| {
+ format!(
+ "while trying to verify snapshot '{:?}' after completion",
+ self.backup_dir
+ )
+ })?;
let worker_id = format!(
"{}:{}/{}/{:08X}",
self.datastore.name(),
diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
index efc97a1f..344c80d4 100644
--- a/src/api2/backup/mod.rs
+++ b/src/api2/backup/mod.rs
@@ -1,6 +1,6 @@
//! Backup protocol (HTTP2 upgrade)
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
use futures::*;
use hex::FromHex;
use hyper::header::{HeaderValue, CONNECTION, UPGRADE};
@@ -17,7 +17,6 @@ use proxmox_router::{
};
use proxmox_schema::*;
use proxmox_sortable_macro::sortable;
-use proxmox_sys::fs::lock_dir_noblock_shared;
use pbs_api_types::{
ArchiveType, Authid, BackupNamespace, BackupType, Operation, VerifyState,
@@ -186,12 +185,10 @@ fn upgrade_to_backup_protocol(
}
// lock last snapshot to prevent forgetting/pruning it during backup
- let full_path = last.backup_dir.full_path();
- Some(lock_dir_noblock_shared(
- &full_path,
- "snapshot",
- "base snapshot is already locked by another operation",
- )?)
+ let guard = last.backup_dir
+ .lock_shared()
+ .with_context(|| format!("while locking last snapshot during backup '{last:?}'"))?;
+ Some(guard)
} else {
None
};
diff --git a/src/api2/reader/mod.rs b/src/api2/reader/mod.rs
index 1713f182..cc791299 100644
--- a/src/api2/reader/mod.rs
+++ b/src/api2/reader/mod.rs
@@ -1,6 +1,6 @@
//! Backup reader/restore protocol (HTTP2 upgrade)
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
use futures::*;
use hex::FromHex;
use hyper::header::{self, HeaderValue, CONNECTION, UPGRADE};
@@ -16,7 +16,6 @@ use proxmox_router::{
};
use proxmox_schema::{BooleanSchema, ObjectSchema};
use proxmox_sortable_macro::sortable;
-use proxmox_sys::fs::lock_dir_noblock_shared;
use pbs_api_types::{
ArchiveType, Authid, Operation, BACKUP_ARCHIVE_NAME_SCHEMA, BACKUP_ID_SCHEMA,
@@ -129,11 +128,9 @@ fn upgrade_to_backup_reader_protocol(
bail!("snapshot {} does not exist.", backup_dir.dir());
}
- let _guard = lock_dir_noblock_shared(
- &backup_dir.full_path(),
- "snapshot",
- "locked by another operation",
- )?;
+ let _guard = backup_dir
+ .lock_shared()
+ .with_context(|| format!("while reading snapshot '{backup_dir:?}'"))?;
let path = datastore.base_path();
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index f652c262..10a64fda 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -4,10 +4,9 @@ use std::sync::{Arc, Mutex};
use std::time::Instant;
use anyhow::{bail, Error};
-use nix::dir::Dir;
+use proxmox_sys::fs::DirLockGuard;
use tracing::{error, info, warn};
-use proxmox_sys::fs::lock_dir_noblock_shared;
use proxmox_worker_task::WorkerTaskContext;
use pbs_api_types::{
@@ -307,11 +306,8 @@ pub fn verify_backup_dir(
return Ok(true);
}
- let snap_lock = lock_dir_noblock_shared(
- &backup_dir.full_path(),
- "snapshot",
- "locked by another operation",
- );
+ let snap_lock = backup_dir.lock_shared();
+
match snap_lock {
Ok(snap_lock) => {
verify_backup_dir_with_lock(verify_worker, backup_dir, upid, filter, snap_lock)
@@ -334,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
backup_dir: &BackupDir,
upid: UPID,
filter: Option<&dyn Fn(&BackupManifest) -> bool>,
- _snap_lock: Dir,
+ _snap_lock: DirLockGuard,
) -> Result<bool, Error> {
let datastore_name = verify_worker.datastore.name();
let backup_dir_name = backup_dir.dir();
diff --git a/src/server/sync.rs b/src/server/sync.rs
index 4dd46c5a..b6852aff 100644
--- a/src/server/sync.rs
+++ b/src/server/sync.rs
@@ -10,6 +10,7 @@ use std::time::Duration;
use anyhow::{bail, format_err, Context, Error};
use futures::{future::FutureExt, select};
use hyper::http::StatusCode;
+use proxmox_sys::fs::DirLockGuard;
use serde_json::json;
use tracing::{info, warn};
@@ -105,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
}
pub(crate) struct LocalSourceReader {
- pub(crate) _dir_lock: Arc<Mutex<proxmox_sys::fs::DirLockGuard>>,
+ pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
pub(crate) path: PathBuf,
pub(crate) datastore: Arc<DataStore>,
}
@@ -478,13 +479,11 @@ impl SyncSource for LocalSource {
dir: &BackupDir,
) -> Result<Arc<dyn SyncSourceReader>, Error> {
let dir = self.store.backup_dir(ns.clone(), dir.clone())?;
- let dir_lock = proxmox_sys::fs::lock_dir_noblock_shared(
- &dir.full_path(),
- "snapshot",
- "locked by another operation",
- )?;
+ let guard = dir
+ .lock()
+ .with_context(|| format!("while reading snapshot '{dir:?}' for a sync job"))?;
Ok(Arc::new(LocalSourceReader {
- _dir_lock: Arc::new(Mutex::new(dir_lock)),
+ _dir_lock: Arc::new(Mutex::new(guard)),
path: dir.full_path(),
datastore: dir.datastore().clone(),
}))
--
2.39.5
* [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run'
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers Shannon Sterz
@ 2025-03-24 12:51 ` Shannon Sterz
2025-03-25 9:43 ` Christian Ebner
` (2 more replies)
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method Shannon Sterz
` (2 subsequent siblings)
4 siblings, 3 replies; 17+ messages in thread
From: Shannon Sterz @ 2025-03-24 12:51 UTC (permalink / raw)
To: pbs-devel
to avoid issues when removing a group or snapshot directory while two
threads hold a lock on the same directory, move locking to the
tmpfs-backed '/run' directory. this also adds double stat'ing to make
it possible to remove lock files without certain race condition
issues.
this new mechanism is only employed when we can be sure that a reboot
has occurred, so that all processes are using the new locking
mechanism. otherwise, two separate processes could assume they have
exclusive rights to a group or snapshot.
bumps the rust version to 1.81 so we can use `std::fs::exists` without
issue.
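The double-stat procedure can be sketched as follows. This is a hypothetical simplification: the real `lock_helper` takes an actual flock() via `open_backup_lockfile` and compares the inode of the locked fd against a fresh stat of the path; here only the inode comparison is shown, without the flock:

```rust
use std::fs::{self, File};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// After acquiring a lock on an open fd, compare the fd's inode with the
// inode currently at `path`. If another process removed (and possibly
// recreated) the lock file in between, the inodes differ and the lock
// just acquired must be discarded.
fn lock_still_valid(lock: &File, path: &Path) -> bool {
    let fd_ino = match lock.metadata() {
        Ok(st) => st.ino(),
        Err(_) => return false,
    };
    fs::metadata(path).map(|st| st.ino() == fd_ino).unwrap_or(false)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("double-stat-demo");
    fs::create_dir_all(&dir)?;
    let path = dir.join("lock");

    let lock = File::create(&path)?;
    assert!(lock_still_valid(&lock, &path)); // normal case: same inode

    // simulate a concurrent deletion between open and the second stat
    fs::remove_file(&path)?;
    assert!(!lock_still_valid(&lock, &path));

    println!("ok");
    Ok(())
}
```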
Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
---
Cargo.toml | 2 +-
debian/postinst | 5 +
pbs-config/src/lib.rs | 32 ++++-
pbs-datastore/Cargo.toml | 1 +
pbs-datastore/src/backup_info.rs | 196 +++++++++++++++++++++++----
pbs-datastore/src/datastore.rs | 6 +-
pbs-datastore/src/snapshot_reader.rs | 17 ++-
src/api2/backup/environment.rs | 5 +-
src/backup/verify.rs | 4 +-
src/server/sync.rs | 4 +-
10 files changed, 230 insertions(+), 42 deletions(-)
diff --git a/Cargo.toml b/Cargo.toml
index 306d50e9..645f2d81 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -13,7 +13,7 @@ authors = [
edition = "2021"
license = "AGPL-3"
repository = "https://git.proxmox.com/?p=proxmox-backup.git"
-rust-version = "1.80"
+rust-version = "1.81"
[package]
name = "proxmox-backup"
diff --git a/debian/postinst b/debian/postinst
index 0485ca34..81c15826 100644
--- a/debian/postinst
+++ b/debian/postinst
@@ -80,6 +80,11 @@ EOF
update_sync_job "$prev_job"
fi
fi
+
+ if dpkg --compare-versions "$2" 'lt' '3.3.4~'; then
+ # ensure old locking is used by the daemon until a reboot happened
+ touch "/run/proxmox-backup/old-locking"
+ fi
fi
;;
diff --git a/pbs-config/src/lib.rs b/pbs-config/src/lib.rs
index 20a8238d..9c4d77c2 100644
--- a/pbs-config/src/lib.rs
+++ b/pbs-config/src/lib.rs
@@ -22,6 +22,8 @@ pub use config_version_cache::ConfigVersionCache;
use anyhow::{format_err, Error};
use nix::unistd::{Gid, Group, Uid, User};
+use proxmox_sys::fs::DirLockGuard;
+use std::os::unix::prelude::AsRawFd;
pub use pbs_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
@@ -46,13 +48,34 @@ pub fn backup_group() -> Result<nix::unistd::Group, Error> {
}
pub struct BackupLockGuard {
- _file: Option<std::fs::File>,
+ file: Option<std::fs::File>,
+ // TODO: Remove `_legacy_dir` with PBS 5
+ _legacy_dir: Option<DirLockGuard>,
+}
+
+impl AsRawFd for BackupLockGuard {
+ fn as_raw_fd(&self) -> i32 {
+ self.file.as_ref().map_or(-1, |f| f.as_raw_fd())
+ }
+}
+
+// TODO: Remove with PBS 5
+impl From<DirLockGuard> for BackupLockGuard {
+ fn from(value: DirLockGuard) -> Self {
+ Self {
+ file: None,
+ _legacy_dir: Some(value),
+ }
+ }
}
#[doc(hidden)]
/// Note: do not use for production code, this is only intended for tests
pub unsafe fn create_mocked_lock() -> BackupLockGuard {
- BackupLockGuard { _file: None }
+ BackupLockGuard {
+ file: None,
+ _legacy_dir: None,
+ }
}
/// Open or create a lock file owned by user "backup" and lock it.
@@ -76,7 +99,10 @@ pub fn open_backup_lockfile<P: AsRef<std::path::Path>>(
let timeout = timeout.unwrap_or(std::time::Duration::new(10, 0));
let file = proxmox_sys::fs::open_file_locked(&path, timeout, exclusive, options)?;
- Ok(BackupLockGuard { _file: Some(file) })
+ Ok(BackupLockGuard {
+ file: Some(file),
+ _legacy_dir: None,
+ })
}
/// Atomically write data to file owned by "root:backup" with permission "0640"
diff --git a/pbs-datastore/Cargo.toml b/pbs-datastore/Cargo.toml
index 4ebc5fdc..7623adc2 100644
--- a/pbs-datastore/Cargo.toml
+++ b/pbs-datastore/Cargo.toml
@@ -35,6 +35,7 @@ proxmox-lang.workspace=true
proxmox-schema = { workspace = true, features = [ "api-macro" ] }
proxmox-serde = { workspace = true, features = [ "serde_json" ] }
proxmox-sys.workspace = true
+proxmox-systemd.workspace = true
proxmox-time.workspace = true
proxmox-uuid.workspace = true
proxmox-worker-task.workspace = true
diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index c7f7ed53..b79e8196 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -1,13 +1,15 @@
use std::fmt;
-use std::os::unix::io::RawFd;
+use std::os::unix::io::{AsRawFd, RawFd};
+use std::os::unix::prelude::OsStrExt;
+use std::path::Path;
use std::path::PathBuf;
-use std::sync::Arc;
+use std::sync::{Arc, LazyLock};
+use std::time::Duration;
use anyhow::{bail, format_err, Context, Error};
-use proxmox_sys::fs::{
- lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
-};
+use proxmox_sys::fs::{lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions};
+use proxmox_systemd::escape_unit;
use pbs_api_types::{
Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
@@ -18,6 +20,18 @@ use pbs_config::{open_backup_lockfile, BackupLockGuard};
use crate::manifest::{BackupManifest, MANIFEST_LOCK_NAME};
use crate::{DataBlob, DataStore};
+pub const DATASTORE_LOCKS_DIR: &str = "/run/proxmox-backup/locks";
+
+// TODO: Remove with PBS 5
+// Note: The `expect()` call here will only happen if we can neither confirm nor deny the existence
+// of the file. this should only happen if a user messes with the `/run/proxmox-backup` directory.
+// if that happens, a lot more should fail as we rely on the existence of the directory throughout
+// the code. so just panic with a reasonable message.
+static OLD_LOCKING: LazyLock<bool> = LazyLock::new(|| {
+ std::fs::exists("/run/proxmox-backup/old-locking")
+ .expect("cannot read `/run/proxmox-backup`, please check permissions")
+});
+
/// BackupGroup is a directory containing a list of BackupDir
#[derive(Clone)]
pub struct BackupGroup {
@@ -225,6 +239,7 @@ impl BackupGroup {
delete_stats.increment_removed_groups();
}
+ let _ = std::fs::remove_file(self.lock_path());
Ok(delete_stats)
}
@@ -241,13 +256,34 @@ impl BackupGroup {
.set_owner(&self.ns, self.as_ref(), auth_id, force)
}
- /// Lock a group exclusively
- pub fn lock(&self) -> Result<DirLockGuard, Error> {
- lock_dir_noblock(
- &self.full_group_path(),
- "backup group",
- "possible running backup",
- )
+ /// Returns a file name for locking a group.
+ ///
+ /// The lock file will be located in:
+ /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
+ /// where `rpath` is the relative path of the group.
+ fn lock_path(&self) -> PathBuf {
+ let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
+
+ let rpath = Path::new(self.group.ty.as_str()).join(&self.group.id);
+
+ path.join(lock_file_path_helper(&self.ns, rpath))
+ }
+
+ /// Locks a group exclusively.
+ pub fn lock(&self) -> Result<BackupLockGuard, Error> {
+ if *OLD_LOCKING {
+ lock_dir_noblock(
+ &self.full_group_path(),
+ "backup group",
+ "possible running backup, group is in use",
+ )
+ .map(BackupLockGuard::from)
+ } else {
+ lock_helper(self.store.name(), &self.lock_path(), |p| {
+ open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
+ .map_err(|err| format_err!("unable to acquire backup group lock {p:?} - {err}"))
+ })
+ }
}
}
@@ -454,22 +490,54 @@ impl BackupDir {
.map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
}
- /// Lock this snapshot exclusively
- pub fn lock(&self) -> Result<DirLockGuard, Error> {
- lock_dir_noblock(
- &self.full_path(),
- "snapshot",
- "backup is running or snapshot is in use",
- )
+ /// Returns a file name for locking a snapshot.
+ ///
+ /// The lock file will be located in:
+ /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
+ /// where `rpath` is the relative path of the snapshot.
+ fn lock_path(&self) -> PathBuf {
+ let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
+
+ let rpath = Path::new(self.dir.group.ty.as_str())
+ .join(&self.dir.group.id)
+ .join(&self.backup_time_string);
+
+ path.join(lock_file_path_helper(&self.ns, rpath))
}
- /// Acquire a shared lock on this snapshot
- pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
- lock_dir_noblock_shared(
- &self.full_path(),
- "snapshot",
- "backup is running or snapshot is in use, could not acquire shared lock",
- )
+ /// Locks a snapshot exclusively.
+ pub fn lock(&self) -> Result<BackupLockGuard, Error> {
+ if *OLD_LOCKING {
+ lock_dir_noblock(
+ &self.full_path(),
+ "snapshot",
+ "backup is running or snapshot is in use",
+ )
+ .map(BackupLockGuard::from)
+ } else {
+ lock_helper(self.store.name(), &self.lock_path(), |p| {
+ open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
+ .map_err(|err| format_err!("unable to acquire snapshot lock {p:?} - {err}"))
+ })
+ }
+ }
+
+ /// Acquires a shared lock on a snapshot.
+ pub fn lock_shared(&self) -> Result<BackupLockGuard, Error> {
+ if *OLD_LOCKING {
+ lock_dir_noblock_shared(
+ &self.full_path(),
+ "snapshot",
+ "backup is running or snapshot is in use, could not acquire shared lock",
+ )
+ .map(BackupLockGuard::from)
+ } else {
+ lock_helper(self.store.name(), &self.lock_path(), |p| {
+ open_backup_lockfile(p, Some(Duration::from_secs(0)), false).map_err(|err| {
+ format_err!("unable to acquire shared snapshot lock {p:?} - {err}")
+ })
+ })
+ }
}
/// Destroy the whole snapshot, bails if it's protected
@@ -494,11 +562,13 @@ impl BackupDir {
format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
})?;
- // the manifest doesn't exist anymore, no need to keep the lock (already done by guard?)
+ // remove no longer needed lock files
if let Ok(path) = self.manifest_lock_path() {
let _ = std::fs::remove_file(path); // ignore errors
}
+ let _ = std::fs::remove_file(self.lock_path()); // ignore errors
+
Ok(())
}
@@ -692,3 +762,75 @@ fn list_backup_files<P: ?Sized + nix::NixPath>(
Ok(files)
}
+
+/// Creates a path to a lock file depending on the relative path of an object (snapshot, group,
+/// manifest) in a datastore. First all namespaces will be concatenated with a colon (ns-folder).
+/// Then the actual file name will depend on the length of the relative path without namespaces. If
+/// it is shorter than 255 characters in its unit encoded form, than the unit encoded form will be
+/// used directly. If not, the file name will consist of the first 80 character, the last 80
+/// characters and the hash of the unit encoded relative path without namespaces. It will also be
+/// placed into a "hashed" subfolder in the namespace folder.
+///
+/// Examples:
+///
+/// - vm-100
+/// - vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
+/// - ns1:ns2:ns3:ns4:ns5:ns6:ns7/vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
+///
+/// A "hashed" lock file would look like this:
+/// - ns1:ns2:ns3/hashed/$first_eighty...$last_eighty-$hash
+fn lock_file_path_helper(ns: &BackupNamespace, path: PathBuf) -> PathBuf {
+ let to_return = PathBuf::from(
+ ns.components()
+ .map(String::from)
+ .reduce(|acc, n| format!("{acc}:{n}"))
+ .unwrap_or_default(),
+ );
+
+ let path_bytes = path.as_os_str().as_bytes();
+
+ let enc = escape_unit(path_bytes, true);
+
+ if enc.len() < 255 {
+ return to_return.join(enc);
+ }
+
+ let to_return = to_return.join("hashed");
+
+ let first_eighty = &enc[..80];
+ let last_eighty = &enc[enc.len() - 80..];
+ let hash = hex::encode(openssl::sha::sha256(path_bytes));
+
+ to_return.join(format!("{first_eighty}...{last_eighty}-{hash}"))
+}
+
+/// Helps implement the double stat'ing procedure. It avoids certain race conditions upon lock
+/// deletion.
+///
+/// It also creates the base directory for lock files.
+fn lock_helper<F>(
+ store_name: &str,
+ path: &std::path::Path,
+ lock_fn: F,
+) -> Result<BackupLockGuard, Error>
+where
+ F: Fn(&std::path::Path) -> Result<BackupLockGuard, Error>,
+{
+ let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(store_name);
+
+ if let Some(parent) = path.parent() {
+ lock_dir = lock_dir.join(parent);
+ };
+
+ std::fs::create_dir_all(&lock_dir)?;
+
+ let lock = lock_fn(path)?;
+
+ let stat = nix::sys::stat::fstat(lock.as_raw_fd())?.st_ino;
+
+ if nix::sys::stat::stat(path).map_or(true, |st| stat != st.st_ino) {
+ bail!("could not acquire lock, another thread modified the lock file");
+ }
+
+ Ok(lock)
+}
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 1cbd0038..1e6157c0 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -13,7 +13,6 @@ use proxmox_human_byte::HumanByte;
use proxmox_schema::ApiType;
use proxmox_sys::error::SysError;
-use proxmox_sys::fs::DirLockGuard;
use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
use proxmox_sys::linux::procfs::MountInfo;
use proxmox_sys::process_locker::ProcessLockSharedGuard;
@@ -24,6 +23,7 @@ use pbs_api_types::{
DataStoreConfig, DatastoreFSyncLevel, DatastoreTuning, GarbageCollectionStatus,
MaintenanceMode, MaintenanceType, Operation, UPID,
};
+use pbs_config::BackupLockGuard;
use crate::backup_info::{BackupDir, BackupGroup};
use crate::chunk_store::ChunkStore;
@@ -778,7 +778,7 @@ impl DataStore {
ns: &BackupNamespace,
backup_group: &pbs_api_types::BackupGroup,
auth_id: &Authid,
- ) -> Result<(Authid, DirLockGuard), Error> {
+ ) -> Result<(Authid, BackupLockGuard), Error> {
let backup_group = self.backup_group(ns.clone(), backup_group.clone());
// create intermediate path first
@@ -816,7 +816,7 @@ impl DataStore {
self: &Arc<Self>,
ns: &BackupNamespace,
backup_dir: &pbs_api_types::BackupDir,
- ) -> Result<(PathBuf, bool, DirLockGuard), Error> {
+ ) -> Result<(PathBuf, bool, BackupLockGuard), Error> {
let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
let relative_path = backup_dir.relative_path();
diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
index edff19ef..20fd388a 100644
--- a/pbs-datastore/src/snapshot_reader.rs
+++ b/pbs-datastore/src/snapshot_reader.rs
@@ -4,8 +4,11 @@ use std::path::Path;
use std::rc::Rc;
use std::sync::Arc;
-use anyhow::{bail, Context, Error};
+use anyhow::{bail, format_err, Context, Error};
use nix::dir::Dir;
+use nix::fcntl::OFlag;
+use nix::sys::stat::Mode;
+use pbs_config::BackupLockGuard;
use pbs_api_types::{
print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
@@ -26,6 +29,10 @@ pub struct SnapshotReader {
datastore_name: String,
file_list: Vec<String>,
locked_dir: Dir,
+
+ // while this is never read, the lock needs to be kept until the
+ // reader is dropped to ensure valid locking semantics
+ _lock: BackupLockGuard,
}
impl SnapshotReader {
@@ -46,10 +53,15 @@ impl SnapshotReader {
bail!("snapshot {} does not exist!", snapshot.dir());
}
- let locked_dir = snapshot
+ let lock = snapshot
.lock_shared()
.with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
+ let locked_dir =
+ Dir::open(&snapshot_path, OFlag::O_RDONLY, Mode::empty()).map_err(|err| {
+ format_err!("unable to open snapshot directory {snapshot_path:?} - {err}")
+ })?;
+
let datastore_name = datastore.name().to_string();
let manifest = match snapshot.load_manifest() {
Ok((manifest, _)) => manifest,
@@ -79,6 +91,7 @@ impl SnapshotReader {
datastore_name,
file_list,
locked_dir,
+ _lock: lock,
})
}
diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index 7943b576..3d541b46 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -1,5 +1,6 @@
use anyhow::{bail, format_err, Context, Error};
-use nix::dir::Dir;
+use pbs_config::BackupLockGuard;
+
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tracing::info;
@@ -635,7 +636,7 @@ impl BackupEnvironment {
/// If verify-new is set on the datastore, this will run a new verify task
/// for the backup. If not, this will return and also drop the passed lock
/// immediately.
- pub fn verify_after_complete(&self, excl_snap_lock: Dir) -> Result<(), Error> {
+ pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
self.ensure_finished()?;
if !self.datastore.verify_new() {
diff --git a/src/backup/verify.rs b/src/backup/verify.rs
index 10a64fda..3d2cba8a 100644
--- a/src/backup/verify.rs
+++ b/src/backup/verify.rs
@@ -1,10 +1,10 @@
+use pbs_config::BackupLockGuard;
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Instant;
use anyhow::{bail, Error};
-use proxmox_sys::fs::DirLockGuard;
use tracing::{error, info, warn};
use proxmox_worker_task::WorkerTaskContext;
@@ -330,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
backup_dir: &BackupDir,
upid: UPID,
filter: Option<&dyn Fn(&BackupManifest) -> bool>,
- _snap_lock: DirLockGuard,
+ _snap_lock: BackupLockGuard,
) -> Result<bool, Error> {
let datastore_name = verify_worker.datastore.name();
let backup_dir_name = backup_dir.dir();
diff --git a/src/server/sync.rs b/src/server/sync.rs
index b6852aff..10804b14 100644
--- a/src/server/sync.rs
+++ b/src/server/sync.rs
@@ -10,7 +10,7 @@ use std::time::Duration;
use anyhow::{bail, format_err, Context, Error};
use futures::{future::FutureExt, select};
use hyper::http::StatusCode;
-use proxmox_sys::fs::DirLockGuard;
+use pbs_config::BackupLockGuard;
use serde_json::json;
use tracing::{info, warn};
@@ -106,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
}
pub(crate) struct LocalSourceReader {
- pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
+ pub(crate) _dir_lock: Arc<Mutex<BackupLockGuard>>,
pub(crate) path: PathBuf,
pub(crate) datastore: Arc<DataStore>,
}
--
2.39.5
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 17+ messages in thread
* [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers Shannon Sterz
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
@ 2025-03-24 12:51 ` Shannon Sterz
2025-03-25 9:44 ` Christian Ebner
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner Shannon Sterz
2025-03-25 11:26 ` [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Wolfgang Bumiller
4 siblings, 1 reply; 17+ messages in thread
From: Shannon Sterz @ 2025-03-24 12:51 UTC (permalink / raw)
To: pbs-devel
adds double stat'ing and removes the directory hierarchy to bring
manifest locking in line with other locks used by the BackupDir trait.
if the old locking mechanism is still supposed to be used, this still
falls back to the previous lock file. however, we already add double
stat'ing since it is trivial to do here and should only provide better
safety when it comes to removing locks.
Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
---
pbs-datastore/src/backup_info.rs | 45 ++++++++++++++++++++------------
1 file changed, 29 insertions(+), 16 deletions(-)
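The double stat'ing this patch adds can be sketched with the standard library alone; this is an illustrative simplification (the series itself uses `open_backup_lockfile` plus nix's `fstat`/`stat`, and `acquire_checked` is a hypothetical name, not PBS API). After opening the lock file, the inode of the held file descriptor is compared with a fresh stat of the path: if another process unlinked and recreated the file in between, the inodes differ and the lock must not be trusted.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Hypothetical helper demonstrating the double-stat idea with std only.
fn acquire_checked(path: &Path) -> io::Result<Option<File>> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // in the real code an flock() would be taken on `file` at this point
    let fd_ino = file.metadata()?.ino(); // inode we actually hold open
    let path_ino = fs::metadata(path)?.ino(); // inode the path points to now
    if fd_ino != path_ino {
        return Ok(None); // lost a race with a delete + recreate, caller retries
    }
    Ok(Some(file))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("double-stat-demo.lck");
    let _ = fs::remove_file(&path);
    let lock = acquire_checked(&path)?;
    println!("lock acquired: {}", lock.is_some());
    let _ = fs::remove_file(&path);
    Ok(())
}
```

The check works because an unlinked inode cannot be reused while a file descriptor still holds it open, so a delete-and-recreate race always produces a differing inode number.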
diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
index b79e8196..0602867a 100644
--- a/pbs-datastore/src/backup_info.rs
+++ b/pbs-datastore/src/backup_info.rs
@@ -469,25 +469,41 @@ impl BackupDir {
/// Returns the filename to lock a manifest
///
/// Also creates the basedir. The lockfile is located in
- /// '/run/proxmox-backup/locks/{datastore}/[ns/{ns}/]+{type}/{id}/{timestamp}.index.json.lck'
- fn manifest_lock_path(&self) -> Result<PathBuf, Error> {
- let mut path = PathBuf::from(&format!("/run/proxmox-backup/locks/{}", self.store.name()));
- path.push(self.relative_path());
+ /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}.index.json.lck`
+ /// where rpath is the relative path of the snapshot.
+ fn manifest_lock_path(&self) -> PathBuf {
+ let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
- std::fs::create_dir_all(&path)?;
- let ts = self.backup_time_string();
- path.push(format!("{ts}{MANIFEST_LOCK_NAME}"));
+ let rpath = Path::new(self.dir.group.ty.as_str())
+ .join(&self.dir.group.id)
+ .join(&self.backup_time_string)
+ .join(MANIFEST_LOCK_NAME);
- Ok(path)
+ path.join(lock_file_path_helper(&self.ns, rpath))
}
/// Locks the manifest of a snapshot, for example, to update or delete it.
pub(crate) fn lock_manifest(&self) -> Result<BackupLockGuard, Error> {
- let path = self.manifest_lock_path()?;
+ let path = if *OLD_LOCKING {
+ // old manifest lock path
+ let path = Path::new(DATASTORE_LOCKS_DIR)
+ .join(self.store.name())
+ .join(self.relative_path());
- // actions locking the manifest should be relatively short, only wait a few seconds
- open_backup_lockfile(&path, Some(std::time::Duration::from_secs(5)), true)
- .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
+ std::fs::create_dir_all(&path)?;
+
+ path.join(format!("{}{MANIFEST_LOCK_NAME}", self.backup_time_string()))
+ } else {
+ self.manifest_lock_path()
+ };
+
+ lock_helper(self.store.name(), &path, |p| {
+ // update_manifest should never take a long time, so if
+ // someone else has the lock we can simply block a bit
+ // and should get it soon
+ open_backup_lockfile(p, Some(Duration::from_secs(5)), true)
+ .map_err(|err| format_err!("unable to acquire manifest lock {p:?} - {err}"))
+ })
}
/// Returns a file name for locking a snapshot.
@@ -563,10 +579,7 @@ impl BackupDir {
})?;
// remove no longer needed lock files
- if let Ok(path) = self.manifest_lock_path() {
- let _ = std::fs::remove_file(path); // ignore errors
- }
-
+ let _ = std::fs::remove_file(self.manifest_lock_path()); // ignore errors
let _ = std::fs::remove_file(self.lock_path()); // ignore errors
Ok(())
--
2.39.5
* [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
` (2 preceding siblings ...)
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method Shannon Sterz
@ 2025-03-24 12:51 ` Shannon Sterz
2025-03-25 10:00 ` Christian Ebner
2025-03-25 11:26 ` [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Wolfgang Bumiller
4 siblings, 1 reply; 17+ messages in thread
From: Shannon Sterz @ 2025-03-24 12:51 UTC (permalink / raw)
To: pbs-devel
when two clients change the owner of a backup group at the same time, a
race condition can arise. add locking and re-check the owner under the
lock to avoid this.
Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
---
src/api2/admin/datastore.rs | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
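The fix follows a check-lock-recheck pattern: read the owner, run the permission checks, then take the exclusive group lock and verify the owner is unchanged before writing. A minimal sketch with illustrative types only (`Group` and `set_owner_checked` are not the actual PBS API, and the real lock is a lock file obtained via `BackupGroup::lock()`, not an in-process mutex):

```rust
use std::sync::Mutex;

// Illustrative stand-in for a backup group with an owner.
struct Group {
    owner: Mutex<String>,
}

impl Group {
    // Permission checks run against `expected` before this is called;
    // under the lock we verify the owner did not change in the meantime.
    fn set_owner_checked(&self, expected: &str, new_owner: &str) -> Result<(), String> {
        let mut owner = self.owner.lock().unwrap(); // take the lock first
        if *owner != expected {
            return Err(format!("{expected} does not own this group anymore"));
        }
        *owner = new_owner.to_string();
        Ok(())
    }
}

fn main() {
    let group = Group {
        owner: Mutex::new("user@pbs".to_string()),
    };
    println!("{:?}", group.set_owner_checked("user@pbs", "other@pbs"));
}
```

Without the recheck, a second client could change the owner between the permission check and the write, silently discarding one of the two updates.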
diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
index 483e595c..3e532636 100644
--- a/src/api2/admin/datastore.rs
+++ b/src/api2/admin/datastore.rs
@@ -7,7 +7,7 @@ use std::os::unix::ffi::OsStrExt;
use std::path::{Path, PathBuf};
use std::sync::Arc;
-use anyhow::{bail, format_err, Error};
+use anyhow::{bail, format_err, Context, Error};
use futures::*;
use hyper::http::request::Parts;
use hyper::{header, Body, Response, StatusCode};
@@ -2347,10 +2347,9 @@ pub async fn set_backup_owner(
let datastore = DataStore::lookup_datastore(&store, Some(Operation::Write))?;
let backup_group = datastore.backup_group(ns, backup_group);
+ let owner = backup_group.get_owner()?;
if owner_check_required {
- let owner = backup_group.get_owner()?;
-
let allowed = match (owner.is_token(), new_owner.is_token()) {
(true, true) => {
// API token to API token, owned by same user
@@ -2397,6 +2396,14 @@ pub async fn set_backup_owner(
);
}
+ let _guard = backup_group
+ .lock()
+ .with_context(|| format!("while setting the owner of group '{backup_group:?}'"))?;
+
+ if owner != backup_group.get_owner()? {
+ bail!("{owner} does not own this group anymore");
+ }
+
backup_group.set_owner(&new_owner, true)?;
Ok(())
--
2.39.5
* Re: [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers Shannon Sterz
@ 2025-03-25 9:35 ` Christian Ebner
2025-03-25 9:57 ` Shannon Sterz
0 siblings, 1 reply; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 9:35 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Shannon Sterz
some small nits inline
On 3/24/25 13:51, Shannon Sterz wrote:
> to avoid duplicate code, add helpers for locking groups and snapshots
> to the BackupGroup and BackupDir traits respectively and refactor
> existing code to use them.
>
> this also adapts error handling by adding relevant context to each
> locking helper call site. otherwise, we might lose valuable
> information useful for debugging. note, however, that code that
> relied on specific error messages will break.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> pbs-datastore/src/backup_info.rs | 45 ++++++++++++---
> pbs-datastore/src/datastore.rs | 82 +++++++++++++---------------
> pbs-datastore/src/snapshot_reader.rs | 9 ++-
> src/api2/backup/environment.rs | 16 +++---
> src/api2/backup/mod.rs | 13 ++---
> src/api2/reader/mod.rs | 11 ++--
> src/backup/verify.rs | 12 ++--
> src/server/sync.rs | 13 ++---
> 8 files changed, 106 insertions(+), 95 deletions(-)
>
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index ef2b982c..c7f7ed53 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -3,9 +3,11 @@ use std::os::unix::io::RawFd;
> use std::path::PathBuf;
> use std::sync::Arc;
>
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
>
> -use proxmox_sys::fs::{lock_dir_noblock, replace_file, CreateOptions};
> +use proxmox_sys::fs::{
> + lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
> +};
>
> use pbs_api_types::{
> Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
> @@ -199,9 +201,10 @@ impl BackupGroup {
> /// Returns `BackupGroupDeleteStats`, containing the number of deleted snapshots
> /// and number of protected snapshots, which therefore were not removed.
> pub fn destroy(&self) -> Result<BackupGroupDeleteStats, Error> {
> + let _guard = self
> + .lock()
> + .with_context(|| format!("while destroying group '{self:?}'"))?;
> let path = self.full_group_path();
> - let _guard =
> - proxmox_sys::fs::lock_dir_noblock(&path, "backup group", "possible running backup")?;
>
> log::info!("removing backup group {:?}", path);
> let mut delete_stats = BackupGroupDeleteStats::default();
> @@ -237,6 +240,15 @@ impl BackupGroup {
> self.store
> .set_owner(&self.ns, self.as_ref(), auth_id, force)
> }
> +
> + /// Lock a group exclusively
> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
> + lock_dir_noblock(
> + &self.full_group_path(),
this allocates the group path...
> + "backup group",
> + "possible running backup",
> + )
> + }
> }
>
> impl AsRef<pbs_api_types::BackupNamespace> for BackupGroup {
> @@ -442,15 +454,33 @@ impl BackupDir {
> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
> }
>
> + /// Lock this snapshot exclusively
> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
> + lock_dir_noblock(
> + &self.full_path(),
... this allocates the snapshot path ...
> + "snapshot",
> + "backup is running or snapshot is in use",
> + )
> + }
> +
> + /// Acquire a shared lock on this snapshot
> + pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
> + lock_dir_noblock_shared(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use, could not acquire shared lock",
> + )
> + }
> +
> /// Destroy the whole snapshot, bails if it's protected
> ///
> /// Setting `force` to true skips locking and thus ignores if the backup is currently in use.
> pub fn destroy(&self, force: bool) -> Result<(), Error> {
> - let full_path = self.full_path();
> -
> let (_guard, _manifest_guard);
> if !force {
> - _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
> + _guard = self
> + .lock()
> + .with_context(|| format!("while destroying snapshot '{self:?}'"))?;
> _manifest_guard = self.lock_manifest()?;
> }
>
> @@ -458,6 +488,7 @@ impl BackupDir {
> bail!("cannot remove protected snapshot"); // use special error type?
> }
>
> + let full_path = self.full_path();
... and here it reallocates the snapshot path again.
> log::info!("removing backup snapshot {:?}", full_path);
> std::fs::remove_dir_all(&full_path).map_err(|err| {
> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index a6a91ca7..1cbd0038 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -5,7 +5,7 @@ use std::os::unix::io::AsRawFd;
> use std::path::{Path, PathBuf};
> use std::sync::{Arc, LazyLock, Mutex};
>
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use nix::unistd::{unlinkat, UnlinkatFlags};
> use tracing::{info, warn};
>
> @@ -13,8 +13,8 @@ use proxmox_human_byte::HumanByte;
> use proxmox_schema::ApiType;
>
> use proxmox_sys::error::SysError;
> +use proxmox_sys::fs::DirLockGuard;
> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
> -use proxmox_sys::fs::{lock_dir_noblock, DirLockGuard};
> use proxmox_sys::linux::procfs::MountInfo;
> use proxmox_sys::process_locker::ProcessLockSharedGuard;
> use proxmox_worker_task::WorkerTaskContext;
> @@ -774,41 +774,35 @@ impl DataStore {
> ///
> /// This also acquires an exclusive lock on the directory and returns the lock guard.
> pub fn create_locked_backup_group(
> - &self,
> + self: &Arc<Self>,
> ns: &BackupNamespace,
> backup_group: &pbs_api_types::BackupGroup,
> auth_id: &Authid,
> ) -> Result<(Authid, DirLockGuard), Error> {
> - // create intermediate path first:
> - let mut full_path = self.base_path();
> - for ns in ns.components() {
> - full_path.push("ns");
> - full_path.push(ns);
> - }
> - full_path.push(backup_group.ty.as_str());
> - std::fs::create_dir_all(&full_path)?;
> + let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>
> - full_path.push(&backup_group.id);
> + // create intermediate path first
> + let full_path = backup_group.full_group_path();
... same here, this reallocates the group path.
I see that this will change with the new locking mechanism in the
following patches, but might it make sense to also generate the
respective paths together with the locking in order to avoid the double
allocation?
However, this is not that relevant since it will get phased out once
fully switched to the new locking mechanism anyway.
>
> - // create the last component now
> + std::fs::create_dir_all(full_path.parent().ok_or_else(|| {
> + format_err!("could not construct parent path for group {backup_group:?}")
> + })?)?;
> +
> + // now create the group, this allows us to check whether it existed before
> match std::fs::create_dir(&full_path) {
> Ok(_) => {
> - let guard = lock_dir_noblock(
> - &full_path,
> - "backup group",
> - "another backup is already running",
> - )?;
> - self.set_owner(ns, backup_group, auth_id, false)?;
> - let owner = self.get_owner(ns, backup_group)?; // just to be sure
> + let guard = backup_group.lock().with_context(|| {
> + format!("while creating new locked backup group '{backup_group:?}'")
> + })?;
> + self.set_owner(ns, backup_group.group(), auth_id, false)?;
> + let owner = self.get_owner(ns, backup_group.group())?; // just to be sure
> Ok((owner, guard))
> }
> Err(ref err) if err.kind() == io::ErrorKind::AlreadyExists => {
> - let guard = lock_dir_noblock(
> - &full_path,
> - "backup group",
> - "another backup is already running",
> - )?;
> - let owner = self.get_owner(ns, backup_group)?; // just to be sure
> + let guard = backup_group.lock().with_context(|| {
> + format!("while creating locked backup group '{backup_group:?}'")
> + })?;
> + let owner = self.get_owner(ns, backup_group.group())?; // just to be sure
> Ok((owner, guard))
> }
> Err(err) => bail!("unable to create backup group {:?} - {}", full_path, err),
> @@ -819,29 +813,25 @@ impl DataStore {
> ///
> /// The BackupGroup directory needs to exist.
> pub fn create_locked_backup_dir(
> - &self,
> + self: &Arc<Self>,
> ns: &BackupNamespace,
> backup_dir: &pbs_api_types::BackupDir,
> ) -> Result<(PathBuf, bool, DirLockGuard), Error> {
> - let full_path = self.snapshot_path(ns, backup_dir);
> - let relative_path = full_path.strip_prefix(self.base_path()).map_err(|err| {
> - format_err!(
> - "failed to produce correct path for backup {backup_dir} in namespace {ns}: {err}"
> - )
> - })?;
> + let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
> + let relative_path = backup_dir.relative_path();
>
> - let lock = || {
> - lock_dir_noblock(
> - &full_path,
> - "snapshot",
> - "internal error - tried creating snapshot that's already in use",
> - )
> - };
> -
> - match std::fs::create_dir(&full_path) {
> - Ok(_) => Ok((relative_path.to_owned(), true, lock()?)),
> + match std::fs::create_dir(backup_dir.full_path()) {
> + Ok(_) => {
> + let guard = backup_dir.lock().with_context(|| {
> + format!("while creating new locked snapshot '{backup_dir:?}'")
> + })?;
> + Ok((relative_path, true, guard))
> + }
> Err(ref e) if e.kind() == io::ErrorKind::AlreadyExists => {
> - Ok((relative_path.to_owned(), false, lock()?))
> + let guard = backup_dir
> + .lock()
> + .with_context(|| format!("while creating locked snapshot '{backup_dir:?}'"))?;
> + Ok((relative_path, false, guard))
> }
> Err(e) => Err(e.into()),
> }
> @@ -1305,7 +1295,9 @@ impl DataStore {
> bail!("snapshot {} does not exist!", backup_dir.dir());
> }
>
> - let _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
> + let _guard = backup_dir.lock().with_context(|| {
> + format!("while updating the protection status of snapshot '{backup_dir:?}'")
> + })?;
>
> let protected_path = backup_dir.protected_file();
> if protection {
> diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
> index 99ae1180..edff19ef 100644
> --- a/pbs-datastore/src/snapshot_reader.rs
> +++ b/pbs-datastore/src/snapshot_reader.rs
> @@ -4,11 +4,9 @@ use std::path::Path;
> use std::rc::Rc;
> use std::sync::Arc;
>
> -use anyhow::{bail, Error};
> +use anyhow::{bail, Context, Error};
> use nix::dir::Dir;
>
> -use proxmox_sys::fs::lock_dir_noblock_shared;
> -
> use pbs_api_types::{
> print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
> MANIFEST_BLOB_NAME,
> @@ -48,8 +46,9 @@ impl SnapshotReader {
> bail!("snapshot {} does not exist!", snapshot.dir());
> }
>
> - let locked_dir =
> - lock_dir_noblock_shared(&snapshot_path, "snapshot", "locked by another operation")?;
> + let locked_dir = snapshot
> + .lock_shared()
> + .with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
>
> let datastore_name = datastore.name().to_string();
> let manifest = match snapshot.load_manifest() {
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index 99d885e2..7943b576 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -1,4 +1,4 @@
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use nix::dir::Dir;
> use std::collections::HashMap;
> use std::sync::{Arc, Mutex};
> @@ -8,7 +8,7 @@ use ::serde::Serialize;
> use serde_json::{json, Value};
>
> use proxmox_router::{RpcEnvironment, RpcEnvironmentType};
> -use proxmox_sys::fs::{lock_dir_noblock_shared, replace_file, CreateOptions};
> +use proxmox_sys::fs::{replace_file, CreateOptions};
>
> use pbs_api_types::Authid;
> use pbs_datastore::backup_info::{BackupDir, BackupInfo};
> @@ -645,12 +645,12 @@ impl BackupEnvironment {
>
> // Downgrade to shared lock, the backup itself is finished
> drop(excl_snap_lock);
> - let snap_lock = lock_dir_noblock_shared(
> - &self.backup_dir.full_path(),
> - "snapshot",
> - "snapshot is already locked by another operation",
> - )?;
> -
> + let snap_lock = self.backup_dir.lock_shared().with_context(|| {
> + format!(
> + "while trying to verify snapshot '{:?}' after completion",
> + self.backup_dir
> + )
> + })?;
> let worker_id = format!(
> "{}:{}/{}/{:08X}",
> self.datastore.name(),
> diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
> index efc97a1f..344c80d4 100644
> --- a/src/api2/backup/mod.rs
> +++ b/src/api2/backup/mod.rs
> @@ -1,6 +1,6 @@
> //! Backup protocol (HTTP2 upgrade)
>
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use futures::*;
> use hex::FromHex;
> use hyper::header::{HeaderValue, CONNECTION, UPGRADE};
> @@ -17,7 +17,6 @@ use proxmox_router::{
> };
> use proxmox_schema::*;
> use proxmox_sortable_macro::sortable;
> -use proxmox_sys::fs::lock_dir_noblock_shared;
>
> use pbs_api_types::{
> ArchiveType, Authid, BackupNamespace, BackupType, Operation, VerifyState,
> @@ -186,12 +185,10 @@ fn upgrade_to_backup_protocol(
> }
>
> // lock last snapshot to prevent forgetting/pruning it during backup
> - let full_path = last.backup_dir.full_path();
> - Some(lock_dir_noblock_shared(
> - &full_path,
> - "snapshot",
> - "base snapshot is already locked by another operation",
> - )?)
> + let guard = last.backup_dir
> + .lock_shared()
> + .with_context(|| format!("while locking last snapshot during backup '{last:?}'"))?;
> + Some(guard)
> } else {
> None
> };
> diff --git a/src/api2/reader/mod.rs b/src/api2/reader/mod.rs
> index 1713f182..cc791299 100644
> --- a/src/api2/reader/mod.rs
> +++ b/src/api2/reader/mod.rs
> @@ -1,6 +1,6 @@
> //! Backup reader/restore protocol (HTTP2 upgrade)
>
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use futures::*;
> use hex::FromHex;
> use hyper::header::{self, HeaderValue, CONNECTION, UPGRADE};
> @@ -16,7 +16,6 @@ use proxmox_router::{
> };
> use proxmox_schema::{BooleanSchema, ObjectSchema};
> use proxmox_sortable_macro::sortable;
> -use proxmox_sys::fs::lock_dir_noblock_shared;
>
> use pbs_api_types::{
> ArchiveType, Authid, Operation, BACKUP_ARCHIVE_NAME_SCHEMA, BACKUP_ID_SCHEMA,
> @@ -129,11 +128,9 @@ fn upgrade_to_backup_reader_protocol(
> bail!("snapshot {} does not exist.", backup_dir.dir());
> }
>
> - let _guard = lock_dir_noblock_shared(
> - &backup_dir.full_path(),
> - "snapshot",
> - "locked by another operation",
> - )?;
> + let _guard = backup_dir
> + .lock_shared()
> + .with_context(|| format!("while reading snapshot '{backup_dir:?}'"))?;
>
> let path = datastore.base_path();
>
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index f652c262..10a64fda 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -4,10 +4,9 @@ use std::sync::{Arc, Mutex};
> use std::time::Instant;
>
> use anyhow::{bail, Error};
> -use nix::dir::Dir;
> +use proxmox_sys::fs::DirLockGuard;
> use tracing::{error, info, warn};
>
> -use proxmox_sys::fs::lock_dir_noblock_shared;
> use proxmox_worker_task::WorkerTaskContext;
>
> use pbs_api_types::{
> @@ -307,11 +306,8 @@ pub fn verify_backup_dir(
> return Ok(true);
> }
>
> - let snap_lock = lock_dir_noblock_shared(
> - &backup_dir.full_path(),
> - "snapshot",
> - "locked by another operation",
> - );
> + let snap_lock = backup_dir.lock_shared();
> +
> match snap_lock {
> Ok(snap_lock) => {
> verify_backup_dir_with_lock(verify_worker, backup_dir, upid, filter, snap_lock)
> @@ -334,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
> backup_dir: &BackupDir,
> upid: UPID,
> filter: Option<&dyn Fn(&BackupManifest) -> bool>,
> - _snap_lock: Dir,
> + _snap_lock: DirLockGuard,
> ) -> Result<bool, Error> {
> let datastore_name = verify_worker.datastore.name();
> let backup_dir_name = backup_dir.dir();
> diff --git a/src/server/sync.rs b/src/server/sync.rs
> index 4dd46c5a..b6852aff 100644
> --- a/src/server/sync.rs
> +++ b/src/server/sync.rs
> @@ -10,6 +10,7 @@ use std::time::Duration;
> use anyhow::{bail, format_err, Context, Error};
> use futures::{future::FutureExt, select};
> use hyper::http::StatusCode;
> +use proxmox_sys::fs::DirLockGuard;
> use serde_json::json;
> use tracing::{info, warn};
>
> @@ -105,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
> }
>
> pub(crate) struct LocalSourceReader {
> - pub(crate) _dir_lock: Arc<Mutex<proxmox_sys::fs::DirLockGuard>>,
> + pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
> pub(crate) path: PathBuf,
> pub(crate) datastore: Arc<DataStore>,
> }
> @@ -478,13 +479,11 @@ impl SyncSource for LocalSource {
> dir: &BackupDir,
> ) -> Result<Arc<dyn SyncSourceReader>, Error> {
> let dir = self.store.backup_dir(ns.clone(), dir.clone())?;
> - let dir_lock = proxmox_sys::fs::lock_dir_noblock_shared(
> - &dir.full_path(),
> - "snapshot",
> - "locked by another operation",
> - )?;
> + let guard = dir
> + .lock()
> + .with_context(|| format!("while reading snapshot '{dir:?}' for a sync job"))?;
> Ok(Arc::new(LocalSourceReader {
> - _dir_lock: Arc::new(Mutex::new(dir_lock)),
> + _dir_lock: Arc::new(Mutex::new(guard)),
> path: dir.full_path(),
> datastore: dir.datastore().clone(),
> }))
* Re: [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run'
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
@ 2025-03-25 9:43 ` Christian Ebner
2025-03-25 9:48 ` Christian Ebner
2025-03-25 11:25 ` Wolfgang Bumiller
2 siblings, 0 replies; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 9:43 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Shannon Sterz
On 3/24/25 13:51, Shannon Sterz wrote:
> to avoid issues when removing a group or snapshot directory where two
> threads hold a lock on the same directory, move locking to the
> tmpfs-backed '/run' directory. also adds double stat'ing to make it
> possible to remove locks without certain race condition issues.
>
> this new mechanism is only employed when we can be sure that a reboot
> has occurred so that all processes are using the new locking mechanism.
> otherwise, two separate process could assume they have exclusive
> rights to a group or snapshot.
>
> bumps the rust version to 1.81 so we can use `std::fs::exists` without
> issue.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> Cargo.toml | 2 +-
> debian/postinst | 5 +
> pbs-config/src/lib.rs | 32 ++++-
> pbs-datastore/Cargo.toml | 1 +
> pbs-datastore/src/backup_info.rs | 196 +++++++++++++++++++++++----
> pbs-datastore/src/datastore.rs | 6 +-
> pbs-datastore/src/snapshot_reader.rs | 17 ++-
> src/api2/backup/environment.rs | 5 +-
> src/backup/verify.rs | 4 +-
> src/server/sync.rs | 4 +-
> 10 files changed, 230 insertions(+), 42 deletions(-)
>
> diff --git a/Cargo.toml b/Cargo.toml
> index 306d50e9..645f2d81 100644
> --- a/Cargo.toml
> +++ b/Cargo.toml
> @@ -13,7 +13,7 @@ authors = [
> edition = "2021"
> license = "AGPL-3"
> repository = "https://git.proxmox.com/?p=proxmox-backup.git"
> -rust-version = "1.80"
> +rust-version = "1.81"
>
> [package]
> name = "proxmox-backup"
> diff --git a/debian/postinst b/debian/postinst
> index 0485ca34..81c15826 100644
> --- a/debian/postinst
> +++ b/debian/postinst
> @@ -80,6 +80,11 @@ EOF
> update_sync_job "$prev_job"
> fi
> fi
> +
> + if dpkg --compare-versions "$2" 'lt' '3.3.4~'; then
> + # ensure old locking is used by the daemon until a reboot happened
> + touch "/run/proxmox-backup/old-locking"
> + fi
> fi
> ;;
>
> diff --git a/pbs-config/src/lib.rs b/pbs-config/src/lib.rs
> index 20a8238d..9c4d77c2 100644
> --- a/pbs-config/src/lib.rs
> +++ b/pbs-config/src/lib.rs
> @@ -22,6 +22,8 @@ pub use config_version_cache::ConfigVersionCache;
>
> use anyhow::{format_err, Error};
> use nix::unistd::{Gid, Group, Uid, User};
> +use proxmox_sys::fs::DirLockGuard;
> +use std::os::unix::prelude::AsRawFd;
>
> pub use pbs_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
>
> @@ -46,13 +48,34 @@ pub fn backup_group() -> Result<nix::unistd::Group, Error> {
> }
>
> pub struct BackupLockGuard {
> - _file: Option<std::fs::File>,
> + file: Option<std::fs::File>,
> + // TODO: Remove `_legacy_dir` with PBS 5
> + _legacy_dir: Option<DirLockGuard>,
> +}
> +
> +impl AsRawFd for BackupLockGuard {
> + fn as_raw_fd(&self) -> i32 {
> + self.file.as_ref().map_or(-1, |f| f.as_raw_fd())
> + }
> +}
> +
> +// TODO: Remove with PBS 5
> +impl From<DirLockGuard> for BackupLockGuard {
> + fn from(value: DirLockGuard) -> Self {
> + Self {
> + file: None,
> + _legacy_dir: Some(value),
> + }
> + }
> }
>
> #[doc(hidden)]
> /// Note: do not use for production code, this is only intended for tests
> pub unsafe fn create_mocked_lock() -> BackupLockGuard {
> - BackupLockGuard { _file: None }
> + BackupLockGuard {
> + file: None,
> + _legacy_dir: None,
> + }
> }
>
> /// Open or create a lock file owned by user "backup" and lock it.
> @@ -76,7 +99,10 @@ pub fn open_backup_lockfile<P: AsRef<std::path::Path>>(
> let timeout = timeout.unwrap_or(std::time::Duration::new(10, 0));
>
> let file = proxmox_sys::fs::open_file_locked(&path, timeout, exclusive, options)?;
> - Ok(BackupLockGuard { _file: Some(file) })
> + Ok(BackupLockGuard {
> + file: Some(file),
> + _legacy_dir: None,
> + })
> }
>
> /// Atomically write data to file owned by "root:backup" with permission "0640"
> diff --git a/pbs-datastore/Cargo.toml b/pbs-datastore/Cargo.toml
> index 4ebc5fdc..7623adc2 100644
> --- a/pbs-datastore/Cargo.toml
> +++ b/pbs-datastore/Cargo.toml
> @@ -35,6 +35,7 @@ proxmox-lang.workspace=true
> proxmox-schema = { workspace = true, features = [ "api-macro" ] }
> proxmox-serde = { workspace = true, features = [ "serde_json" ] }
> proxmox-sys.workspace = true
> +proxmox-systemd.workspace = true
> proxmox-time.workspace = true
> proxmox-uuid.workspace = true
> proxmox-worker-task.workspace = true
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index c7f7ed53..b79e8196 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -1,13 +1,15 @@
> use std::fmt;
> -use std::os::unix::io::RawFd;
> +use std::os::unix::io::{AsRawFd, RawFd};
> +use std::os::unix::prelude::OsStrExt;
> +use std::path::Path;
> use std::path::PathBuf;
> -use std::sync::Arc;
> +use std::sync::{Arc, LazyLock};
> +use std::time::Duration;
>
> use anyhow::{bail, format_err, Context, Error};
>
> -use proxmox_sys::fs::{
> - lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
> -};
> +use proxmox_sys::fs::{lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions};
> +use proxmox_systemd::escape_unit;
>
> use pbs_api_types::{
> Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
> @@ -18,6 +20,18 @@ use pbs_config::{open_backup_lockfile, BackupLockGuard};
> use crate::manifest::{BackupManifest, MANIFEST_LOCK_NAME};
> use crate::{DataBlob, DataStore};
>
> +pub const DATASTORE_LOCKS_DIR: &str = "/run/proxmox-backup/locks";
> +
> +// TODO: Remove with PBS 5
> +// Note: The `expect()` here will only panic if we can neither confirm nor deny the existence
> +// of the file. This should only happen if a user messes with the `/run/proxmox-backup` directory.
> +// If that happens, a lot more should fail as we rely on the existence of the directory throughout
> +// the code. So just panic with a reasonable message.
> +static OLD_LOCKING: LazyLock<bool> = LazyLock::new(|| {
> + std::fs::exists("/run/proxmox-backup/old-locking")
> + .expect("cannot read `/run/proxmox-backup`, please check permissions")
> +});
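
Side note for readers: the gate above is a `LazyLock`, so the marker file is checked exactly once on first access and the result is cached for the lifetime of the process. A minimal std-only sketch of that pattern (the marker path is taken from the patch; the `unwrap_or(false)` fallback is only for this illustration, the patch deliberately panics instead):

```rust
use std::sync::LazyLock;

// Computed once on first deref, cached afterwards. `std::fs::exists` is the
// API whose availability motivates the rust-version bump to 1.81.
static OLD_LOCKING: LazyLock<bool> = LazyLock::new(|| {
    std::fs::exists("/run/proxmox-backup/old-locking").unwrap_or(false)
});

fn main() {
    // Later derefs reuse the cached bool; the closure does not run again.
    let first = *OLD_LOCKING;
    let second = *OLD_LOCKING;
    assert_eq!(first, second);
    println!("old locking in use: {first}");
}
```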
> +
> /// BackupGroup is a directory containing a list of BackupDir
> #[derive(Clone)]
> pub struct BackupGroup {
> @@ -225,6 +239,7 @@ impl BackupGroup {
> delete_stats.increment_removed_groups();
> }
>
> + let _ = std::fs::remove_file(self.lock_path());
> Ok(delete_stats)
> }
>
> @@ -241,13 +256,34 @@ impl BackupGroup {
> .set_owner(&self.ns, self.as_ref(), auth_id, force)
> }
>
> - /// Lock a group exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_group_path(),
> - "backup group",
> - "possible running backup",
> - )
> + /// Returns a file name for locking a group.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the group.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.group.ty.as_str()).join(&self.group.id);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> + }
> +
> + /// Locks a group exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_group_path(),
> + "backup group",
> + "possible running backup, group is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire backup group lock {p:?} - {err}"))
> + })
nit: should this be switched to use the anyhow error context instead,
and made sure that call sites log or display this to the user correctly?
> + }
> }
> }
>
> @@ -454,22 +490,54 @@ impl BackupDir {
> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
> }
>
> - /// Lock this snapshot exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use",
> - )
> + /// Returns a file name for locking a snapshot.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the snapshot.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.dir.group.ty.as_str())
> + .join(&self.dir.group.id)
> + .join(&self.backup_time_string);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> }
>
> - /// Acquire a shared lock on this snapshot
> - pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock_shared(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use, could not acquire shared lock",
> - )
> + /// Locks a snapshot exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire snapshot lock {p:?} - {err}"))
nit: same as above
> + })
> + }
> + }
> +
> + /// Acquires a shared lock on a snapshot.
> + pub fn lock_shared(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock_shared(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use, could not acquire shared lock",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), false).map_err(|err| {
> + format_err!("unable to acquire shared snapshot lock {p:?} - {err}")
> + })
nit: same as above
> + })
> + }
> }
>
> /// Destroy the whole snapshot, bails if it's protected
> @@ -494,11 +562,13 @@ impl BackupDir {
> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
> })?;
>
> - // the manifest doesn't exist anymore, no need to keep the lock (already done by guard?)
> + // remove no longer needed lock files
> if let Ok(path) = self.manifest_lock_path() {
> let _ = std::fs::remove_file(path); // ignore errors
> }
>
> + let _ = std::fs::remove_file(self.lock_path()); // ignore errors
> +
> Ok(())
> }
>
> @@ -692,3 +762,75 @@ fn list_backup_files<P: ?Sized + nix::NixPath>(
>
> Ok(files)
> }
> +
> +/// Creates a path to a lock file depending on the relative path of an object (snapshot, group,
> +/// manifest) in a datastore. First all namespaces will be concatenated with a colon (ns-folder).
> +/// Then the actual file name will depend on the length of the relative path without namespaces. If
> +/// it is shorter than 255 characters in its unit encoded form, then the unit encoded form will be
> +/// used directly. If not, the file name will consist of the first 80 characters, the last 80
> +/// characters and the hash of the unit encoded relative path without namespaces. It will also be
> +/// placed into a "hashed" subfolder in the namespace folder.
> +///
> +/// Examples:
> +///
> +/// - vm-100
> +/// - vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +/// - ns1:ns2:ns3:ns4:ns5:ns6:ns7/vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +///
> +/// A "hashed" lock file would look like this:
> +/// - ns1:ns2:ns3/hashed/$first_eighty...$last_eighty-$hash
> +fn lock_file_path_helper(ns: &BackupNamespace, path: PathBuf) -> PathBuf {
> + let to_return = PathBuf::from(
> + ns.components()
> + .map(String::from)
> + .reduce(|acc, n| format!("{acc}:{n}"))
> + .unwrap_or_default(),
> + );
> +
> + let path_bytes = path.as_os_str().as_bytes();
> +
> + let enc = escape_unit(path_bytes, true);
> +
> + if enc.len() < 255 {
> + return to_return.join(enc);
> + }
> +
> + let to_return = to_return.join("hashed");
> +
> + let first_eighty = &enc[..80];
> + let last_eighty = &enc[enc.len() - 80..];
> + let hash = hex::encode(openssl::sha::sha256(path_bytes));
> +
> + to_return.join(format!("{first_eighty}...{last_eighty}-{hash}"))
> +}
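
The length-capping scheme above can be sketched with std-only stand-ins: a plain string in place of the unit-escaped path, and `DefaultHasher` in place of SHA-256, purely so the example compiles without the systemd/openssl crates. The slicing assumes ASCII input, as the escaped form in the patch is.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: short names are used verbatim; long names are
// truncated to head + tail plus a hash, and moved into "hashed/".
fn lock_file_name(enc: &str) -> String {
    if enc.len() < 255 {
        return enc.to_string();
    }
    let mut h = DefaultHasher::new();
    enc.hash(&mut h);
    let first = &enc[..80];
    let last = &enc[enc.len() - 80..];
    format!("hashed/{first}...{last}-{:016x}", h.finish())
}

fn main() {
    assert_eq!(lock_file_name("vm-100"), "vm-100");
    let long = "x".repeat(300);
    let name = lock_file_name(&long);
    assert!(name.starts_with("hashed/"));
    assert!(name.contains("..."));
    println!("{name}");
}
```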
> +
> +/// Helps implement the double stat'ing procedure. It avoids certain race conditions upon lock
> +/// deletion.
> +///
> +/// It also creates the base directory for lock files.
> +fn lock_helper<F>(
> + store_name: &str,
> + path: &std::path::Path,
> + lock_fn: F,
> +) -> Result<BackupLockGuard, Error>
> +where
> + F: Fn(&std::path::Path) -> Result<BackupLockGuard, Error>,
> +{
> + let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(store_name);
> +
> + if let Some(parent) = path.parent() {
> + lock_dir = lock_dir.join(parent);
> + };
> +
> + std::fs::create_dir_all(&lock_dir)?;
> +
> + let lock = lock_fn(path)?;
> +
> + let stat = nix::sys::stat::fstat(lock.as_raw_fd())?.st_ino;
> +
> + if nix::sys::stat::stat(path).map_or(true, |st| stat != st.st_ino) {
> + bail!("could not acquire lock, another thread modified the lock file");
> + }
> +
> + Ok(lock)
> +}
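
The double-stat check in `lock_helper` can be illustrated with std only (no flock, hypothetical `/tmp` path): after opening the lock file, compare the inode of the open fd against a fresh `stat` of the path. If another process unlinked (and possibly recreated) the file in between, the inodes differ and the acquisition must be retried.

```rust
use std::fs::{self, File};
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Returns true if the open fd still refers to the file currently at `path`.
fn double_stat_ok(path: &Path, file: &File) -> std::io::Result<bool> {
    let fd_ino = file.metadata()?.ino(); // fstat() on the open fd
    // stat() on the path; if the file was unlinked, treat it as a mismatch
    Ok(fs::metadata(path).map(|m| m.ino()).ok() == Some(fd_ino))
}

fn main() -> std::io::Result<()> {
    let path = Path::new("/tmp/double-stat-demo.lck");
    let file = File::create(path)?;
    assert!(double_stat_ok(path, &file)?); // same inode: lock would be valid

    // simulate a concurrent deletion between open() and the check
    fs::remove_file(path)?;
    assert!(!double_stat_ok(path, &file)?); // path gone: must retry
    Ok(())
}
```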
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index 1cbd0038..1e6157c0 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -13,7 +13,6 @@ use proxmox_human_byte::HumanByte;
> use proxmox_schema::ApiType;
>
> use proxmox_sys::error::SysError;
> -use proxmox_sys::fs::DirLockGuard;
> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
> use proxmox_sys::linux::procfs::MountInfo;
> use proxmox_sys::process_locker::ProcessLockSharedGuard;
> @@ -24,6 +23,7 @@ use pbs_api_types::{
> DataStoreConfig, DatastoreFSyncLevel, DatastoreTuning, GarbageCollectionStatus,
> MaintenanceMode, MaintenanceType, Operation, UPID,
> };
> +use pbs_config::BackupLockGuard;
>
> use crate::backup_info::{BackupDir, BackupGroup};
> use crate::chunk_store::ChunkStore;
> @@ -778,7 +778,7 @@ impl DataStore {
> ns: &BackupNamespace,
> backup_group: &pbs_api_types::BackupGroup,
> auth_id: &Authid,
> - ) -> Result<(Authid, DirLockGuard), Error> {
> + ) -> Result<(Authid, BackupLockGuard), Error> {
> let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>
> // create intermediate path first
> @@ -816,7 +816,7 @@ impl DataStore {
> self: &Arc<Self>,
> ns: &BackupNamespace,
> backup_dir: &pbs_api_types::BackupDir,
> - ) -> Result<(PathBuf, bool, DirLockGuard), Error> {
> + ) -> Result<(PathBuf, bool, BackupLockGuard), Error> {
> let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
> let relative_path = backup_dir.relative_path();
>
> diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
> index edff19ef..20fd388a 100644
> --- a/pbs-datastore/src/snapshot_reader.rs
> +++ b/pbs-datastore/src/snapshot_reader.rs
> @@ -4,8 +4,11 @@ use std::path::Path;
> use std::rc::Rc;
> use std::sync::Arc;
>
> -use anyhow::{bail, Context, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use nix::dir::Dir;
> +use nix::fcntl::OFlag;
> +use nix::sys::stat::Mode;
> +use pbs_config::BackupLockGuard;
>
> use pbs_api_types::{
> print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
> @@ -26,6 +29,10 @@ pub struct SnapshotReader {
> datastore_name: String,
> file_list: Vec<String>,
> locked_dir: Dir,
> +
> + // while this is never read, the lock needs to be kept until the
> + // reader is dropped to ensure valid locking semantics
> + _lock: BackupLockGuard,
> }
>
> impl SnapshotReader {
> @@ -46,10 +53,15 @@ impl SnapshotReader {
> bail!("snapshot {} does not exist!", snapshot.dir());
> }
>
> - let locked_dir = snapshot
> + let lock = snapshot
> .lock_shared()
> .with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
>
> + let locked_dir =
> + Dir::open(&snapshot_path, OFlag::O_RDONLY, Mode::empty()).map_err(|err| {
> + format_err!("unable to open snapshot directory {snapshot_path:?} - {err}")
> + })?;
nit: should this be switched to use the anyhow error context, same as
right above?
> +
> let datastore_name = datastore.name().to_string();
> let manifest = match snapshot.load_manifest() {
> Ok((manifest, _)) => manifest,
> @@ -79,6 +91,7 @@ impl SnapshotReader {
> datastore_name,
> file_list,
> locked_dir,
> + _lock: lock,
> })
> }
>
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index 7943b576..3d541b46 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -1,5 +1,6 @@
> use anyhow::{bail, format_err, Context, Error};
> -use nix::dir::Dir;
> +use pbs_config::BackupLockGuard;
> +
> use std::collections::HashMap;
> use std::sync::{Arc, Mutex};
> use tracing::info;
> @@ -635,7 +636,7 @@ impl BackupEnvironment {
> /// If verify-new is set on the datastore, this will run a new verify task
> /// for the backup. If not, this will return and also drop the passed lock
> /// immediately.
> - pub fn verify_after_complete(&self, excl_snap_lock: Dir) -> Result<(), Error> {
> + pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
> self.ensure_finished()?;
>
> if !self.datastore.verify_new() {
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 10a64fda..3d2cba8a 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -1,10 +1,10 @@
> +use pbs_config::BackupLockGuard;
> use std::collections::HashSet;
> use std::sync::atomic::{AtomicUsize, Ordering};
> use std::sync::{Arc, Mutex};
> use std::time::Instant;
>
> use anyhow::{bail, Error};
> -use proxmox_sys::fs::DirLockGuard;
> use tracing::{error, info, warn};
>
> use proxmox_worker_task::WorkerTaskContext;
> @@ -330,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
> backup_dir: &BackupDir,
> upid: UPID,
> filter: Option<&dyn Fn(&BackupManifest) -> bool>,
> - _snap_lock: DirLockGuard,
> + _snap_lock: BackupLockGuard,
> ) -> Result<bool, Error> {
> let datastore_name = verify_worker.datastore.name();
> let backup_dir_name = backup_dir.dir();
> diff --git a/src/server/sync.rs b/src/server/sync.rs
> index b6852aff..10804b14 100644
> --- a/src/server/sync.rs
> +++ b/src/server/sync.rs
> @@ -10,7 +10,7 @@ use std::time::Duration;
> use anyhow::{bail, format_err, Context, Error};
> use futures::{future::FutureExt, select};
> use hyper::http::StatusCode;
> -use proxmox_sys::fs::DirLockGuard;
> +use pbs_config::BackupLockGuard;
> use serde_json::json;
> use tracing::{info, warn};
>
> @@ -106,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
> }
>
> pub(crate) struct LocalSourceReader {
> - pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
> + pub(crate) _dir_lock: Arc<Mutex<BackupLockGuard>>,
> pub(crate) path: PathBuf,
> pub(crate) datastore: Arc<DataStore>,
> }
* Re: [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method Shannon Sterz
@ 2025-03-25 9:44 ` Christian Ebner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 9:44 UTC (permalink / raw)
To: pbs-devel
On 3/24/25 13:51, Shannon Sterz wrote:
> adds double stat'ing and removes directory hierarchy to bring manifest
> locking in-line with other locks used by the BackupDir trait.
>
> if the old locking mechanism is still supposed to be used, this still
> falls back to the previous lock file. however, we already add double
> stat'ing since it is trivial to do here and should only provide better
> safety when it comes to removing locks.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> pbs-datastore/src/backup_info.rs | 45 ++++++++++++++++++++------------
> 1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index b79e8196..0602867a 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -469,25 +469,41 @@ impl BackupDir {
> /// Returns the filename to lock a manifest
> ///
> /// Also creates the basedir. The lockfile is located in
> - /// '/run/proxmox-backup/locks/{datastore}/[ns/{ns}/]+{type}/{id}/{timestamp}.index.json.lck'
> - fn manifest_lock_path(&self) -> Result<PathBuf, Error> {
> - let mut path = PathBuf::from(&format!("/run/proxmox-backup/locks/{}", self.store.name()));
> - path.push(self.relative_path());
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}.index.json.lck`
> + /// where rpath is the relative path of the snapshot.
> + fn manifest_lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
>
> - std::fs::create_dir_all(&path)?;
> - let ts = self.backup_time_string();
> - path.push(format!("{ts}{MANIFEST_LOCK_NAME}"));
> + let rpath = Path::new(self.dir.group.ty.as_str())
> + .join(&self.dir.group.id)
> + .join(&self.backup_time_string)
> + .join(MANIFEST_LOCK_NAME);
>
> - Ok(path)
> + path.join(lock_file_path_helper(&self.ns, rpath))
> }
>
> /// Locks the manifest of a snapshot, for example, to update or delete it.
> pub(crate) fn lock_manifest(&self) -> Result<BackupLockGuard, Error> {
> - let path = self.manifest_lock_path()?;
> + let path = if *OLD_LOCKING {
> + // old manifest lock path
> + let path = Path::new(DATASTORE_LOCKS_DIR)
> + .join(self.store.name())
> + .join(self.relative_path());
>
> - // actions locking the manifest should be relatively short, only wait a few seconds
> - open_backup_lockfile(&path, Some(std::time::Duration::from_secs(5)), true)
> - .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
> + std::fs::create_dir_all(&path)?;
> +
> + path.join(format!("{}{MANIFEST_LOCK_NAME}", self.backup_time_string()))
> + } else {
> + self.manifest_lock_path()
> + };
> +
> + lock_helper(self.store.name(), &path, |p| {
> + // update_manifest should never take a long time, so if
> + // someone else has the lock we can simply block a bit
> + // and should get it soon
> + open_backup_lockfile(p, Some(Duration::from_secs(5)), true)
> + .map_err(|err| format_err!("unable to acquire manifest lock {p:?} - {err}"))
nit: same as other patch. Should this use anyhow error context instead
of map_err?
> + })
> }
>
> /// Returns a file name for locking a snapshot.
> @@ -563,10 +579,7 @@ impl BackupDir {
> })?;
>
> // remove no longer needed lock files
> - if let Ok(path) = self.manifest_lock_path() {
> - let _ = std::fs::remove_file(path); // ignore errors
> - }
> -
> + let _ = std::fs::remove_file(self.manifest_lock_path()); // ignore errors
> let _ = std::fs::remove_file(self.lock_path()); // ignore errors
>
> Ok(())
* Re: [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run'
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
2025-03-25 9:43 ` Christian Ebner
@ 2025-03-25 9:48 ` Christian Ebner
2025-03-25 11:25 ` Wolfgang Bumiller
2 siblings, 0 replies; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 9:48 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Shannon Sterz
On 3/24/25 13:51, Shannon Sterz wrote:
> to avoid issues when removing a group or snapshot directory where two
> threads hold a lock to the same directory, move locking to the tmpfs
> backed '/run' directory. also adds double stat'ing to make it possible
> to remove locks without certain race condition issues.
>
> this new mechanism is only employed when we can be sure that a reboot
> has occurred so that all processes are using the new locking mechanism.
> otherwise, two separate processes could assume they have exclusive
> rights to a group or snapshot.
>
> bumps the rust version to 1.81 so we can use `std::fs::exists` without
> issue.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> Cargo.toml | 2 +-
> debian/postinst | 5 +
> pbs-config/src/lib.rs | 32 ++++-
> pbs-datastore/Cargo.toml | 1 +
> pbs-datastore/src/backup_info.rs | 196 +++++++++++++++++++++++----
> pbs-datastore/src/datastore.rs | 6 +-
> pbs-datastore/src/snapshot_reader.rs | 17 ++-
> src/api2/backup/environment.rs | 5 +-
> src/backup/verify.rs | 4 +-
> src/server/sync.rs | 4 +-
> 10 files changed, 230 insertions(+), 42 deletions(-)
>
> diff --git a/Cargo.toml b/Cargo.toml
> index 306d50e9..645f2d81 100644
> --- a/Cargo.toml
> +++ b/Cargo.toml
> @@ -13,7 +13,7 @@ authors = [
> edition = "2021"
> license = "AGPL-3"
> repository = "https://git.proxmox.com/?p=proxmox-backup.git"
> -rust-version = "1.80"
> +rust-version = "1.81"
>
> [package]
> name = "proxmox-backup"
> diff --git a/debian/postinst b/debian/postinst
> index 0485ca34..81c15826 100644
> --- a/debian/postinst
> +++ b/debian/postinst
> @@ -80,6 +80,11 @@ EOF
> update_sync_job "$prev_job"
> fi
> fi
> +
> + if dpkg --compare-versions "$2" 'lt' '3.3.4~'; then
> + # ensure old locking is used by the daemon until a reboot happened
> + touch "/run/proxmox-backup/old-locking"
> + fi
> fi
> ;;
>
> diff --git a/pbs-config/src/lib.rs b/pbs-config/src/lib.rs
> index 20a8238d..9c4d77c2 100644
> --- a/pbs-config/src/lib.rs
> +++ b/pbs-config/src/lib.rs
> @@ -22,6 +22,8 @@ pub use config_version_cache::ConfigVersionCache;
>
> use anyhow::{format_err, Error};
> use nix::unistd::{Gid, Group, Uid, User};
> +use proxmox_sys::fs::DirLockGuard;
> +use std::os::unix::prelude::AsRawFd;
>
> pub use pbs_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
>
> @@ -46,13 +48,34 @@ pub fn backup_group() -> Result<nix::unistd::Group, Error> {
> }
>
> pub struct BackupLockGuard {
> - _file: Option<std::fs::File>,
> + file: Option<std::fs::File>,
> + // TODO: Remove `_legacy_dir` with PBS 5
> + _legacy_dir: Option<DirLockGuard>,
> +}
> +
> +impl AsRawFd for BackupLockGuard {
> + fn as_raw_fd(&self) -> i32 {
> + self.file.as_ref().map_or(-1, |f| f.as_raw_fd())
> + }
> +}
> +
> +// TODO: Remove with PBS 5
> +impl From<DirLockGuard> for BackupLockGuard {
> + fn from(value: DirLockGuard) -> Self {
> + Self {
> + file: None,
> + _legacy_dir: Some(value),
> + }
> + }
> }
>
> #[doc(hidden)]
> /// Note: do not use for production code, this is only intended for tests
> pub unsafe fn create_mocked_lock() -> BackupLockGuard {
> - BackupLockGuard { _file: None }
> + BackupLockGuard {
> + file: None,
> + _legacy_dir: None,
> + }
> }
>
> /// Open or create a lock file owned by user "backup" and lock it.
> @@ -76,7 +99,10 @@ pub fn open_backup_lockfile<P: AsRef<std::path::Path>>(
> let timeout = timeout.unwrap_or(std::time::Duration::new(10, 0));
>
> let file = proxmox_sys::fs::open_file_locked(&path, timeout, exclusive, options)?;
> - Ok(BackupLockGuard { _file: Some(file) })
> + Ok(BackupLockGuard {
> + file: Some(file),
> + _legacy_dir: None,
> + })
> }
>
> /// Atomically write data to file owned by "root:backup" with permission "0640"
> diff --git a/pbs-datastore/Cargo.toml b/pbs-datastore/Cargo.toml
> index 4ebc5fdc..7623adc2 100644
> --- a/pbs-datastore/Cargo.toml
> +++ b/pbs-datastore/Cargo.toml
> @@ -35,6 +35,7 @@ proxmox-lang.workspace=true
> proxmox-schema = { workspace = true, features = [ "api-macro" ] }
> proxmox-serde = { workspace = true, features = [ "serde_json" ] }
> proxmox-sys.workspace = true
> +proxmox-systemd.workspace = true
> proxmox-time.workspace = true
> proxmox-uuid.workspace = true
> proxmox-worker-task.workspace = true
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index c7f7ed53..b79e8196 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -1,13 +1,15 @@
> use std::fmt;
> -use std::os::unix::io::RawFd;
> +use std::os::unix::io::{AsRawFd, RawFd};
> +use std::os::unix::prelude::OsStrExt;
> +use std::path::Path;
> use std::path::PathBuf;
> -use std::sync::Arc;
> +use std::sync::{Arc, LazyLock};
> +use std::time::Duration;
>
> use anyhow::{bail, format_err, Context, Error};
>
> -use proxmox_sys::fs::{
> - lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
> -};
> +use proxmox_sys::fs::{lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions};
> +use proxmox_systemd::escape_unit;
>
> use pbs_api_types::{
> Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
> @@ -18,6 +20,18 @@ use pbs_config::{open_backup_lockfile, BackupLockGuard};
> use crate::manifest::{BackupManifest, MANIFEST_LOCK_NAME};
> use crate::{DataBlob, DataStore};
>
> +pub const DATASTORE_LOCKS_DIR: &str = "/run/proxmox-backup/locks";
> +
> +// TODO: Remove with PBS 5
> +// Note: The `expect()` here will only panic if we can neither confirm nor deny the existence
> +// of the file. This should only happen if a user messes with the `/run/proxmox-backup` directory.
> +// If that happens, a lot more should fail as we rely on the existence of the directory throughout
> +// the code. So just panic with a reasonable message.
> +static OLD_LOCKING: LazyLock<bool> = LazyLock::new(|| {
> + std::fs::exists("/run/proxmox-backup/old-locking")
> + .expect("cannot read `/run/proxmox-backup`, please check permissions")
> +});
> +
> /// BackupGroup is a directory containing a list of BackupDir
> #[derive(Clone)]
> pub struct BackupGroup {
> @@ -225,6 +239,7 @@ impl BackupGroup {
> delete_stats.increment_removed_groups();
> }
>
> + let _ = std::fs::remove_file(self.lock_path());
> Ok(delete_stats)
> }
>
> @@ -241,13 +256,34 @@ impl BackupGroup {
> .set_owner(&self.ns, self.as_ref(), auth_id, force)
> }
>
> - /// Lock a group exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_group_path(),
> - "backup group",
> - "possible running backup",
> - )
> + /// Returns a file name for locking a group.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the group.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.group.ty.as_str()).join(&self.group.id);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> + }
> +
> + /// Locks a group exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_group_path(),
> + "backup group",
> + "possible running backup, group is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire backup group lock {p:?} - {err}"))
> + })
> + }
> }
> }
>
> @@ -454,22 +490,54 @@ impl BackupDir {
> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
> }
>
> - /// Lock this snapshot exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use",
> - )
> + /// Returns a file name for locking a snapshot.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the snapshot.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.dir.group.ty.as_str())
> + .join(&self.dir.group.id)
> + .join(&self.backup_time_string);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> }
>
> - /// Acquire a shared lock on this snapshot
> - pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock_shared(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use, could not acquire shared lock",
> - )
> + /// Locks a snapshot exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire snapshot lock {p:?} - {err}"))
> + })
> + }
> + }
> +
> + /// Acquires a shared lock on a snapshot.
> + pub fn lock_shared(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock_shared(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use, could not acquire shared lock",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), false).map_err(|err| {
> + format_err!("unable to acquire shared snapshot lock {p:?} - {err}")
> + })
> + })
> + }
> }
>
> /// Destroy the whole snapshot, bails if it's protected
> @@ -494,11 +562,13 @@ impl BackupDir {
> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
> })?;
>
> - // the manifest doesn't exist anymore, no need to keep the lock (already done by guard?)
> + // remove no longer needed lock files
> if let Ok(path) = self.manifest_lock_path() {
> let _ = std::fs::remove_file(path); // ignore errors
> }
>
> + let _ = std::fs::remove_file(self.lock_path()); // ignore errors
> +
> Ok(())
> }
>
> @@ -692,3 +762,75 @@ fn list_backup_files<P: ?Sized + nix::NixPath>(
>
> Ok(files)
> }
> +
> +/// Creates a path to a lock file depending on the relative path of an object (snapshot, group,
> +/// manifest) in a datastore. First all namespaces will be concatenated with a colon (ns-folder).
> +/// Then the actual file name will depend on the length of the relative path without namespaces. If
> +/// it is shorter than 255 characters in its unit encoded form, then the unit encoded form will be
> +/// used directly. If not, the file name will consist of the first 80 characters, the last 80
> +/// characters and the hash of the unit encoded relative path without namespaces. It will also be
> +/// placed into a "hashed" subfolder in the namespace folder.
> +///
> +/// Examples:
> +///
> +/// - vm-100
> +/// - vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +/// - ns1:ns2:ns3:ns4:ns5:ns6:ns7/vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +///
> +/// A "hashed" lock file would look like this:
> +/// - ns1:ns2:ns3/hashed/$first_eighty...$last_eighty-$hash
> +fn lock_file_path_helper(ns: &BackupNamespace, path: PathBuf) -> PathBuf {
> + let to_return = PathBuf::from(
> + ns.components()
> + .map(String::from)
> + .reduce(|acc, n| format!("{acc}:{n}"))
> + .unwrap_or_default(),
> + );
> +
> + let path_bytes = path.as_os_str().as_bytes();
> +
> + let enc = escape_unit(path_bytes, true);
> +
> + if enc.len() < 255 {
> + return to_return.join(enc);
> + }
> +
> + let to_return = to_return.join("hashed");
> +
> + let first_eighty = &enc[..80];
> + let last_eighty = &enc[enc.len() - 80..];
> + let hash = hex::encode(openssl::sha::sha256(path_bytes));
> +
> + to_return.join(format!("{first_eighty}...{last_eighty}-{hash}"))
> +}
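An editorial sketch of the naming rule above, with simplified stand-ins for the escaping and hashing (the patch itself uses proxmox's `escape_unit` and a hex-encoded openssl SHA-256 digest, which are not reproduced here):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-ins for illustration only: the patch uses `escape_unit`
// (systemd-style escaping) and an openssl SHA-256 hex digest.
fn escape(s: &str) -> String {
    s.replace('-', "\\x2d").replace(':', "\\x3a")
}

fn digest(s: &str) -> String {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    format!("{:016x}", h.finish())
}

// Mirrors the naming rule from the doc comment: encoded names shorter
// than 255 bytes are used directly, longer ones are shortened to
// `hashed/<first 80>...<last 80>-<digest>`.
fn lock_file_name(rpath: &str) -> String {
    let enc = escape(rpath);
    if enc.len() < 255 {
        return enc;
    }
    let first_eighty = &enc[..80];
    let last_eighty = &enc[enc.len() - 80..];
    format!("hashed/{first_eighty}...{last_eighty}-{}", digest(rpath))
}

fn main() {
    assert_eq!(lock_file_name("vm-100"), "vm\\x2d100");
    let long = "x".repeat(300);
    let shortened = lock_file_name(&long);
    assert!(shortened.starts_with("hashed/"));
    assert!(shortened.len() < 255);
}
```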
> +
> +/// Helps implement the double stat'ing procedure. It avoids certain race conditions upon lock
> +/// deletion.
> +///
> +/// It also creates the base directory for lock files.
> +fn lock_helper<F>(
> + store_name: &str,
> + path: &std::path::Path,
> + lock_fn: F,
> +) -> Result<BackupLockGuard, Error>
> +where
> + F: Fn(&std::path::Path) -> Result<BackupLockGuard, Error>,
> +{
> + let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(store_name);
> +
> + if let Some(parent) = path.parent() {
> + lock_dir = lock_dir.join(parent);
> + };
> +
> + std::fs::create_dir_all(&lock_dir)?;
> +
> + let lock = lock_fn(path)?;
> +
> + let stat = nix::sys::stat::fstat(lock.as_raw_fd())?.st_ino;
nit, forgot this one in my previous reply: This actually is the inode
number, so either rename the variable or use the full stat output as
binding and compare using stat.st_ino in the closure below.
> +
> + if nix::sys::stat::stat(path).map_or(true, |st| stat != st.st_ino) {
> + bail!("could not acquire lock, another thread modified the lock file");
> + }
> +
> + Ok(lock)
> +}
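For readers unfamiliar with the double-stat idiom, a minimal, self-contained sketch of the check (Linux only) — `File::open` stands in for the actual lock acquisition, which the patch does via `open_backup_lockfile`:

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// After the open, the inode of the held fd is compared with a fresh
// stat of the path: if another process unlinked and recreated the
// lock file in between, the inodes differ and the attempt is rejected.
fn checked_open(path: &Path) -> io::Result<File> {
    let file = File::open(path)?; // open + (conceptually) lock
    let fd_ino = file.metadata()?.ino(); // fstat on the held fd
    let path_ino = fs::metadata(path)?.ino(); // second stat on the path
    if fd_ino != path_ino {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            "lock file was replaced concurrently",
        ));
    }
    Ok(file)
}

fn main() -> io::Result<()> {
    let path = Path::new("/tmp/double-stat-demo.lock");
    fs::write(path, b"")?;
    // Normal case: both stats see the same inode.
    assert!(checked_open(path).is_ok());
    // Simulated race: unlink + recreate after the open; the old inode
    // stays pinned by the held fd, so the inode numbers differ.
    let stale = File::open(path)?;
    fs::remove_file(path)?;
    fs::write(path, b"")?;
    assert_ne!(stale.metadata()?.ino(), fs::metadata(path)?.ino());
    fs::remove_file(path)?;
    Ok(())
}
```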
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index 1cbd0038..1e6157c0 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -13,7 +13,6 @@ use proxmox_human_byte::HumanByte;
> use proxmox_schema::ApiType;
>
> use proxmox_sys::error::SysError;
> -use proxmox_sys::fs::DirLockGuard;
> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
> use proxmox_sys::linux::procfs::MountInfo;
> use proxmox_sys::process_locker::ProcessLockSharedGuard;
> @@ -24,6 +23,7 @@ use pbs_api_types::{
> DataStoreConfig, DatastoreFSyncLevel, DatastoreTuning, GarbageCollectionStatus,
> MaintenanceMode, MaintenanceType, Operation, UPID,
> };
> +use pbs_config::BackupLockGuard;
>
> use crate::backup_info::{BackupDir, BackupGroup};
> use crate::chunk_store::ChunkStore;
> @@ -778,7 +778,7 @@ impl DataStore {
> ns: &BackupNamespace,
> backup_group: &pbs_api_types::BackupGroup,
> auth_id: &Authid,
> - ) -> Result<(Authid, DirLockGuard), Error> {
> + ) -> Result<(Authid, BackupLockGuard), Error> {
> let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>
> // create intermediate path first
> @@ -816,7 +816,7 @@ impl DataStore {
> self: &Arc<Self>,
> ns: &BackupNamespace,
> backup_dir: &pbs_api_types::BackupDir,
> - ) -> Result<(PathBuf, bool, DirLockGuard), Error> {
> + ) -> Result<(PathBuf, bool, BackupLockGuard), Error> {
> let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
> let relative_path = backup_dir.relative_path();
>
> diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
> index edff19ef..20fd388a 100644
> --- a/pbs-datastore/src/snapshot_reader.rs
> +++ b/pbs-datastore/src/snapshot_reader.rs
> @@ -4,8 +4,11 @@ use std::path::Path;
> use std::rc::Rc;
> use std::sync::Arc;
>
> -use anyhow::{bail, Context, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use nix::dir::Dir;
> +use nix::fcntl::OFlag;
> +use nix::sys::stat::Mode;
> +use pbs_config::BackupLockGuard;
>
> use pbs_api_types::{
> print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
> @@ -26,6 +29,10 @@ pub struct SnapshotReader {
> datastore_name: String,
> file_list: Vec<String>,
> locked_dir: Dir,
> +
> + // while this is never read, the lock needs to be kept until the
> + // reader is dropped to ensure valid locking semantics
> + _lock: BackupLockGuard,
> }
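The comment above describes a common RAII pattern; a minimal sketch, with a hypothetical `Guard` standing in for `BackupLockGuard`:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Storing the guard in a struct field ties the lock's lifetime to the
// reader, so the lock is released exactly when the reader is dropped.
struct Guard(Arc<AtomicBool>);

impl Drop for Guard {
    fn drop(&mut self) {
        self.0.store(false, Ordering::SeqCst); // "unlock" on drop
    }
}

struct Reader {
    _lock: Guard, // never read, but keeps the lock held
}

fn main() {
    let held = Arc::new(AtomicBool::new(true));
    let reader = Reader {
        _lock: Guard(held.clone()),
    };
    assert!(held.load(Ordering::SeqCst)); // held while the reader lives
    drop(reader);
    assert!(!held.load(Ordering::SeqCst)); // released with the reader
}
```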
>
> impl SnapshotReader {
> @@ -46,10 +53,15 @@ impl SnapshotReader {
> bail!("snapshot {} does not exist!", snapshot.dir());
> }
>
> - let locked_dir = snapshot
> + let lock = snapshot
> .lock_shared()
> .with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
>
> + let locked_dir =
> + Dir::open(&snapshot_path, OFlag::O_RDONLY, Mode::empty()).map_err(|err| {
> + format_err!("unable to open snapshot directory {snapshot_path:?} - {err}")
> + })?;
> +
> let datastore_name = datastore.name().to_string();
> let manifest = match snapshot.load_manifest() {
> Ok((manifest, _)) => manifest,
> @@ -79,6 +91,7 @@ impl SnapshotReader {
> datastore_name,
> file_list,
> locked_dir,
> + _lock: lock,
> })
> }
>
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index 7943b576..3d541b46 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -1,5 +1,6 @@
> use anyhow::{bail, format_err, Context, Error};
> -use nix::dir::Dir;
> +use pbs_config::BackupLockGuard;
> +
> use std::collections::HashMap;
> use std::sync::{Arc, Mutex};
> use tracing::info;
> @@ -635,7 +636,7 @@ impl BackupEnvironment {
> /// If verify-new is set on the datastore, this will run a new verify task
> /// for the backup. If not, this will return and also drop the passed lock
> /// immediately.
> - pub fn verify_after_complete(&self, excl_snap_lock: Dir) -> Result<(), Error> {
> + pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
> self.ensure_finished()?;
>
> if !self.datastore.verify_new() {
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 10a64fda..3d2cba8a 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -1,10 +1,10 @@
> +use pbs_config::BackupLockGuard;
> use std::collections::HashSet;
> use std::sync::atomic::{AtomicUsize, Ordering};
> use std::sync::{Arc, Mutex};
> use std::time::Instant;
>
> use anyhow::{bail, Error};
> -use proxmox_sys::fs::DirLockGuard;
> use tracing::{error, info, warn};
>
> use proxmox_worker_task::WorkerTaskContext;
> @@ -330,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
> backup_dir: &BackupDir,
> upid: UPID,
> filter: Option<&dyn Fn(&BackupManifest) -> bool>,
> - _snap_lock: DirLockGuard,
> + _snap_lock: BackupLockGuard,
> ) -> Result<bool, Error> {
> let datastore_name = verify_worker.datastore.name();
> let backup_dir_name = backup_dir.dir();
> diff --git a/src/server/sync.rs b/src/server/sync.rs
> index b6852aff..10804b14 100644
> --- a/src/server/sync.rs
> +++ b/src/server/sync.rs
> @@ -10,7 +10,7 @@ use std::time::Duration;
> use anyhow::{bail, format_err, Context, Error};
> use futures::{future::FutureExt, select};
> use hyper::http::StatusCode;
> -use proxmox_sys::fs::DirLockGuard;
> +use pbs_config::BackupLockGuard;
> use serde_json::json;
> use tracing::{info, warn};
>
> @@ -106,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
> }
>
> pub(crate) struct LocalSourceReader {
> - pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
> + pub(crate) _dir_lock: Arc<Mutex<BackupLockGuard>>,
> pub(crate) path: PathBuf,
> pub(crate) datastore: Arc<DataStore>,
> }
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers
2025-03-25 9:35 ` Christian Ebner
@ 2025-03-25 9:57 ` Shannon Sterz
2025-03-25 10:12 ` Christian Ebner
0 siblings, 1 reply; 17+ messages in thread
From: Shannon Sterz @ 2025-03-25 9:57 UTC (permalink / raw)
To: Christian Ebner, Proxmox Backup Server development discussion
On Tue Mar 25, 2025 at 10:35 AM CET, Christian Ebner wrote:
> some small nits inline
>
> On 3/24/25 13:51, Shannon Sterz wrote:
>> to avoid duplicate code, add helpers for locking groups and snapshots
>> to the BackupGroup and BackupDir traits respectively and refactor
>> existing code to use them.
>>
>> this also adapts error handling by adding relevant context to each
>> locking helper call site. otherwise, we might lose valuable
>> information useful for debugging. note, however, that users that
>> relied on specific error messages will break.
>>
>> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
>> ---
-->8 snip 8<--
>> +
>> + /// Lock a group exclusively
>> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
>> + lock_dir_noblock(
>> + &self.full_group_path(),
>
> this allocates the group path...
>
>> + "backup group",
>> + "possible running backup",
>> + )
>> + }
>> }
>>
>> impl AsRef<pbs_api_types::BackupNamespace> for BackupGroup {
>> @@ -442,15 +454,33 @@ impl BackupDir {
>> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
>> }
>>
>> + /// Lock this snapshot exclusively
>> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
>> + lock_dir_noblock(
>> + &self.full_path(),
>
> ... this allocates the snapshot path ...
>
>> + "snapshot",
>> + "backup is running or snapshot is in use",
>> + )
>> + }
>> +
>> + /// Acquire a shared lock on this snapshot
>> + pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
>> + lock_dir_noblock_shared(
>> + &self.full_path(),
>> + "snapshot",
>> + "backup is running or snapshot is in use, could not acquire shared lock",
>> + )
>> + }
>> +
>> /// Destroy the whole snapshot, bails if it's protected
>> ///
>> /// Setting `force` to true skips locking and thus ignores if the backup is currently in use.
>> pub fn destroy(&self, force: bool) -> Result<(), Error> {
>> - let full_path = self.full_path();
>> -
>> let (_guard, _manifest_guard);
>> if !force {
>> - _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
>> + _guard = self
>> + .lock()
>> + .with_context(|| format!("while destroying snapshot '{self:?}'"))?;
>> _manifest_guard = self.lock_manifest()?;
>> }
>>
>> @@ -458,6 +488,7 @@ impl BackupDir {
>> bail!("cannot remove protected snapshot"); // use special error type?
>> }
>>
>> + let full_path = self.full_path();
>
> ... and here it reallocates the snapshot path again.
>
>> log::info!("removing backup snapshot {:?}", full_path);
>> std::fs::remove_dir_all(&full_path).map_err(|err| {
>> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index a6a91ca7..1cbd0038 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -5,7 +5,7 @@ use std::os::unix::io::AsRawFd;
>> use std::path::{Path, PathBuf};
>> use std::sync::{Arc, LazyLock, Mutex};
>>
>> -use anyhow::{bail, format_err, Error};
>> +use anyhow::{bail, format_err, Context, Error};
>> use nix::unistd::{unlinkat, UnlinkatFlags};
>> use tracing::{info, warn};
>>
>> @@ -13,8 +13,8 @@ use proxmox_human_byte::HumanByte;
>> use proxmox_schema::ApiType;
>>
>> use proxmox_sys::error::SysError;
>> +use proxmox_sys::fs::DirLockGuard;
>> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
>> -use proxmox_sys::fs::{lock_dir_noblock, DirLockGuard};
>> use proxmox_sys::linux::procfs::MountInfo;
>> use proxmox_sys::process_locker::ProcessLockSharedGuard;
>> use proxmox_worker_task::WorkerTaskContext;
>> @@ -774,41 +774,35 @@ impl DataStore {
>> ///
>> /// This also acquires an exclusive lock on the directory and returns the lock guard.
>> pub fn create_locked_backup_group(
>> - &self,
>> + self: &Arc<Self>,
>> ns: &BackupNamespace,
>> backup_group: &pbs_api_types::BackupGroup,
>> auth_id: &Authid,
>> ) -> Result<(Authid, DirLockGuard), Error> {
>> - // create intermediate path first:
>> - let mut full_path = self.base_path();
>> - for ns in ns.components() {
>> - full_path.push("ns");
>> - full_path.push(ns);
>> - }
>> - full_path.push(backup_group.ty.as_str());
>> - std::fs::create_dir_all(&full_path)?;
>> + let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>>
>> - full_path.push(&backup_group.id);
>> + // create intermediate path first
>> + let full_path = backup_group.full_group_path();
>
> ... same here, this reallocates the group path.
>
> I see that this will change with the new locking mechanism, in the
> following patches, but might it make sense to also generate the
> respective paths together with the locking in order to avoid double
> allocation?
>
> However not that relevant since it will get phased out once fully
> switched to the new locking mechanism anyways.
if i understand you correctly then we have basically two choices. a)
make the locking handlers take a parameter that will eventually be
ignored, which is the actual path of the group or b) let the locking
helper return the path.
both would mean that there is now more churn, as we are messing with the
signatures of the locking helpers now and when we remove the old locking
mechanism completely.
also i don't think having the path be allocated twice is that much of a
performance hit. especially considering that both options mean that
during the transition period we would always create the locking path and
the full path. even when already using the new locking mechanism where
we don't technically need the latter at all call sites anymore.
there is also option c) which could add a field to these structs like
`full_path: Option<PathBuf>` (or similar) that caches these allocations,
but that seems a bit like over-engineering to me tbh. at least i
wouldn't want to add that unless it has actual performance impact.
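for illustration, option c) could look roughly like the following sketch — field and method names here are hypothetical, not taken from the patch:

```rust
use std::cell::OnceCell;
use std::path::{Path, PathBuf};

// Lazily caches the computed path in the struct so repeated
// accessors don't reallocate.
struct Group {
    base: PathBuf,
    id: String,
    full_path: OnceCell<PathBuf>,
}

impl Group {
    fn full_path(&self) -> &Path {
        self.full_path.get_or_init(|| self.base.join(&self.id))
    }
}

fn main() {
    let group = Group {
        base: PathBuf::from("/datastore"),
        id: "vm-100".into(),
        full_path: OnceCell::new(),
    };
    // Both calls hand out the same cached allocation.
    let first = group.full_path() as *const Path;
    let second = group.full_path() as *const Path;
    assert_eq!(first, second);
    assert_eq!(group.full_path(), Path::new("/datastore/vm-100"));
}
```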
-->8 snip 8<--
* Re: [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner Shannon Sterz
@ 2025-03-25 10:00 ` Christian Ebner
2025-03-25 10:13 ` Shannon Sterz
0 siblings, 1 reply; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 10:00 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Shannon Sterz
On 3/24/25 13:51, Shannon Sterz wrote:
> when two clients change the owner of a backup store, a race condition
> could arise. add locking to avoid this.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> src/api2/admin/datastore.rs | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
> index 483e595c..3e532636 100644
> --- a/src/api2/admin/datastore.rs
> +++ b/src/api2/admin/datastore.rs
> @@ -7,7 +7,7 @@ use std::os::unix::ffi::OsStrExt;
> use std::path::{Path, PathBuf};
> use std::sync::Arc;
>
> -use anyhow::{bail, format_err, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use futures::*;
> use hyper::http::request::Parts;
> use hyper::{header, Body, Response, StatusCode};
> @@ -2347,10 +2347,9 @@ pub async fn set_backup_owner(
> let datastore = DataStore::lookup_datastore(&store, Some(Operation::Write))?;
>
> let backup_group = datastore.backup_group(ns, backup_group);
> + let owner = backup_group.get_owner()?;
this needs to read the content from the owner file
>
> if owner_check_required {
> - let owner = backup_group.get_owner()?;
> -
> let allowed = match (owner.is_token(), new_owner.is_token()) {
> (true, true) => {
> // API token to API token, owned by same user
> @@ -2397,6 +2396,14 @@ pub async fn set_backup_owner(
> );
> }
>
> + let _guard = backup_group
> + .lock()
> + .with_context(|| format!("while setting the owner of group '{backup_group:?}'"))?;
> +
> + if owner != backup_group.get_owner()? {
same here.
> + bail!("{owner} does not own this group anymore");
> + }
> +
> backup_group.set_owner(&new_owner, true)?;
>
> Ok(())
reading from the filesystem is more expensive than the ownership checks
if required, so IMO it is better to lock the whole operation instead of
reading the owner file twice. Or is this done to avoid locking over and
over by unauthorized users?
* Re: [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers
2025-03-25 9:57 ` Shannon Sterz
@ 2025-03-25 10:12 ` Christian Ebner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 10:12 UTC (permalink / raw)
To: Shannon Sterz, Proxmox Backup Server development discussion
On 3/25/25 10:57, Shannon Sterz wrote:
> On Tue Mar 25, 2025 at 10:35 AM CET, Christian Ebner wrote:
>> some small nits inline
>>
>> On 3/24/25 13:51, Shannon Sterz wrote:
>>> to avoid duplicate code, add helpers for locking groups and snapshots
>>> to the BackupGroup and BackupDir traits respectively and refactor
>>> existing code to use them.
>>>
>>> this also adapts error handling by adding relevant context to each
>>> locking helper call site. otherwise, we might lose valuable
>>> information useful for debugging. note, however, that users that
>>> relied on specific error messages will break.
>>>
>>> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
>>> ---
>
> -->8 snip 8<--
>
>>> +
>>> + /// Lock a group exclusively
>>> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
>>> + lock_dir_noblock(
>>> + &self.full_group_path(),
>>
>> this allocates the group path...
>>
>>> + "backup group",
>>> + "possible running backup",
>>> + )
>>> + }
>>> }
>>>
>>> impl AsRef<pbs_api_types::BackupNamespace> for BackupGroup {
>>> @@ -442,15 +454,33 @@ impl BackupDir {
>>> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
>>> }
>>>
>>> + /// Lock this snapshot exclusively
>>> + pub fn lock(&self) -> Result<DirLockGuard, Error> {
>>> + lock_dir_noblock(
>>> + &self.full_path(),
>>
>> ... this allocates the snapshot path ...
>>
>>> + "snapshot",
>>> + "backup is running or snapshot is in use",
>>> + )
>>> + }
>>> +
>>> + /// Acquire a shared lock on this snapshot
>>> + pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
>>> + lock_dir_noblock_shared(
>>> + &self.full_path(),
>>> + "snapshot",
>>> + "backup is running or snapshot is in use, could not acquire shared lock",
>>> + )
>>> + }
>>> +
>>> /// Destroy the whole snapshot, bails if it's protected
>>> ///
>>> /// Setting `force` to true skips locking and thus ignores if the backup is currently in use.
>>> pub fn destroy(&self, force: bool) -> Result<(), Error> {
>>> - let full_path = self.full_path();
>>> -
>>> let (_guard, _manifest_guard);
>>> if !force {
>>> - _guard = lock_dir_noblock(&full_path, "snapshot", "possibly running or in use")?;
>>> + _guard = self
>>> + .lock()
>>> + .with_context(|| format!("while destroying snapshot '{self:?}'"))?;
>>> _manifest_guard = self.lock_manifest()?;
>>> }
>>>
>>> @@ -458,6 +488,7 @@ impl BackupDir {
>>> bail!("cannot remove protected snapshot"); // use special error type?
>>> }
>>>
>>> + let full_path = self.full_path();
>>
>> ... and here it reallocates the snapshot path again.
>>
>>> log::info!("removing backup snapshot {:?}", full_path);
>>> std::fs::remove_dir_all(&full_path).map_err(|err| {
>>> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
>>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>>> index a6a91ca7..1cbd0038 100644
>>> --- a/pbs-datastore/src/datastore.rs
>>> +++ b/pbs-datastore/src/datastore.rs
>>> @@ -5,7 +5,7 @@ use std::os::unix::io::AsRawFd;
>>> use std::path::{Path, PathBuf};
>>> use std::sync::{Arc, LazyLock, Mutex};
>>>
>>> -use anyhow::{bail, format_err, Error};
>>> +use anyhow::{bail, format_err, Context, Error};
>>> use nix::unistd::{unlinkat, UnlinkatFlags};
>>> use tracing::{info, warn};
>>>
>>> @@ -13,8 +13,8 @@ use proxmox_human_byte::HumanByte;
>>> use proxmox_schema::ApiType;
>>>
>>> use proxmox_sys::error::SysError;
>>> +use proxmox_sys::fs::DirLockGuard;
>>> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
>>> -use proxmox_sys::fs::{lock_dir_noblock, DirLockGuard};
>>> use proxmox_sys::linux::procfs::MountInfo;
>>> use proxmox_sys::process_locker::ProcessLockSharedGuard;
>>> use proxmox_worker_task::WorkerTaskContext;
>>> @@ -774,41 +774,35 @@ impl DataStore {
>>> ///
>>> /// This also acquires an exclusive lock on the directory and returns the lock guard.
>>> pub fn create_locked_backup_group(
>>> - &self,
>>> + self: &Arc<Self>,
>>> ns: &BackupNamespace,
>>> backup_group: &pbs_api_types::BackupGroup,
>>> auth_id: &Authid,
>>> ) -> Result<(Authid, DirLockGuard), Error> {
>>> - // create intermediate path first:
>>> - let mut full_path = self.base_path();
>>> - for ns in ns.components() {
>>> - full_path.push("ns");
>>> - full_path.push(ns);
>>> - }
>>> - full_path.push(backup_group.ty.as_str());
>>> - std::fs::create_dir_all(&full_path)?;
>>> + let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>>>
>>> - full_path.push(&backup_group.id);
>>> + // create intermediate path first
>>> + let full_path = backup_group.full_group_path();
>>
>> ... same here, this reallocates the group path.
>>
>> I see that this will change with the new locking mechanism, in the
>> following patches, but might it make sense to also generate the
>> respective paths together with the locking in order to avoid double
>> allocation?
>>
>> However not that relevant since it will get phased out once fully
>> switched to the new locking mechanism anyways.
>
> if i understand you correctly then we have basically two choices. a)
> make the locking handlers take a parameter that will eventually be
> ignored, which is the actual path of the group or b) let the locking
> helper return the path.
>
> both would mean that there is now more churn, as we are messing with the
> signatures of the locking helpers now and when we remove the old locking
> mechanism completely.
>
> also i don't think having the path be allocated twice is that much of a
> performance hit. especially considering that both options mean that
> during the transition period we would always create the locking path and
> the full path. even when already using the new locking mechanism where
> we don't technically need the latter at all call sites anymore.
>
> there is also option c) which could add a field to these structs like
> `full_path: Option<PathBuf>` (or similar) that caches these allocations,
> but that seems a bit like over-engineering to me tbh. at least i
> wouldn't want to add that unless it has actual performance impact.
>
> -->8 snip 8<--
Agreed, just wanted to bring your attention to this in case it was not
considered. Nothing in the path creation callstack seems expensive
enough to justify any of your suggested approaches, also given that the
locking path will change anyways.
* Re: [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner
2025-03-25 10:00 ` Christian Ebner
@ 2025-03-25 10:13 ` Shannon Sterz
2025-03-25 10:18 ` Christian Ebner
0 siblings, 1 reply; 17+ messages in thread
From: Shannon Sterz @ 2025-03-25 10:13 UTC (permalink / raw)
To: Christian Ebner, Proxmox Backup Server development discussion
On Tue Mar 25, 2025 at 11:00 AM CET, Christian Ebner wrote:
> On 3/24/25 13:51, Shannon Sterz wrote:
>> when two clients change the owner of a backup store, a race condition
>> could arise. add locking to avoid this.
>>
>> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
>> ---
>> src/api2/admin/datastore.rs | 13 ++++++++++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
>> index 483e595c..3e532636 100644
>> --- a/src/api2/admin/datastore.rs
>> +++ b/src/api2/admin/datastore.rs
>> @@ -7,7 +7,7 @@ use std::os::unix::ffi::OsStrExt;
>> use std::path::{Path, PathBuf};
>> use std::sync::Arc;
>>
>> -use anyhow::{bail, format_err, Error};
>> +use anyhow::{bail, format_err, Context, Error};
>> use futures::*;
>> use hyper::http::request::Parts;
>> use hyper::{header, Body, Response, StatusCode};
>> @@ -2347,10 +2347,9 @@ pub async fn set_backup_owner(
>> let datastore = DataStore::lookup_datastore(&store, Some(Operation::Write))?;
>>
>> let backup_group = datastore.backup_group(ns, backup_group);
>> + let owner = backup_group.get_owner()?;
>
> this needs to read the content from the owner file
>
>>
>> if owner_check_required {
>> - let owner = backup_group.get_owner()?;
>> -
>> let allowed = match (owner.is_token(), new_owner.is_token()) {
>> (true, true) => {
>> // API token to API token, owned by same user
>> @@ -2397,6 +2396,14 @@ pub async fn set_backup_owner(
>> );
>> }
>>
>> + let _guard = backup_group
>> + .lock()
>> + .with_context(|| format!("while setting the owner of group '{backup_group:?}'"))?;
>> +
>> + if owner != backup_group.get_owner()? {
>
> same here.
>
>> + bail!("{owner} does not own this group anymore");
>> + }
>> +
>> backup_group.set_owner(&new_owner, true)?;
>>
>> Ok(())
>
> reading from the filesystem is more expensive than the ownership checks
> if required, so IMO it is better to lock the whole operation instead of
> reading the owner file twice. Or is this done to avoid locking over and
> over by unauthorized users?
yes, that function has `Permission::Anybody`. if you locked the file
right away, creating a dos attack by *any* logged in user for *any*
group would be trivial. getting the owner might be a bit more expensive,
but this way we can still handle multiple requests in parallel easily
and only hold the lock for the part that actually requires it. hence,
the dos potential is lower (imo) and group owners should not change
*that* often in normal operations, so the performance hit is acceptable.
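the ordering described above — cheap unlocked check first, then lock, then re-validate — can be sketched as follows; the `Mutex`/`String` pair is only a stand-in for the group lock and the on-disk owner file:

```rust
use std::sync::Mutex;

struct Group {
    owner: Mutex<String>,
}

impl Group {
    // Stand-in for reading the owner file without holding the lock.
    fn get_owner(&self) -> String {
        self.owner.lock().unwrap().clone()
    }

    fn set_owner_checked(&self, seen: &str, new_owner: &str) -> Result<(), String> {
        // ... permission checks against `seen` happen here, before the
        // lock is taken, so unauthorized callers never hold it ...
        let mut guard = self.owner.lock().unwrap();
        // ... then re-validate under the lock to detect races:
        if *guard != seen {
            return Err(format!("{seen} does not own this group anymore"));
        }
        *guard = new_owner.to_string();
        Ok(())
    }
}

fn main() {
    let group = Group {
        owner: Mutex::new("root@pam".to_string()),
    };
    let seen = group.get_owner();
    assert!(group.set_owner_checked(&seen, "alice@pbs").is_ok());
    // A second caller still holding the stale owner fails the recheck.
    assert!(group.set_owner_checked(&seen, "bob@pbs").is_err());
    assert_eq!(group.get_owner(), "alice@pbs");
}
```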
* Re: [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner
2025-03-25 10:13 ` Shannon Sterz
@ 2025-03-25 10:18 ` Christian Ebner
0 siblings, 0 replies; 17+ messages in thread
From: Christian Ebner @ 2025-03-25 10:18 UTC (permalink / raw)
To: Shannon Sterz, Proxmox Backup Server development discussion
On 3/25/25 11:13, Shannon Sterz wrote:
> On Tue Mar 25, 2025 at 11:00 AM CET, Christian Ebner wrote:
>> On 3/24/25 13:51, Shannon Sterz wrote:
>>> when two clients change the owner of a backup store, a race condition
>>> could arise. add locking to avoid this.
>>>
>>> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
>>> ---
>>> src/api2/admin/datastore.rs | 13 ++++++++++---
>>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
>>> index 483e595c..3e532636 100644
>>> --- a/src/api2/admin/datastore.rs
>>> +++ b/src/api2/admin/datastore.rs
>>> @@ -7,7 +7,7 @@ use std::os::unix::ffi::OsStrExt;
>>> use std::path::{Path, PathBuf};
>>> use std::sync::Arc;
>>>
>>> -use anyhow::{bail, format_err, Error};
>>> +use anyhow::{bail, format_err, Context, Error};
>>> use futures::*;
>>> use hyper::http::request::Parts;
>>> use hyper::{header, Body, Response, StatusCode};
>>> @@ -2347,10 +2347,9 @@ pub async fn set_backup_owner(
>>> let datastore = DataStore::lookup_datastore(&store, Some(Operation::Write))?;
>>>
>>> let backup_group = datastore.backup_group(ns, backup_group);
>>> + let owner = backup_group.get_owner()?;
>>
>> this needs to read the content from the owner file
>>
>>>
>>> if owner_check_required {
>>> - let owner = backup_group.get_owner()?;
>>> -
>>> let allowed = match (owner.is_token(), new_owner.is_token()) {
>>> (true, true) => {
>>> // API token to API token, owned by same user
>>> @@ -2397,6 +2396,14 @@ pub async fn set_backup_owner(
>>> );
>>> }
>>>
>>> + let _guard = backup_group
>>> + .lock()
>>> + .with_context(|| format!("while setting the owner of group '{backup_group:?}'"))?;
>>> +
>>> + if owner != backup_group.get_owner()? {
>>
>> same here.
>>
>>> + bail!("{owner} does not own this group anymore");
>>> + }
>>> +
>>> backup_group.set_owner(&new_owner, true)?;
>>>
>>> Ok(())
>>
>> reading from the filesystem is more expensive than the ownership checks
>> if required, so IMO it is better to lock the whole operation instead of
>> reading the owner file twice. Or is this done to avoid locking over and
>> over by unauthorized users?
>
> yes, that function has `Permission::Anybody`. if you locked the file
> right away, creating a dos attack by *any* logged in user for *any*
> group would be trivial. getting the owner might be a bit more expensive,
> but this way we can still handle multiple requests in parallel easily
> and only hold the lock for the part that actually requires it. hence,
> the dos potential is lower (imo) and group owners should not change
> *that* often in normal operations, so the performance hit is acceptable.
Acked, given that information it makes sense to me. Maybe the commit
message or a dedicated comment could explicitly mention this so this is
not changed unintentionally.
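[Editorial note: the check-lock-recheck pattern discussed above — a cheap unlocked read for the permission check, then taking the lock, then re-reading to detect a concurrent change — can be sketched in isolation. The types and names below are illustrative stand-ins, not the actual PBS `BackupGroup` API; the in-memory `Mutex` merely models the on-disk owner file and lock file.]

```rust
use std::sync::Mutex;

/// Illustrative stand-in for a backup group: `owner` models the on-disk
/// owner file, `lock` models the exclusive group lock file.
struct Group {
    owner: Mutex<String>,
    lock: Mutex<()>,
}

/// Check-lock-recheck: read the owner *without* the lock for the cheap
/// permission check (so unauthorized callers never contend on the lock),
/// then take the lock and re-read to detect a concurrent owner change
/// before committing the new owner.
fn set_owner(group: &Group, caller: &str, new_owner: &str) -> Result<(), String> {
    // 1. lock-free read for the permission check
    let owner = group.owner.lock().unwrap().clone();
    if owner != caller {
        return Err(format!("{caller} does not own this group"));
    }
    // 2. only now acquire the exclusive lock
    let _guard = group.lock.lock().unwrap();
    // 3. re-read under the lock: the owner may have changed in between
    if *group.owner.lock().unwrap() != owner {
        return Err(format!("{owner} does not own this group anymore"));
    }
    *group.owner.lock().unwrap() = new_owner.to_string();
    Ok(())
}
```

The point of step 3 is exactly the race the patch closes: between the unlocked read and the lock acquisition, another request may have changed the owner, so the decision made in step 1 must be re-validated.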
_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel
* Re: [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run'
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
2025-03-25 9:43 ` Christian Ebner
2025-03-25 9:48 ` Christian Ebner
@ 2025-03-25 11:25 ` Wolfgang Bumiller
2 siblings, 0 replies; 17+ messages in thread
From: Wolfgang Bumiller @ 2025-03-25 11:25 UTC (permalink / raw)
To: Shannon Sterz; +Cc: pbs-devel
On Mon, Mar 24, 2025 at 01:51:56PM +0100, Shannon Sterz wrote:
> to avoid issues when removing a group or snapshot directory where two
> threads hold a lock to the same directory, move locking to the tmpfs
> backed '/run' directory. also adds double stat'ing to make it possible
> to remove locks without certain race condition issues.
>
> this new mechanism is only employed when we can be sure that a reboot
> has occurred, so that all processes are using the new locking mechanism.
> otherwise, two separate processes could assume they have exclusive
> rights to a group or snapshot.
>
> bumps the rust version to 1.81 so we can use `std::fs::exists` without
> issue.
>
> Signed-off-by: Shannon Sterz <s.sterz@proxmox.com>
> ---
> Cargo.toml | 2 +-
> debian/postinst | 5 +
> pbs-config/src/lib.rs | 32 ++++-
> pbs-datastore/Cargo.toml | 1 +
> pbs-datastore/src/backup_info.rs | 196 +++++++++++++++++++++++----
> pbs-datastore/src/datastore.rs | 6 +-
> pbs-datastore/src/snapshot_reader.rs | 17 ++-
> src/api2/backup/environment.rs | 5 +-
> src/backup/verify.rs | 4 +-
> src/server/sync.rs | 4 +-
> 10 files changed, 230 insertions(+), 42 deletions(-)
>
> diff --git a/Cargo.toml b/Cargo.toml
> index 306d50e9..645f2d81 100644
> --- a/Cargo.toml
> +++ b/Cargo.toml
> @@ -13,7 +13,7 @@ authors = [
> edition = "2021"
> license = "AGPL-3"
> repository = "https://git.proxmox.com/?p=proxmox-backup.git"
> -rust-version = "1.80"
> +rust-version = "1.81"
>
> [package]
> name = "proxmox-backup"
> diff --git a/debian/postinst b/debian/postinst
> index 0485ca34..81c15826 100644
> --- a/debian/postinst
> +++ b/debian/postinst
> @@ -80,6 +80,11 @@ EOF
> update_sync_job "$prev_job"
> fi
> fi
> +
> + if dpkg --compare-versions "$2" 'lt' '3.3.4~'; then
> + # ensure old locking is used by the daemon until a reboot happened
> + touch "/run/proxmox-backup/old-locking"
> + fi
IMO we should accompany a change like this with a hunk like:
---8<---
diff --git a/debian/rules b/debian/rules
index a03fe11ba..9a605d735 100755
--- a/debian/rules
+++ b/debian/rules
@@ -22,6 +22,8 @@ ifneq ("$(wildcard .repoid)","")
endif
%:
+ echo fix locking version comparison in postinst
+ false
dh $@ --with=bash-completion
override_dh_auto_configure:
--->8---
to make sure we can't build broken packages once this is applied,
especially in case we make more bumps in between.
> fi
> ;;
>
> diff --git a/pbs-config/src/lib.rs b/pbs-config/src/lib.rs
> index 20a8238d..9c4d77c2 100644
> --- a/pbs-config/src/lib.rs
> +++ b/pbs-config/src/lib.rs
> @@ -22,6 +22,8 @@ pub use config_version_cache::ConfigVersionCache;
>
> use anyhow::{format_err, Error};
> use nix::unistd::{Gid, Group, Uid, User};
> +use proxmox_sys::fs::DirLockGuard;
> +use std::os::unix::prelude::AsRawFd;
>
> pub use pbs_buildcfg::{BACKUP_GROUP_NAME, BACKUP_USER_NAME};
>
> @@ -46,13 +48,34 @@ pub fn backup_group() -> Result<nix::unistd::Group, Error> {
> }
>
> pub struct BackupLockGuard {
> - _file: Option<std::fs::File>,
> + file: Option<std::fs::File>,
> + // TODO: Remove `_legacy_dir` with PBS 5
> + _legacy_dir: Option<DirLockGuard>,
> +}
> +
> +impl AsRawFd for BackupLockGuard {
> + fn as_raw_fd(&self) -> i32 {
> + self.file.as_ref().map_or(-1, |f| f.as_raw_fd())
> + }
> +}
> +
> +// TODO: Remove with PBS 5
> +impl From<DirLockGuard> for BackupLockGuard {
> + fn from(value: DirLockGuard) -> Self {
> + Self {
> + file: None,
> + _legacy_dir: Some(value),
> + }
> + }
> }
>
> #[doc(hidden)]
> /// Note: do not use for production code, this is only intended for tests
> pub unsafe fn create_mocked_lock() -> BackupLockGuard {
> - BackupLockGuard { _file: None }
> + BackupLockGuard {
> + file: None,
> + _legacy_dir: None,
> + }
> }
>
> /// Open or create a lock file owned by user "backup" and lock it.
> @@ -76,7 +99,10 @@ pub fn open_backup_lockfile<P: AsRef<std::path::Path>>(
> let timeout = timeout.unwrap_or(std::time::Duration::new(10, 0));
>
> let file = proxmox_sys::fs::open_file_locked(&path, timeout, exclusive, options)?;
> - Ok(BackupLockGuard { _file: Some(file) })
> + Ok(BackupLockGuard {
> + file: Some(file),
> + _legacy_dir: None,
> + })
> }
>
> /// Atomically write data to file owned by "root:backup" with permission "0640"
> diff --git a/pbs-datastore/Cargo.toml b/pbs-datastore/Cargo.toml
> index 4ebc5fdc..7623adc2 100644
> --- a/pbs-datastore/Cargo.toml
> +++ b/pbs-datastore/Cargo.toml
> @@ -35,6 +35,7 @@ proxmox-lang.workspace=true
> proxmox-schema = { workspace = true, features = [ "api-macro" ] }
> proxmox-serde = { workspace = true, features = [ "serde_json" ] }
> proxmox-sys.workspace = true
> +proxmox-systemd.workspace = true
> proxmox-time.workspace = true
> proxmox-uuid.workspace = true
> proxmox-worker-task.workspace = true
> diff --git a/pbs-datastore/src/backup_info.rs b/pbs-datastore/src/backup_info.rs
> index c7f7ed53..b79e8196 100644
> --- a/pbs-datastore/src/backup_info.rs
> +++ b/pbs-datastore/src/backup_info.rs
> @@ -1,13 +1,15 @@
> use std::fmt;
> -use std::os::unix::io::RawFd;
> +use std::os::unix::io::{AsRawFd, RawFd};
> +use std::os::unix::prelude::OsStrExt;
> +use std::path::Path;
> use std::path::PathBuf;
> -use std::sync::Arc;
> +use std::sync::{Arc, LazyLock};
> +use std::time::Duration;
>
> use anyhow::{bail, format_err, Context, Error};
>
> -use proxmox_sys::fs::{
> - lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions, DirLockGuard,
> -};
> +use proxmox_sys::fs::{lock_dir_noblock, lock_dir_noblock_shared, replace_file, CreateOptions};
> +use proxmox_systemd::escape_unit;
>
> use pbs_api_types::{
> Authid, BackupGroupDeleteStats, BackupNamespace, BackupType, GroupFilter, VerifyState,
> @@ -18,6 +20,18 @@ use pbs_config::{open_backup_lockfile, BackupLockGuard};
> use crate::manifest::{BackupManifest, MANIFEST_LOCK_NAME};
> use crate::{DataBlob, DataStore};
>
> +pub const DATASTORE_LOCKS_DIR: &str = "/run/proxmox-backup/locks";
> +
> +// TODO: Remove with PBS 5
> +// Note: The `expect()` call here will only happen if we can neither confirm nor deny the existence
> +// of the file. this should only happen if a user messes with the `/run/proxmox-backup` directory.
> +// if that happens, a lot more should fail as we rely on the existence of the directory throughout
> +// the code. so just panic with a reasonable message.
> +static OLD_LOCKING: LazyLock<bool> = LazyLock::new(|| {
> + std::fs::exists("/run/proxmox-backup/old-locking")
> + .expect("cannot read `/run/proxmox-backup`, please check permissions")
> +});
> +
> /// BackupGroup is a directory containing a list of BackupDir
> #[derive(Clone)]
> pub struct BackupGroup {
> @@ -225,6 +239,7 @@ impl BackupGroup {
> delete_stats.increment_removed_groups();
> }
>
> + let _ = std::fs::remove_file(self.lock_path());
> Ok(delete_stats)
> }
>
> @@ -241,13 +256,34 @@ impl BackupGroup {
> .set_owner(&self.ns, self.as_ref(), auth_id, force)
> }
>
> - /// Lock a group exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_group_path(),
> - "backup group",
> - "possible running backup",
> - )
> + /// Returns a file name for locking a group.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the group.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.group.ty.as_str()).join(&self.group.id);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> + }
> +
> + /// Locks a group exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_group_path(),
> + "backup group",
> +            "possible running backup, group is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire backup group lock {p:?} - {err}"))
> + })
> + }
> }
> }
>
> @@ -454,22 +490,54 @@ impl BackupDir {
> .map_err(|err| format_err!("unable to acquire manifest lock {:?} - {}", &path, err))
> }
>
> - /// Lock this snapshot exclusively
> - pub fn lock(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use",
> - )
> + /// Returns a file name for locking a snapshot.
> + ///
> + /// The lock file will be located in:
> + /// `${DATASTORE_LOCKS_DIR}/${datastore name}/${lock_file_path_helper(rpath)}`
> + /// where `rpath` is the relative path of the snapshot.
> + fn lock_path(&self) -> PathBuf {
> + let path = Path::new(DATASTORE_LOCKS_DIR).join(self.store.name());
> +
> + let rpath = Path::new(self.dir.group.ty.as_str())
> + .join(&self.dir.group.id)
> + .join(&self.backup_time_string);
> +
> + path.join(lock_file_path_helper(&self.ns, rpath))
> }
>
> - /// Acquire a shared lock on this snapshot
> - pub fn lock_shared(&self) -> Result<DirLockGuard, Error> {
> - lock_dir_noblock_shared(
> - &self.full_path(),
> - "snapshot",
> - "backup is running or snapshot is in use, could not acquire shared lock",
> - )
> + /// Locks a snapshot exclusively.
> + pub fn lock(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> + .map_err(|err| format_err!("unable to acquire snapshot lock {p:?} - {err}"))
> + })
> + }
> + }
> +
> + /// Acquires a shared lock on a snapshot.
> + pub fn lock_shared(&self) -> Result<BackupLockGuard, Error> {
> + if *OLD_LOCKING {
> + lock_dir_noblock_shared(
> + &self.full_path(),
> + "snapshot",
> + "backup is running or snapshot is in use, could not acquire shared lock",
> + )
> + .map(BackupLockGuard::from)
> + } else {
> + lock_helper(self.store.name(), &self.lock_path(), |p| {
> + open_backup_lockfile(p, Some(Duration::from_secs(0)), false).map_err(|err| {
> + format_err!("unable to acquire shared snapshot lock {p:?} - {err}")
> + })
> + })
> + }
> }
>
> /// Destroy the whole snapshot, bails if it's protected
> @@ -494,11 +562,13 @@ impl BackupDir {
> format_err!("removing backup snapshot {:?} failed - {}", full_path, err,)
> })?;
>
> - // the manifest doesn't exist anymore, no need to keep the lock (already done by guard?)
> + // remove no longer needed lock files
> if let Ok(path) = self.manifest_lock_path() {
> let _ = std::fs::remove_file(path); // ignore errors
> }
>
> + let _ = std::fs::remove_file(self.lock_path()); // ignore errors
> +
> Ok(())
> }
>
> @@ -692,3 +762,75 @@ fn list_backup_files<P: ?Sized + nix::NixPath>(
>
> Ok(files)
> }
> +
> +/// Creates a path to a lock file depending on the relative path of an object (snapshot, group,
> +/// manifest) in a datastore. First all namespaces will be concatenated with a colon (ns-folder).
> +/// Then the actual file name will depend on the length of the relative path without namespaces. If
> +/// it is shorter than 255 characters in its unit encoded form, then the unit encoded form will be
> +/// used directly. If not, the file name will consist of the first 80 characters, the last 80
> +/// characters and the hash of the unit encoded relative path without namespaces. It will also be
> +/// placed into a "hashed" subfolder in the namespace folder.
> +///
> +/// Examples:
> +///
> +/// - vm-100
> +/// - vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +/// - ns1:ns2:ns3:ns4:ns5:ns6:ns7/vm-100-2022\x2d05\x2d02T08\x3a11\x3a33Z
> +///
> +/// A "hashed" lock file would look like this:
> +/// - ns1:ns2:ns3/hashed/$first_eighty...$last_eighty-$hash
> +fn lock_file_path_helper(ns: &BackupNamespace, path: PathBuf) -> PathBuf {
> + let to_return = PathBuf::from(
> + ns.components()
> + .map(String::from)
> + .reduce(|acc, n| format!("{acc}:{n}"))
> + .unwrap_or_default(),
> + );
> +
> + let path_bytes = path.as_os_str().as_bytes();
> +
> + let enc = escape_unit(path_bytes, true);
> +
> + if enc.len() < 255 {
> + return to_return.join(enc);
> + }
> +
> + let to_return = to_return.join("hashed");
> +
> +    let first_eighty = &enc[..80];
> +    let last_eighty = &enc[enc.len() - 80..];
> +    let hash = hex::encode(openssl::sha::sha256(path_bytes));
> +
> +    to_return.join(format!("{first_eighty}...{last_eighty}-{hash}"))
> +}
> +
> +/// Helps implement the double stat'ing procedure. It avoids certain race conditions upon lock
> +/// deletion.
> +///
> +/// It also creates the base directory for lock files.
> +fn lock_helper<F>(
> + store_name: &str,
> + path: &std::path::Path,
> + lock_fn: F,
> +) -> Result<BackupLockGuard, Error>
> +where
> + F: Fn(&std::path::Path) -> Result<BackupLockGuard, Error>,
> +{
> + let mut lock_dir = Path::new(DATASTORE_LOCKS_DIR).join(store_name);
> +
> + if let Some(parent) = path.parent() {
> + lock_dir = lock_dir.join(parent);
> + };
> +
> + std::fs::create_dir_all(&lock_dir)?;
> +
> + let lock = lock_fn(path)?;
> +
> + let stat = nix::sys::stat::fstat(lock.as_raw_fd())?.st_ino;
> +
> + if nix::sys::stat::stat(path).map_or(true, |st| stat != st.st_ino) {
> + bail!("could not acquire lock, another thread modified the lock file");
> + }
> +
> + Ok(lock)
> +}
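[Editorial note: the double-stat check in `lock_helper` above can be sketched standalone. This is an illustrative simplification, not the actual helper: after acquiring a lock on a file descriptor, compare the inode behind the fd (fstat) with the inode the path currently resolves to (stat); if another process unlinked and recreated the lock file in between, the inodes differ and the lock we hold protects an orphaned file.]

```rust
use std::fs::File;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Returns true if the locked file descriptor still refers to the inode
/// currently at `path`, i.e. nobody unlinked and recreated the lock file
/// between open and lock.
fn lock_is_current(locked: &File, path: &Path) -> std::io::Result<bool> {
    let fd_ino = locked.metadata()?.ino(); // fstat on the fd we locked
    Ok(match std::fs::metadata(path) {
        Ok(st) => st.ino() == fd_ino, // stat on the path
        Err(_) => false,              // file vanished: lock is orphaned
    })
}
```

If this check fails, the caller must bail (as the patch does) rather than proceed, since another process may hold a lock on the freshly created file at the same path.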
> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
> index 1cbd0038..1e6157c0 100644
> --- a/pbs-datastore/src/datastore.rs
> +++ b/pbs-datastore/src/datastore.rs
> @@ -13,7 +13,6 @@ use proxmox_human_byte::HumanByte;
> use proxmox_schema::ApiType;
>
> use proxmox_sys::error::SysError;
> -use proxmox_sys::fs::DirLockGuard;
> use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
> use proxmox_sys::linux::procfs::MountInfo;
> use proxmox_sys::process_locker::ProcessLockSharedGuard;
> @@ -24,6 +23,7 @@ use pbs_api_types::{
> DataStoreConfig, DatastoreFSyncLevel, DatastoreTuning, GarbageCollectionStatus,
> MaintenanceMode, MaintenanceType, Operation, UPID,
> };
> +use pbs_config::BackupLockGuard;
>
> use crate::backup_info::{BackupDir, BackupGroup};
> use crate::chunk_store::ChunkStore;
> @@ -778,7 +778,7 @@ impl DataStore {
> ns: &BackupNamespace,
> backup_group: &pbs_api_types::BackupGroup,
> auth_id: &Authid,
> - ) -> Result<(Authid, DirLockGuard), Error> {
> + ) -> Result<(Authid, BackupLockGuard), Error> {
> let backup_group = self.backup_group(ns.clone(), backup_group.clone());
>
> // create intermediate path first
> @@ -816,7 +816,7 @@ impl DataStore {
> self: &Arc<Self>,
> ns: &BackupNamespace,
> backup_dir: &pbs_api_types::BackupDir,
> - ) -> Result<(PathBuf, bool, DirLockGuard), Error> {
> + ) -> Result<(PathBuf, bool, BackupLockGuard), Error> {
> let backup_dir = self.backup_dir(ns.clone(), backup_dir.clone())?;
> let relative_path = backup_dir.relative_path();
>
> diff --git a/pbs-datastore/src/snapshot_reader.rs b/pbs-datastore/src/snapshot_reader.rs
> index edff19ef..20fd388a 100644
> --- a/pbs-datastore/src/snapshot_reader.rs
> +++ b/pbs-datastore/src/snapshot_reader.rs
> @@ -4,8 +4,11 @@ use std::path::Path;
> use std::rc::Rc;
> use std::sync::Arc;
>
> -use anyhow::{bail, Context, Error};
> +use anyhow::{bail, format_err, Context, Error};
> use nix::dir::Dir;
> +use nix::fcntl::OFlag;
> +use nix::sys::stat::Mode;
> +use pbs_config::BackupLockGuard;
>
> use pbs_api_types::{
> print_store_and_ns, ArchiveType, BackupNamespace, Operation, CLIENT_LOG_BLOB_NAME,
> @@ -26,6 +29,10 @@ pub struct SnapshotReader {
> datastore_name: String,
> file_list: Vec<String>,
> locked_dir: Dir,
> +
> + // while this is never read, the lock needs to be kept until the
> + // reader is dropped to ensure valid locking semantics
> + _lock: BackupLockGuard,
> }
>
> impl SnapshotReader {
> @@ -46,10 +53,15 @@ impl SnapshotReader {
> bail!("snapshot {} does not exist!", snapshot.dir());
> }
>
> - let locked_dir = snapshot
> + let lock = snapshot
> .lock_shared()
> .with_context(|| format!("while trying to read snapshot '{snapshot:?}'"))?;
>
> + let locked_dir =
> + Dir::open(&snapshot_path, OFlag::O_RDONLY, Mode::empty()).map_err(|err| {
> + format_err!("unable to open snapshot directory {snapshot_path:?} - {err}")
> + })?;
> +
> let datastore_name = datastore.name().to_string();
> let manifest = match snapshot.load_manifest() {
> Ok((manifest, _)) => manifest,
> @@ -79,6 +91,7 @@ impl SnapshotReader {
> datastore_name,
> file_list,
> locked_dir,
> + _lock: lock,
> })
> }
>
> diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
> index 7943b576..3d541b46 100644
> --- a/src/api2/backup/environment.rs
> +++ b/src/api2/backup/environment.rs
> @@ -1,5 +1,6 @@
> use anyhow::{bail, format_err, Context, Error};
> -use nix::dir::Dir;
> +use pbs_config::BackupLockGuard;
> +
> use std::collections::HashMap;
> use std::sync::{Arc, Mutex};
> use tracing::info;
> @@ -635,7 +636,7 @@ impl BackupEnvironment {
> /// If verify-new is set on the datastore, this will run a new verify task
> /// for the backup. If not, this will return and also drop the passed lock
> /// immediately.
> - pub fn verify_after_complete(&self, excl_snap_lock: Dir) -> Result<(), Error> {
> + pub fn verify_after_complete(&self, excl_snap_lock: BackupLockGuard) -> Result<(), Error> {
> self.ensure_finished()?;
>
> if !self.datastore.verify_new() {
> diff --git a/src/backup/verify.rs b/src/backup/verify.rs
> index 10a64fda..3d2cba8a 100644
> --- a/src/backup/verify.rs
> +++ b/src/backup/verify.rs
> @@ -1,10 +1,10 @@
> +use pbs_config::BackupLockGuard;
> use std::collections::HashSet;
> use std::sync::atomic::{AtomicUsize, Ordering};
> use std::sync::{Arc, Mutex};
> use std::time::Instant;
>
> use anyhow::{bail, Error};
> -use proxmox_sys::fs::DirLockGuard;
> use tracing::{error, info, warn};
>
> use proxmox_worker_task::WorkerTaskContext;
> @@ -330,7 +330,7 @@ pub fn verify_backup_dir_with_lock(
> backup_dir: &BackupDir,
> upid: UPID,
> filter: Option<&dyn Fn(&BackupManifest) -> bool>,
> - _snap_lock: DirLockGuard,
> + _snap_lock: BackupLockGuard,
> ) -> Result<bool, Error> {
> let datastore_name = verify_worker.datastore.name();
> let backup_dir_name = backup_dir.dir();
> diff --git a/src/server/sync.rs b/src/server/sync.rs
> index b6852aff..10804b14 100644
> --- a/src/server/sync.rs
> +++ b/src/server/sync.rs
> @@ -10,7 +10,7 @@ use std::time::Duration;
> use anyhow::{bail, format_err, Context, Error};
> use futures::{future::FutureExt, select};
> use hyper::http::StatusCode;
> -use proxmox_sys::fs::DirLockGuard;
> +use pbs_config::BackupLockGuard;
> use serde_json::json;
> use tracing::{info, warn};
>
> @@ -106,7 +106,7 @@ pub(crate) struct RemoteSourceReader {
> }
>
> pub(crate) struct LocalSourceReader {
> - pub(crate) _dir_lock: Arc<Mutex<DirLockGuard>>,
> + pub(crate) _dir_lock: Arc<Mutex<BackupLockGuard>>,
> pub(crate) path: PathBuf,
> pub(crate) datastore: Arc<DataStore>,
> }
> --
> 2.39.5
>
* Re: [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
` (3 preceding siblings ...)
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner Shannon Sterz
@ 2025-03-25 11:26 ` Wolfgang Bumiller
2025-03-25 11:33 ` Shannon Sterz
4 siblings, 1 reply; 17+ messages in thread
From: Wolfgang Bumiller @ 2025-03-25 11:26 UTC (permalink / raw)
To: Shannon Sterz; +Cc: pbs-devel
The suggested anyhow::Context changes make sense, and maybe the d/rules
change I proposed in 2/4.
With those this is:
Acked-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
On Mon, Mar 24, 2025 at 01:51:54PM +0100, Shannon Sterz wrote:
> The goal of this series is to make it safer to remove backup groups &
> snapshots by separating the corresponding directories from their lock
> files. By moving the lock files to the tmpfs-backed '/run' directory,
> we also make sure that the lock files get cleaned up when the system
> reboots.
>
> This series refactors the locking mechanism inside the `DataStore`,
> `BackupDir` and `BackupGroup` traits. In a first step locking methods
> are added and the existing code is refactored to use them. Commit two
> derives a lock file name under '/run' for each group/snapshot. It also
> adds double stat'ing. To avoid issues when upgrading, the file
> `/run/proxmox-backup/old-locking` is created through a post-install
> hook which is used to determine whether the system has been rebooted
> and we can safely use the new locking mechanism.
>
> The third commit refactors locking for manifests and brings it in-line
> with the group/snapshot locks. Finally, the last commit fixes a race
> condition when changing the owner of a datastore.
>
> ----
> changes from v7 (thanks @ Christian Ebner):
> * use anyhow's `Context` to provide more context on the call site of a
> locking helper call
> * rebase on top of current master to apply cleanly again
>
> changes from v6:
> * add old locking safe guards to avoid different versions of the locking
> mechanism being used at the same time (see discussion here [1]).
>
> [1]: https://lore.proxmox.com/pbs-devel/20250306120810.361035-1-m.sandoval@proxmox.com/T/#u
>
> changes from v5:
> * re-phrase commit messages to make it clear which commit actually
> fixes the issue and what the commit implies in terms of semantic
> changes for error messages (thanks @ Thomas Lamprecht)
> * make it so the series applies cleanly again and clean up a newly
> added usage of `lock_dir_noblock`
>
> changes from v4 (thanks @ Wolfgang Bumiller):
> * stop using `to_string_lossy()`
> * switch function signature of `create_locked_backup_group()` and
> `create_locked_backup_dir` to use the `Arc` version of a datastore.
> * smaller clippy fixes
> * rebased on current master
>
> changes from v3:
> * moved patch 2 to the front so it can be applied separately more
> easily
> * rebased on current master
>
> changes from v2:
> * different encoding scheme for lock file names
> * refactored locking methods to be used by the new BackupDir and
> BackupGroup traits
> * adapted lock file names to include namespaces
>
> changes from v1 (thanks @ Wolfgang Bumiller & Thomas Lamprecht):
> * split adding locking helpers and move '/run' into two commits
> * instead of stat'ing the path of lock file twice, only use the file
> descriptor for one of the stat'ing procedures instead
> * several improvements to helper functions and documentation
>
> Shannon Sterz (4):
> datastore/api/backup: prepare for fix of #3935 by adding lock helpers
> fix #3935: datastore/api/backup: move datastore locking to '/run'
> fix #3935: datastore: move manifest locking to new locking method
> fix: api: avoid race condition in set_backup_owner
>
> Cargo.toml | 2 +-
> debian/postinst | 5 +
> pbs-config/src/lib.rs | 32 +++-
> pbs-datastore/Cargo.toml | 1 +
> pbs-datastore/src/backup_info.rs | 236 ++++++++++++++++++++++++---
> pbs-datastore/src/datastore.rs | 86 +++++-----
> pbs-datastore/src/snapshot_reader.rs | 20 ++-
> src/api2/admin/datastore.rs | 13 +-
> src/api2/backup/environment.rs | 21 +--
> src/api2/backup/mod.rs | 13 +-
> src/api2/reader/mod.rs | 11 +-
> src/backup/verify.rs | 12 +-
> src/server/sync.rs | 13 +-
> 13 files changed, 342 insertions(+), 123 deletions(-)
>
> --
> 2.39.5
* Re: [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs
2025-03-25 11:26 ` [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Wolfgang Bumiller
@ 2025-03-25 11:33 ` Shannon Sterz
0 siblings, 0 replies; 17+ messages in thread
From: Shannon Sterz @ 2025-03-25 11:33 UTC (permalink / raw)
To: Wolfgang Bumiller; +Cc: pbs-devel
On Tue Mar 25, 2025 at 12:26 PM CET, Wolfgang Bumiller wrote:
> The suggested anyhow::Context changes make sense, and maybe the d/rules
> change I proposed in 2/4.
>
> With those this is:
>
> Acked-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
alright, will add those changes and submit a v9. thanks!
>
> On Mon, Mar 24, 2025 at 01:51:54PM +0100, Shannon Sterz wrote:
>> The goal of this series is to make it safer to remove backup groups &
>> snapshots by separating the corresponding directories from their lock
>> files. By moving the lock files to the tmpfs-backed '/run' directory,
>> we also make sure that the lock files get cleaned up when the system
>> reboots.
>>
>> This series refactors the locking mechanism inside the `DataStore`,
>> `BackupDir` and `BackupGroup` traits. In a first step locking methods
>> are added and the existing code is refactored to use them. Commit two
>> derives a lock file name under '/run' for each group/snapshot. It also
>> adds double stat'ing. To avoid issues when upgrading, the file
>> `/run/proxmox-backup/old-locking` is created through a post-install
>> hook which is used to determine whether the system has been rebooted
>> and we can safely use the new locking mechanism.
>>
>> The third commit refactors locking for manifests and brings it in-line
>> with the group/snapshot locks. Finally, the last commit fixes a race
>> condition when changing the owner of a datastore.
>>
>> ----
>> changes from v7 (thanks @ Christian Ebner):
>> * use anyhow's `Context` to provide more context on the call site of a
>> locking helper call
>> * rebase on top of current master to apply cleanly again
>>
>> changes from v6:
>> * add old locking safe guards to avoid different versions of the locking
>> mechanism being used at the same time (see discussion here [1]).
>>
>> [1]: https://lore.proxmox.com/pbs-devel/20250306120810.361035-1-m.sandoval@proxmox.com/T/#u
>>
>> changes from v5:
>> * re-phrase commit messages to make it clear which commit actually
>> fixes the issue and what the commit implies in terms of semantic
>> changes for error messages (thanks @ Thomas Lamprecht)
>> * make it so the series applies cleanly again and clean up a newly
>> added usage of `lock_dir_noblock`
>>
>> changes from v4 (thanks @ Wolfgang Bumiller):
>> * stop using `to_string_lossy()`
>> * switch function signature of `create_locked_backup_group()` and
>> `create_locked_backup_dir` to use the `Arc` version of a datastore.
>> * smaller clippy fixes
>> * rebased on current master
>>
>> changes from v3:
>> * moved patch 2 to the front so it can be applied separately more
>> easily
>> * rebased on current master
>>
>> changes from v2:
>> * different encoding scheme for lock file names
>> * refactored locking methods to be used by the new BackupDir and
>> BackupGroup traits
>> * adapted lock file names to include namespaces
>>
>> changes from v1 (thanks @ Wolfgang Bumiller & Thomas Lamprecht):
>> * split adding locking helpers and move '/run' into two commits
>> * instead of stat'ing the path of lock file twice, only use the file
>> descriptor for one of the stat'ing procedures instead
>> * several improvements to helper functions and documentation
>>
>> Shannon Sterz (4):
>> datastore/api/backup: prepare for fix of #3935 by adding lock helpers
>> fix #3935: datastore/api/backup: move datastore locking to '/run'
>> fix #3935: datastore: move manifest locking to new locking method
>> fix: api: avoid race condition in set_backup_owner
>>
>> Cargo.toml | 2 +-
>> debian/postinst | 5 +
>> pbs-config/src/lib.rs | 32 +++-
>> pbs-datastore/Cargo.toml | 1 +
>> pbs-datastore/src/backup_info.rs | 236 ++++++++++++++++++++++++---
>> pbs-datastore/src/datastore.rs | 86 +++++-----
>> pbs-datastore/src/snapshot_reader.rs | 20 ++-
>> src/api2/admin/datastore.rs | 13 +-
>> src/api2/backup/environment.rs | 21 +--
>> src/api2/backup/mod.rs | 13 +-
>> src/api2/reader/mod.rs | 11 +-
>> src/backup/verify.rs | 12 +-
>> src/server/sync.rs | 13 +-
>> 13 files changed, 342 insertions(+), 123 deletions(-)
>>
>> --
>> 2.39.5
end of thread, other threads:[~2025-03-25 11:33 UTC | newest]
Thread overview: 17+ messages
2025-03-24 12:51 [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Shannon Sterz
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 1/4] datastore/api/backup: prepare for fix of #3935 by adding lock helpers Shannon Sterz
2025-03-25 9:35 ` Christian Ebner
2025-03-25 9:57 ` Shannon Sterz
2025-03-25 10:12 ` Christian Ebner
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 2/4] fix #3935: datastore/api/backup: move datastore locking to '/run' Shannon Sterz
2025-03-25 9:43 ` Christian Ebner
2025-03-25 9:48 ` Christian Ebner
2025-03-25 11:25 ` Wolfgang Bumiller
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 3/4] fix #3935: datastore: move manifest locking to new locking method Shannon Sterz
2025-03-25 9:44 ` Christian Ebner
2025-03-24 12:51 ` [pbs-devel] [PATCH proxmox-backup v8 4/4] fix: api: avoid race condition in set_backup_owner Shannon Sterz
2025-03-25 10:00 ` Christian Ebner
2025-03-25 10:13 ` Shannon Sterz
2025-03-25 10:18 ` Christian Ebner
2025-03-25 11:26 ` [pbs-devel] [PATCH proxmox-backup v8 0/4] refactor datastore locking to use tmpfs Wolfgang Bumiller
2025-03-25 11:33 ` Shannon Sterz