* [pbs-devel] [PATCH proxmox-backup 0/3] rework GC-related locks
@ 2025-10-01 11:19 Fabian Grünbichler
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic Fabian Grünbichler
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Fabian Grünbichler @ 2025-10-01 11:19 UTC (permalink / raw)
To: pbs-devel
this series reworks some of the locking behaviour surrounding GC.
the first patch removes a potentially problematic combination of
std::sync::Mutex and async S3 code (this should be the last such code
path), by storing the ProcessLocker guard inside the mutex while holding
it just for storing/querying/dropping that guard, instead of for the
duration of GC.
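roughly, the resulting pattern looks like the following stand-alone sketch
(ExclusiveGuard is only a stand-in for the ProcessLocker guard type here,
the actual change is in patch 1):

    use std::sync::Mutex;

    // stand-in for proxmox_sys::process_locker::ProcessLockExclusiveGuard
    struct ExclusiveGuard;

    struct GcState {
        // Some(_) while a GC run holds the exclusive process lock
        gc_mutex: Mutex<Option<ExclusiveGuard>>,
    }

    impl GcState {
        fn gc_running(&self) -> bool {
            // a GC is running if the mutex is momentarily held elsewhere,
            // or if a guard is currently stored inside it
            self.gc_mutex.try_lock().map(|g| g.is_some()).unwrap_or(true)
        }

        fn run_gc(&self) {
            // store the (elided) exclusive lock guard, holding the mutex
            // only for the assignment itself
            *self.gc_mutex.lock().unwrap() = Some(ExclusiveGuard);

            // ... the long-running GC work, including async S3 calls, runs
            // here without any std mutex being held across await points ...

            // drop the stored guard again once GC is done
            self.gc_mutex.lock().unwrap().take();
        }
    }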
the second and third patches get rid of racy interactions between
pre-reload backup sessions and post-reload GC tasks. while this race
doesn't lead to data loss, it breaks the expected priority of such
tasks.
Fabian Grünbichler (3):
GC: rework locking logic
backup env: keep a shared chunk store lock for duration of backup
index writers: remove chunk store lock
pbs-datastore/src/datastore.rs | 544 +++++++++++++++--------------
pbs-datastore/src/dynamic_index.rs | 6 +-
pbs-datastore/src/fixed_index.rs | 6 +-
src/api2/backup/environment.rs | 5 +
src/api2/backup/mod.rs | 4 +-
5 files changed, 295 insertions(+), 270 deletions(-)
--
2.47.3
* [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic
2025-10-01 11:19 [pbs-devel] [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
@ 2025-10-01 11:19 ` Fabian Grünbichler
2025-10-01 13:11 ` Christian Ebner
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 2/3] backup env: keep a shared chunk store lock for duration of backup Fabian Grünbichler
` (2 subsequent siblings)
3 siblings, 1 reply; 6+ messages in thread
From: Fabian Grünbichler @ 2025-10-01 11:19 UTC (permalink / raw)
To: pbs-devel
to avoid holding a std Mutex for the whole duration of GC, since GC now calls
into async code via S3.
instead, store the guard returned by the process locker inside the mutex, and
query its presence to determine whether a GC is currently running.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
best viewed with -w
pbs-datastore/src/datastore.rs | 542 +++++++++++++++++----------------
1 file changed, 283 insertions(+), 259 deletions(-)
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index 7cf020fc0..c821ac547 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -20,7 +20,7 @@ use proxmox_schema::ApiType;
use proxmox_sys::error::SysError;
use proxmox_sys::fs::{file_read_optional_string, replace_file, CreateOptions};
use proxmox_sys::linux::procfs::MountInfo;
-use proxmox_sys::process_locker::ProcessLockSharedGuard;
+use proxmox_sys::process_locker::{ProcessLockExclusiveGuard, ProcessLockSharedGuard};
use proxmox_time::TimeSpan;
use proxmox_worker_task::WorkerTaskContext;
@@ -136,7 +136,7 @@ pub fn ensure_datastore_is_mounted(config: &DataStoreConfig) -> Result<(), Error
/// management interface for backup.
pub struct DataStoreImpl {
chunk_store: Arc<ChunkStore>,
- gc_mutex: Mutex<()>,
+ gc_mutex: Mutex<Option<ProcessLockExclusiveGuard>>,
last_gc_status: Mutex<GarbageCollectionStatus>,
verify_new: bool,
chunk_order: ChunkOrder,
@@ -152,7 +152,7 @@ impl DataStoreImpl {
pub(crate) unsafe fn new_test() -> Arc<Self> {
Arc::new(Self {
chunk_store: Arc::new(unsafe { ChunkStore::panic_store() }),
- gc_mutex: Mutex::new(()),
+ gc_mutex: Mutex::new(None),
last_gc_status: Mutex::new(GarbageCollectionStatus::default()),
verify_new: false,
chunk_order: Default::default(),
@@ -513,7 +513,7 @@ impl DataStore {
Ok(DataStoreImpl {
chunk_store,
- gc_mutex: Mutex::new(()),
+ gc_mutex: Mutex::new(None),
last_gc_status: Mutex::new(gc_status),
verify_new: config.verify_new.unwrap_or(false),
chunk_order: tuning.chunk_order.unwrap_or_default(),
@@ -1505,7 +1505,10 @@ impl DataStore {
}
pub fn garbage_collection_running(&self) -> bool {
- self.inner.gc_mutex.try_lock().is_err()
+ let lock = self.inner.gc_mutex.try_lock();
+
+ // Mutex currently locked or contains an exclusive lock
+ lock.is_err() || lock.unwrap().is_some()
}
pub fn garbage_collection(
@@ -1513,263 +1516,284 @@ impl DataStore {
worker: &dyn WorkerTaskContext,
upid: &UPID,
) -> Result<(), Error> {
- if let Ok(ref mut _mutex) = self.inner.gc_mutex.try_lock() {
- // avoids that we run GC if an old daemon process has still a
- // running backup writer, which is not save as we have no "oldest
- // writer" information and thus no safe atime cutoff
- let _exclusive_lock = self.inner.chunk_store.try_exclusive_lock()?;
-
- let (config, _digest) = pbs_config::datastore::config()?;
- let gc_store_config: DataStoreConfig = config.lookup("datastore", self.name())?;
- let all_stores = config.convert_to_typed_array("datastore")?;
- if let Err(err) = gc_store_config.ensure_not_nested(&all_stores) {
- info!(
- "Current datastore path: {path}",
- path = gc_store_config.absolute_path()
- );
- bail!("Aborting GC for safety reasons: {err}");
- }
-
- let phase1_start_time = proxmox_time::epoch_i64();
- let oldest_writer = self
- .inner
- .chunk_store
- .oldest_writer()
- .unwrap_or(phase1_start_time);
-
- let mut gc_status = GarbageCollectionStatus {
- upid: Some(upid.to_string()),
- cache_stats: Some(GarbageCollectionCacheStats::default()),
- ..Default::default()
- };
- let tuning: DatastoreTuning = serde_json::from_value(
- DatastoreTuning::API_SCHEMA
- .parse_property_string(gc_store_config.tuning.as_deref().unwrap_or(""))?,
- )?;
-
- let s3_client = match self.backend()? {
- DatastoreBackend::Filesystem => None,
- DatastoreBackend::S3(s3_client) => {
- proxmox_async::runtime::block_on(s3_client.head_bucket())
- .context("failed to reach bucket")?;
- Some(s3_client)
- }
- };
-
- if tuning.gc_atime_safety_check.unwrap_or(true) {
- self.inner
- .chunk_store
- .check_fs_atime_updates(true, s3_client.clone())
- .context("atime safety check failed")?;
- info!("Access time update check successful, proceeding with GC.");
- } else {
- info!("Access time update check disabled by datastore tuning options.");
- };
-
- // Fallback to default 24h 5m if not set
- let cutoff = tuning
- .gc_atime_cutoff
- .map(|cutoff| cutoff * 60)
- .unwrap_or(3600 * 24 + 300);
-
- let mut min_atime = phase1_start_time - cutoff as i64;
- info!(
- "Using access time cutoff {}, minimum access time is {}",
- TimeSpan::from(Duration::from_secs(cutoff as u64)),
- proxmox_time::epoch_to_rfc3339_utc(min_atime)?,
- );
- if oldest_writer < min_atime {
- min_atime = oldest_writer - 300; // account for 5 min safety gap
- info!(
- "Oldest backup writer started at {}, extending minimum access time to {}",
- TimeSpan::from(Duration::from_secs(oldest_writer as u64)),
- proxmox_time::epoch_to_rfc3339_utc(min_atime)?,
- );
- }
-
- let tuning: DatastoreTuning = serde_json::from_value(
- DatastoreTuning::API_SCHEMA
- .parse_property_string(gc_store_config.tuning.as_deref().unwrap_or(""))?,
- )?;
- let gc_cache_capacity = if let Some(capacity) = tuning.gc_cache_capacity {
- info!("Using chunk digest cache capacity of {capacity}.");
- capacity
- } else {
- 1024 * 1024
- };
-
- info!("Start GC phase1 (mark used chunks)");
-
- self.mark_used_chunks(
- &mut gc_status,
- worker,
- gc_cache_capacity,
- s3_client.as_ref().cloned(),
- )
- .context("marking used chunks failed")?;
-
- info!("Start GC phase2 (sweep unused chunks)");
-
- if let Some(ref s3_client) = s3_client {
- let mut chunk_count = 0;
- let prefix = S3PathPrefix::Some(".chunks/".to_string());
- // Operates in batches of 1000 objects max per request
- let mut list_bucket_result =
- proxmox_async::runtime::block_on(s3_client.list_objects_v2(&prefix, None))
- .context("failed to list chunk in s3 object store")?;
-
- let mut delete_list = Vec::with_capacity(1000);
- loop {
- let lock = self.inner.chunk_store.mutex().lock().unwrap();
-
- for content in list_bucket_result.contents {
- if self
- .mark_chunk_for_object_key(
- &content.key,
- content.size,
- min_atime,
- oldest_writer,
- &mut delete_list,
- &mut gc_status,
- )
- .with_context(|| {
- format!("failed to mark chunk for object key {}", content.key)
- })?
- {
- chunk_count += 1;
- }
- }
-
- if !delete_list.is_empty() {
- let delete_objects_result = proxmox_async::runtime::block_on(
- s3_client.delete_objects(&delete_list),
- )?;
- if let Some(_err) = delete_objects_result.error {
- bail!("failed to delete some objects");
- }
- delete_list.clear();
- }
-
- drop(lock);
-
- // Process next batch of chunks if there is more
- if list_bucket_result.is_truncated {
- list_bucket_result =
- proxmox_async::runtime::block_on(s3_client.list_objects_v2(
- &prefix,
- list_bucket_result.next_continuation_token.as_deref(),
- ))?;
- continue;
- }
-
- break;
- }
- info!("processed {chunk_count} total chunks");
-
- // Phase 2 GC of Filesystem backed storage is phase 3 for S3 backed GC
- info!("Start GC phase3 (sweep unused chunk markers)");
-
- let mut tmp_gc_status = GarbageCollectionStatus {
- upid: Some(upid.to_string()),
- ..Default::default()
- };
- self.inner.chunk_store.sweep_unused_chunks(
- oldest_writer,
- min_atime,
- &mut tmp_gc_status,
- worker,
- )?;
- } else {
- self.inner.chunk_store.sweep_unused_chunks(
- oldest_writer,
- min_atime,
- &mut gc_status,
- worker,
- )?;
- }
-
- if let Some(cache_stats) = &gc_status.cache_stats {
- let total_cache_counts = cache_stats.hits + cache_stats.misses;
- if total_cache_counts > 0 {
- let cache_hit_ratio =
- (cache_stats.hits as f64 * 100.) / total_cache_counts as f64;
- info!(
- "Chunk cache: hits {}, misses {} (hit ratio {cache_hit_ratio:.2}%)",
- cache_stats.hits, cache_stats.misses,
- );
- }
- }
- info!(
- "Removed garbage: {}",
- HumanByte::from(gc_status.removed_bytes),
- );
- info!("Removed chunks: {}", gc_status.removed_chunks);
- if gc_status.pending_bytes > 0 {
- info!(
- "Pending removals: {} (in {} chunks)",
- HumanByte::from(gc_status.pending_bytes),
- gc_status.pending_chunks,
- );
- }
- if gc_status.removed_bad > 0 {
- info!("Removed bad chunks: {}", gc_status.removed_bad);
- }
-
- if gc_status.still_bad > 0 {
- info!("Leftover bad chunks: {}", gc_status.still_bad);
- }
-
- info!(
- "Original data usage: {}",
- HumanByte::from(gc_status.index_data_bytes),
- );
-
- if gc_status.index_data_bytes > 0 {
- let comp_per =
- (gc_status.disk_bytes as f64 * 100.) / gc_status.index_data_bytes as f64;
- info!(
- "On-Disk usage: {} ({comp_per:.2}%)",
- HumanByte::from(gc_status.disk_bytes)
- );
- }
-
- info!("On-Disk chunks: {}", gc_status.disk_chunks);
-
- let deduplication_factor = if gc_status.disk_bytes > 0 {
- (gc_status.index_data_bytes as f64) / (gc_status.disk_bytes as f64)
- } else {
- 1.0
- };
-
- info!("Deduplication factor: {deduplication_factor:.2}");
-
- if gc_status.disk_chunks > 0 {
- let avg_chunk = gc_status.disk_bytes / (gc_status.disk_chunks as u64);
- info!("Average chunk size: {}", HumanByte::from(avg_chunk));
- }
-
- if let Ok(serialized) = serde_json::to_string(&gc_status) {
- let mut path = self.base_path();
- path.push(".gc-status");
-
- let backup_user = pbs_config::backup_user()?;
- let mode = nix::sys::stat::Mode::from_bits_truncate(0o0644);
- // set the correct owner/group/permissions while saving file
- // owner(rw) = backup, group(r)= backup
- let options = CreateOptions::new()
- .perm(mode)
- .owner(backup_user.uid)
- .group(backup_user.gid);
-
- // ignore errors
- let _ = replace_file(path, serialized.as_bytes(), options, false);
- }
-
- *self.inner.last_gc_status.lock().unwrap() = gc_status;
- } else {
+ // avoids that we run GC if an old daemon process has still a
+ // running backup writer, which is not save as we have no "oldest
+ // writer" information and thus no safe atime cutoff
+ if self.garbage_collection_running() {
bail!("Start GC failed - (already running/locked)");
}
+ if let Ok(exclusive_lock) = self.inner.chunk_store.try_exclusive_lock() {
+ // keep the exclusive lock around for the duration of GC
+ let mut lock = self.inner.gc_mutex.lock().unwrap();
+ if lock.is_some() {
+ bail!("Obtained exclusive lock on datastore twice - this should never happen!");
+ }
+ *lock = Some(exclusive_lock);
+ drop(lock);
+ let res = self.garbage_collection_impl(worker, upid);
+
+ // and now drop it again
+ let mut lock = self.inner.gc_mutex.lock().unwrap();
+ if lock.take().is_none() {
+ warn!("Lost exclusive datastore lock during GC - this should never happen!");
+ }
+
+ res
+ } else {
+ bail!("Start GC failed - already running or backup writer in old process holds lock");
+ }
+ }
+
+ fn garbage_collection_impl(
+ &self,
+ worker: &dyn WorkerTaskContext,
+ upid: &UPID,
+ ) -> Result<(), Error> {
+ let (config, _digest) = pbs_config::datastore::config()?;
+ let gc_store_config: DataStoreConfig = config.lookup("datastore", self.name())?;
+ let all_stores = config.convert_to_typed_array("datastore")?;
+ if let Err(err) = gc_store_config.ensure_not_nested(&all_stores) {
+ info!(
+ "Current datastore path: {path}",
+ path = gc_store_config.absolute_path()
+ );
+ bail!("Aborting GC for safety reasons: {err}");
+ }
+
+ let phase1_start_time = proxmox_time::epoch_i64();
+ let oldest_writer = self
+ .inner
+ .chunk_store
+ .oldest_writer()
+ .unwrap_or(phase1_start_time);
+
+ let mut gc_status = GarbageCollectionStatus {
+ upid: Some(upid.to_string()),
+ cache_stats: Some(GarbageCollectionCacheStats::default()),
+ ..Default::default()
+ };
+ let tuning: DatastoreTuning = serde_json::from_value(
+ DatastoreTuning::API_SCHEMA
+ .parse_property_string(gc_store_config.tuning.as_deref().unwrap_or(""))?,
+ )?;
+
+ let s3_client = match self.backend()? {
+ DatastoreBackend::Filesystem => None,
+ DatastoreBackend::S3(s3_client) => {
+ proxmox_async::runtime::block_on(s3_client.head_bucket())
+ .context("failed to reach bucket")?;
+ Some(s3_client)
+ }
+ };
+
+ if tuning.gc_atime_safety_check.unwrap_or(true) {
+ self.inner
+ .chunk_store
+ .check_fs_atime_updates(true, s3_client.clone())
+ .context("atime safety check failed")?;
+ info!("Access time update check successful, proceeding with GC.");
+ } else {
+ info!("Access time update check disabled by datastore tuning options.");
+ };
+
+ // Fallback to default 24h 5m if not set
+ let cutoff = tuning
+ .gc_atime_cutoff
+ .map(|cutoff| cutoff * 60)
+ .unwrap_or(3600 * 24 + 300);
+
+ let mut min_atime = phase1_start_time - cutoff as i64;
+ info!(
+ "Using access time cutoff {}, minimum access time is {}",
+ TimeSpan::from(Duration::from_secs(cutoff as u64)),
+ proxmox_time::epoch_to_rfc3339_utc(min_atime)?,
+ );
+ if oldest_writer < min_atime {
+ min_atime = oldest_writer - 300; // account for 5 min safety gap
+ info!(
+ "Oldest backup writer started at {}, extending minimum access time to {}",
+ TimeSpan::from(Duration::from_secs(oldest_writer as u64)),
+ proxmox_time::epoch_to_rfc3339_utc(min_atime)?,
+ );
+ }
+
+ let tuning: DatastoreTuning = serde_json::from_value(
+ DatastoreTuning::API_SCHEMA
+ .parse_property_string(gc_store_config.tuning.as_deref().unwrap_or(""))?,
+ )?;
+ let gc_cache_capacity = if let Some(capacity) = tuning.gc_cache_capacity {
+ info!("Using chunk digest cache capacity of {capacity}.");
+ capacity
+ } else {
+ 1024 * 1024
+ };
+
+ info!("Start GC phase1 (mark used chunks)");
+
+ self.mark_used_chunks(
+ &mut gc_status,
+ worker,
+ gc_cache_capacity,
+ s3_client.as_ref().cloned(),
+ )
+ .context("marking used chunks failed")?;
+
+ info!("Start GC phase2 (sweep unused chunks)");
+
+ if let Some(ref s3_client) = s3_client {
+ let mut chunk_count = 0;
+ let prefix = S3PathPrefix::Some(".chunks/".to_string());
+ // Operates in batches of 1000 objects max per request
+ let mut list_bucket_result =
+ proxmox_async::runtime::block_on(s3_client.list_objects_v2(&prefix, None))
+ .context("failed to list chunk in s3 object store")?;
+
+ let mut delete_list = Vec::with_capacity(1000);
+ loop {
+ let lock = self.inner.chunk_store.mutex().lock().unwrap();
+
+ for content in list_bucket_result.contents {
+ if self
+ .mark_chunk_for_object_key(
+ &content.key,
+ content.size,
+ min_atime,
+ oldest_writer,
+ &mut delete_list,
+ &mut gc_status,
+ )
+ .with_context(|| {
+ format!("failed to mark chunk for object key {}", content.key)
+ })?
+ {
+ chunk_count += 1;
+ }
+ }
+
+ if !delete_list.is_empty() {
+ let delete_objects_result =
+ proxmox_async::runtime::block_on(s3_client.delete_objects(&delete_list))?;
+ if let Some(_err) = delete_objects_result.error {
+ bail!("failed to delete some objects");
+ }
+ delete_list.clear();
+ }
+
+ drop(lock);
+
+ // Process next batch of chunks if there is more
+ if list_bucket_result.is_truncated {
+ list_bucket_result =
+ proxmox_async::runtime::block_on(s3_client.list_objects_v2(
+ &prefix,
+ list_bucket_result.next_continuation_token.as_deref(),
+ ))?;
+ continue;
+ }
+
+ break;
+ }
+ info!("processed {chunk_count} total chunks");
+
+ // Phase 2 GC of Filesystem backed storage is phase 3 for S3 backed GC
+ info!("Start GC phase3 (sweep unused chunk markers)");
+
+ let mut tmp_gc_status = GarbageCollectionStatus {
+ upid: Some(upid.to_string()),
+ ..Default::default()
+ };
+ self.inner.chunk_store.sweep_unused_chunks(
+ oldest_writer,
+ min_atime,
+ &mut tmp_gc_status,
+ worker,
+ )?;
+ } else {
+ self.inner.chunk_store.sweep_unused_chunks(
+ oldest_writer,
+ min_atime,
+ &mut gc_status,
+ worker,
+ )?;
+ }
+
+ if let Some(cache_stats) = &gc_status.cache_stats {
+ let total_cache_counts = cache_stats.hits + cache_stats.misses;
+ if total_cache_counts > 0 {
+ let cache_hit_ratio = (cache_stats.hits as f64 * 100.) / total_cache_counts as f64;
+ info!(
+ "Chunk cache: hits {}, misses {} (hit ratio {cache_hit_ratio:.2}%)",
+ cache_stats.hits, cache_stats.misses,
+ );
+ }
+ }
+ info!(
+ "Removed garbage: {}",
+ HumanByte::from(gc_status.removed_bytes),
+ );
+ info!("Removed chunks: {}", gc_status.removed_chunks);
+ if gc_status.pending_bytes > 0 {
+ info!(
+ "Pending removals: {} (in {} chunks)",
+ HumanByte::from(gc_status.pending_bytes),
+ gc_status.pending_chunks,
+ );
+ }
+ if gc_status.removed_bad > 0 {
+ info!("Removed bad chunks: {}", gc_status.removed_bad);
+ }
+
+ if gc_status.still_bad > 0 {
+ info!("Leftover bad chunks: {}", gc_status.still_bad);
+ }
+
+ info!(
+ "Original data usage: {}",
+ HumanByte::from(gc_status.index_data_bytes),
+ );
+
+ if gc_status.index_data_bytes > 0 {
+ let comp_per = (gc_status.disk_bytes as f64 * 100.) / gc_status.index_data_bytes as f64;
+ info!(
+ "On-Disk usage: {} ({comp_per:.2}%)",
+ HumanByte::from(gc_status.disk_bytes)
+ );
+ }
+
+ info!("On-Disk chunks: {}", gc_status.disk_chunks);
+
+ let deduplication_factor = if gc_status.disk_bytes > 0 {
+ (gc_status.index_data_bytes as f64) / (gc_status.disk_bytes as f64)
+ } else {
+ 1.0
+ };
+
+ info!("Deduplication factor: {deduplication_factor:.2}");
+
+ if gc_status.disk_chunks > 0 {
+ let avg_chunk = gc_status.disk_bytes / (gc_status.disk_chunks as u64);
+ info!("Average chunk size: {}", HumanByte::from(avg_chunk));
+ }
+
+ if let Ok(serialized) = serde_json::to_string(&gc_status) {
+ let mut path = self.base_path();
+ path.push(".gc-status");
+
+ let backup_user = pbs_config::backup_user()?;
+ let mode = nix::sys::stat::Mode::from_bits_truncate(0o0644);
+ // set the correct owner/group/permissions while saving file
+ // owner(rw) = backup, group(r)= backup
+ let options = CreateOptions::new()
+ .perm(mode)
+ .owner(backup_user.uid)
+ .group(backup_user.gid);
+
+ // ignore errors
+ let _ = replace_file(path, serialized.as_bytes(), options, false);
+ }
+
+ *self.inner.last_gc_status.lock().unwrap() = gc_status;
Ok(())
}
--
2.47.3
* [pbs-devel] [PATCH proxmox-backup 2/3] backup env: keep a shared chunk store lock for duration of backup
2025-10-01 11:19 [pbs-devel] [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic Fabian Grünbichler
@ 2025-10-01 11:19 ` Fabian Grünbichler
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 3/3] index writers: remove chunk store lock Fabian Grünbichler
2025-10-02 7:29 ` [pbs-devel] superseded: [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
3 siblings, 0 replies; 6+ messages in thread
From: Fabian Grünbichler @ 2025-10-01 11:19 UTC (permalink / raw)
To: pbs-devel
instead of relying on individual index writers obtaining it.
this closes a race window by extending the existence of "backup writers" (as
defined for GC cutoff calculation purposes) from the start of the backup task
until the backup is finished, at which point no more index writers exist by
definition.
the race windows look like this:
backup session is started
backup-proxy is reloaded
...
index writer (in old process!) is done and releases shared lock on the chunk store
new proxy process starts a GC which takes exclusive lock
backup in old process fails to obtain shared lock for next index and is aborted
or this:
backup session is started
backup-proxy is reloaded
...
backup session is still busy downloading previous snapshot
new proxy process starts a GC which takes exclusive lock
backup in old process fails to obtain shared lock for first index and is aborted
since the intended semantics are for backup sessions to take priority over GC,
this is unwanted behaviour.
GC itself is still safe, as it only looks at indices and not the manifest, so
the third race window (no shared lock being held by the backup env between the
last index writer being closed and the session itself being finished) has no
practical effect.
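as a simplified sketch of what this patch does (guard types are stand-ins,
the real fields and the drop site are in the diff below): the shared chunk
store guard now lives next to the other per-session lock guards and is only
released once the backup session is finished:

    // stand-ins for pbs_config::BackupLockGuard and ProcessLockSharedGuard
    struct BackupLockGuard;
    struct SharedGuard;

    struct BackupLockGuards {
        previous_snapshot: Option<BackupLockGuard>,
        group: Option<BackupLockGuard>,
        snapshot: Option<BackupLockGuard>,
        // taken when the session starts, independent of any index writer
        chunk_store: Option<SharedGuard>,
    }

    impl BackupLockGuards {
        fn finish(&mut self) {
            // only dropped once the whole backup is finished, so a GC
            // started by a freshly reloaded daemon keeps seeing a "backup
            // writer" for the entire lifetime of the old session
            self.previous_snapshot.take();
            self.chunk_store.take();
        }
    }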
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
the next patch will drop the locks held by the index writers, since their only
use case was protecting the backup session, which is now covered by the backup
env itself..
src/api2/backup/environment.rs | 5 +++++
src/api2/backup/mod.rs | 4 +++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/api2/backup/environment.rs b/src/api2/backup/environment.rs
index fa1444ab1..8dc274a29 100644
--- a/src/api2/backup/environment.rs
+++ b/src/api2/backup/environment.rs
@@ -1,5 +1,6 @@
use anyhow::{bail, format_err, Context, Error};
use pbs_config::BackupLockGuard;
+use proxmox_sys::process_locker::ProcessLockSharedGuard;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
@@ -103,6 +104,7 @@ pub struct BackupLockGuards {
previous_snapshot: Option<BackupLockGuard>,
group: Option<BackupLockGuard>,
snapshot: Option<BackupLockGuard>,
+ chunk_store: Option<ProcessLockSharedGuard>,
}
impl BackupLockGuards {
@@ -110,11 +112,13 @@ impl BackupLockGuards {
previous_snapshot: Option<BackupLockGuard>,
group: BackupLockGuard,
snapshot: BackupLockGuard,
+ chunk_store: ProcessLockSharedGuard,
) -> Self {
Self {
previous_snapshot,
group: Some(group),
snapshot: Some(snapshot),
+ chunk_store: Some(chunk_store),
}
}
}
@@ -744,6 +748,7 @@ impl BackupEnvironment {
}
// drop previous snapshot lock
state.backup_lock_guards.previous_snapshot.take();
+ state.backup_lock_guards.chunk_store.take();
let stats = serde_json::to_value(state.backup_stat)?;
diff --git a/src/api2/backup/mod.rs b/src/api2/backup/mod.rs
index 246b0946d..e15d813a2 100644
--- a/src/api2/backup/mod.rs
+++ b/src/api2/backup/mod.rs
@@ -214,7 +214,9 @@ fn upgrade_to_backup_protocol(
// Drop them on successful backup finish or when dropping the env after cleanup in
// case of errors. The former is required for immediate subsequent backups (e.g.
// during a push sync) to be able to lock the group and snapshots.
- let backup_lock_guards = BackupLockGuards::new(last_guard, group_guard, snap_guard);
+ let chunk_store_guard = datastore.try_shared_chunk_store_lock()?;
+ let backup_lock_guards =
+ BackupLockGuards::new(last_guard, group_guard, snap_guard, chunk_store_guard);
let mut env = BackupEnvironment::new(
env_type,
--
2.47.3
* [pbs-devel] [PATCH proxmox-backup 3/3] index writers: remove chunk store lock
2025-10-01 11:19 [pbs-devel] [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic Fabian Grünbichler
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 2/3] backup env: keep a shared chunk store lock for duration of backup Fabian Grünbichler
@ 2025-10-01 11:19 ` Fabian Grünbichler
2025-10-02 7:29 ` [pbs-devel] superseded: [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
3 siblings, 0 replies; 6+ messages in thread
From: Fabian Grünbichler @ 2025-10-01 11:19 UTC (permalink / raw)
To: pbs-devel
the only user of these (the backup session) now holds its own chunk store lock
for the duration of the backup, like pull and tape already do as well.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
Notes:
we could extend the process locker to allow checking whether a certain guard id
is still valid as a safeguard. passing around references to the guard itself
from the backup session down here is very cumbersome..
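purely to illustrate what such a safeguard could look like (hypothetical
sketch with stand-in types, this is not part of the current process locker
API):

    use std::collections::HashSet;

    // hypothetical, simplified stand-in for the process locker state -
    // NOT the real proxmox-sys implementation
    struct Locker {
        // ids of shared guards that are still alive in this process
        live_shared_guards: HashSet<u64>,
    }

    impl Locker {
        // an index writer could verify that the backup session still holds
        // a shared lock without having the guard passed down to it
        fn shared_guard_is_valid(&self, guard_id: u64) -> bool {
            self.live_shared_guards.contains(&guard_id)
        }
    }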
pbs-datastore/src/datastore.rs | 2 ++
pbs-datastore/src/dynamic_index.rs | 6 +-----
pbs-datastore/src/fixed_index.rs | 6 +-----
3 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
index c821ac547..79c93fb28 100644
--- a/pbs-datastore/src/datastore.rs
+++ b/pbs-datastore/src/datastore.rs
@@ -533,6 +533,7 @@ impl DataStore {
self.inner.chunk_store.get_chunk_iterator()
}
+ // Requires obtaining a shared chunk store lock beforehand
pub fn create_fixed_writer<P: AsRef<Path>>(
&self,
filename: P,
@@ -560,6 +561,7 @@ impl DataStore {
Ok(index)
}
+ // Requires obtaining a shared chunk store lock beforehand
pub fn create_dynamic_writer<P: AsRef<Path>>(
&self,
filename: P,
diff --git a/pbs-datastore/src/dynamic_index.rs b/pbs-datastore/src/dynamic_index.rs
index 83e13b311..ff6c36782 100644
--- a/pbs-datastore/src/dynamic_index.rs
+++ b/pbs-datastore/src/dynamic_index.rs
@@ -11,7 +11,6 @@ use anyhow::{bail, format_err, Error};
use proxmox_io::ReadExt;
use proxmox_sys::mmap::Mmap;
-use proxmox_sys::process_locker::ProcessLockSharedGuard;
use proxmox_uuid::Uuid;
use pxar::accessor::{MaybeReady, ReadAt, ReadAtOperation};
@@ -277,7 +276,6 @@ impl IndexFile for DynamicIndexReader {
/// Create dynamic index files (`.dixd`)
pub struct DynamicIndexWriter {
store: Arc<ChunkStore>,
- _lock: ProcessLockSharedGuard,
writer: BufWriter<File>,
closed: bool,
filename: PathBuf,
@@ -294,9 +292,8 @@ impl Drop for DynamicIndexWriter {
}
impl DynamicIndexWriter {
+ // Requires obtaining a shared chunk store lock beforehand
pub fn create(store: Arc<ChunkStore>, path: &Path) -> Result<Self, Error> {
- let shared_lock = store.try_shared_lock()?;
-
let full_path = store.relative_path(path);
let mut tmp_path = full_path.clone();
tmp_path.set_extension("tmp_didx");
@@ -325,7 +322,6 @@ impl DynamicIndexWriter {
Ok(Self {
store,
- _lock: shared_lock,
writer,
closed: false,
filename: full_path,
diff --git a/pbs-datastore/src/fixed_index.rs b/pbs-datastore/src/fixed_index.rs
index 8d9173e86..0f289543f 100644
--- a/pbs-datastore/src/fixed_index.rs
+++ b/pbs-datastore/src/fixed_index.rs
@@ -9,7 +9,6 @@ use std::sync::Arc;
use anyhow::{bail, format_err, Error};
use proxmox_io::ReadExt;
-use proxmox_sys::process_locker::ProcessLockSharedGuard;
use proxmox_uuid::Uuid;
use crate::chunk_stat::ChunkStat;
@@ -218,7 +217,6 @@ impl IndexFile for FixedIndexReader {
pub struct FixedIndexWriter {
store: Arc<ChunkStore>,
file: File,
- _lock: ProcessLockSharedGuard,
filename: PathBuf,
tmp_filename: PathBuf,
chunk_size: usize,
@@ -243,14 +241,13 @@ impl Drop for FixedIndexWriter {
impl FixedIndexWriter {
#[allow(clippy::cast_ptr_alignment)]
+ // Requires obtaining a shared chunk store lock beforehand
pub fn create(
store: Arc<ChunkStore>,
path: &Path,
size: usize,
chunk_size: usize,
) -> Result<Self, Error> {
- let shared_lock = store.try_shared_lock()?;
-
let full_path = store.relative_path(path);
let mut tmp_path = full_path.clone();
tmp_path.set_extension("tmp_fidx");
@@ -307,7 +304,6 @@ impl FixedIndexWriter {
Ok(Self {
store,
file,
- _lock: shared_lock,
filename: full_path,
tmp_filename: tmp_path,
chunk_size,
--
2.47.3
* Re: [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 1/3] GC: rework locking logic Fabian Grünbichler
@ 2025-10-01 13:11 ` Christian Ebner
0 siblings, 0 replies; 6+ messages in thread
From: Christian Ebner @ 2025-10-01 13:11 UTC (permalink / raw)
To: Proxmox Backup Server development discussion, Fabian Grünbichler
one small nit inline
On 10/1/25 1:19 PM, Fabian Grünbichler wrote:
> to avoid holding a std Mutex for the whole duration of GC, since GC now calls
> into async code via S3.
>
> instead, store the guard returned by the process locker inside the mutex, and
> query its presence to determine whether a GC is currently running.
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> [...]
> - if let Ok(ref mut _mutex) = self.inner.gc_mutex.try_lock() {
> - // avoids that we run GC if an old daemon process has still a
> - // running backup writer, which is not save as we have no "oldest
> - // writer" information and thus no safe atime cutoff
> - let _exclusive_lock = self.inner.chunk_store.try_exclusive_lock()?;
> [...]
> + // avoids that we run GC if an old daemon process has still a
> + // running backup writer, which is not save as we have no "oldest
> + // writer" information and thus no safe atime cutoff
> + if self.garbage_collection_running() {
> bail!("Start GC failed - (already running/locked)");
> }
nit: the ordering of Mutex locking and getting the exclusive process
lock is now reversed. Not an issue, but maybe we should stick to the
current order here. That would also make the diff smaller, as you already
mentioned in an off-list discussion.
> + if let Ok(exclusive_lock) = self.inner.chunk_store.try_exclusive_lock() {
> + // keep the exclusive lock around for the duration of GC
> + let mut lock = self.inner.gc_mutex.lock().unwrap();
> + if lock.is_some() {
> + bail!("Obtained exclusive lock on datastore twice - this should never happen!");
> + }
> + *lock = Some(exclusive_lock);
> + drop(lock);
>
> + let res = self.garbage_collection_impl(worker, upid);
> +
> + // and now drop it again
> + let mut lock = self.inner.gc_mutex.lock().unwrap();
> + if lock.take().is_none() {
> + warn!("Lost exclusive datastore lock during GC - this should never happen!");
> + }
> +
> + res
> + } else {
> + bail!("Start GC failed - already running or backup writer in old process holds lock");
> + }
> + }
> [...]
* [pbs-devel] superseded: [PATCH proxmox-backup 0/3] rework GC-related locks
2025-10-01 11:19 [pbs-devel] [PATCH proxmox-backup 0/3] rework GC-related locks Fabian Grünbichler
` (2 preceding siblings ...)
2025-10-01 11:19 ` [pbs-devel] [PATCH proxmox-backup 3/3] index writers: remove chunk store lock Fabian Grünbichler
@ 2025-10-02 7:29 ` Fabian Grünbichler
3 siblings, 0 replies; 6+ messages in thread
From: Fabian Grünbichler @ 2025-10-02 7:29 UTC (permalink / raw)
To: Proxmox Backup Server development discussion
https://lore.proxmox.com/pbs-devel/20251002072843.88042-1-f.gruenbichler@proxmox.com/T/#t
On October 1, 2025 1:19 pm, Fabian Grünbichler wrote:
> this series reworks some of the locking behaviour surrounding GC.
> [...]