public inbox for pbs-devel@lists.proxmox.com
From: Christian Ebner <c.ebner@proxmox.com>
To: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>,
	pbs-devel@lists.proxmox.com
Subject: Re: [PATCH proxmox-backup 4/4] api/datastore: use maintenance-mode lock to protect against changes
Date: Tue, 5 May 2026 15:02:29 +0200	[thread overview]
Message-ID: <189d0714-90c1-4607-95d0-3ae74febef59@proxmox.com> (raw)
In-Reply-To: <1777980053.m1uthlu73r.astroid@yuna.none>

On 5/5/26 2:12 PM, Fabian Grünbichler wrote:
> On May 5, 2026 10:11 am, Christian Ebner wrote:
>> Instead of locking the whole config, which causes also unrelated
>> datastores to not be available, use a per-datastore maintenance mode
>> lock to protect against changes during potentially longer running
>> operations such as s3-refresh or removable datastore mount/unmount.
>>
>> However keep the current semantic of first setting the maintenance
>> mode, allow to still opt out from the mode while waiting for active
>> operations to finish, and only then enforce the maintenance mode
>> during the s3 refresh or removable datastore mount operation.
>>
>> unset_maintenance() is adapted to either reacquire the datastore
>> config lock or use the optional lock guard.
>>
>> Fixes: https://forum.proxmox.com/threads/183244/
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>>   pbs-datastore/src/datastore.rs |  2 ++
>>   pbs-datastore/src/lib.rs       | 11 ++++++++++-
>>   src/api2/admin/datastore.rs    | 26 ++++++++++++++++++++------
>>   src/api2/config/datastore.rs   |  6 +++++-
>>   4 files changed, 37 insertions(+), 8 deletions(-)
>>
>> diff --git a/pbs-datastore/src/datastore.rs b/pbs-datastore/src/datastore.rs
>> index 34efcd398..aa372cd73 100644
>> --- a/pbs-datastore/src/datastore.rs
>> +++ b/pbs-datastore/src/datastore.rs
>> @@ -3036,6 +3036,7 @@ impl DataStore {
>>           let (mut config, _digest) = pbs_config::datastore::config()?;
>>           let mut datastore_config: DataStoreConfig = config.lookup("datastore", name)?;
>>   
>> +        let maintenance_mode_lock = crate::maintenance_mode_lock(name)?;
>>           datastore_config.set_maintenance_mode(Some(MaintenanceMode {
> 
> meh that this is in pbs-api-types, else we could require it being passed
> the lock to ensure we never miss call sites..

Yeah, unfortunately. That is the reason I kept it independent.
> 
> right now, datastore creation doesn't take the maintenance_mode_lock
> despite setting it (in the reuse-existing S3 case), and it suffers from
> a similar problem - it holds the lock for the whole duration of the
> creation, but that also entails S3 requests that are allowed to take up
> to 30m(!) - and other expensive things. have we maybe found our second
> use case for this lock? if we hold it during creation of datastores, we
> could drop the config lock before re-acquiring it for persisting the
> config entry? or we need to move this into a two-phase creation, first
> add the entry with maintenance mode "creating", then while only holding
> the maintenance_mode_lock do all the prep work, and then at the end
> re-acquire the config lock and unset the maintenance mode? this would
> have considerable overlap with how we handle s3 refreshing..
> 
> IMHO moving to a two-step creation would be cleanest, since we've also
> had reports in the past about a long chunk store initialization being
> cumbersome (since it effectively blocks all modifications of the config
> while it is going on).

Good catch, the S3 requests there and the creation itself indeed warrant 
the suggested two-step approach, with a corresponding maintenance mode 
and holding its lock.

> 
>>               ty: MaintenanceType::Delete,
>>               message: None,
>> @@ -3043,6 +3044,7 @@ impl DataStore {
>>   
>>           config.set_data(name, "datastore", &datastore_config)?;
>>           pbs_config::datastore::save_config(&config)?;
>> +        drop(maintenance_mode_lock);
>>           drop(config_lock);
>>   
>>           let (operations, _lock) = task_tracking::get_active_operations_locked(name)?;
>> diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
>> index 48acba4c8..823f3bf2f 100644
>> --- a/pbs-datastore/src/lib.rs
>> +++ b/pbs-datastore/src/lib.rs
>> @@ -159,8 +159,9 @@
>>   
>>   use std::os::unix::io::AsRawFd;
>>   use std::path::Path;
>> +use std::time::Duration;
>>   
>> -use anyhow::{bail, Error};
>> +use anyhow::{bail, Context, Error};
>>   
>>   use pbs_config::BackupLockGuard;
>>   
>> @@ -271,3 +272,11 @@ where
>>   
>>       Ok(lock)
>>   }
>> +
>> +/// Acquire an exclusive lock for the datastore's maintenance-mode
>> +pub fn maintenance_mode_lock(store: &str) -> Result<BackupLockGuard, Error> {
>> +    lock_helper(store, Path::new("maintenance-mode.lck"), |p| {
>> +        pbs_config::open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
> 
> this might warrant a comment, usually we open config related locks with
> a timeout to prevent flaky locking in case of concurrency, waiting a few
> seconds is normally fine.
> 
> but here, we must always first acquire the datastore config lock (should
> we encode that in the signature??) anyway, which makes the 0
> timeout/non-blocking behaviour okay, if I understand it correctly?

Right, it makes sense to pass in the config lock to "prove" that it was 
acquired first.

But it could make sense to add the short delay here nevertheless. After 
all, another operation might have dropped the config lock already while 
still holding onto the maintenance-mode lock. OTOH, the operations which 
do hold it for longer will probably outlive the timeout anyway, so I'm a 
bit torn, as I'm not too sure there is much benefit.

> 
>> +            .context("unable to acquire exclusive datastore's maintenance-mode lock")
>> +    })
>> +}
>> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
>> index 1e0a78f3a..883091e6f 100644
>> --- a/src/api2/admin/datastore.rs
>> +++ b/src/api2/admin/datastore.rs
>> @@ -61,8 +61,8 @@ use pbs_datastore::index::IndexFile;
>>   use pbs_datastore::manifest::BackupManifest;
>>   use pbs_datastore::prune::compute_prune_info;
>>   use pbs_datastore::{
>> -    check_backup_owner, ensure_datastore_is_mounted, task_tracking, BackupDir, DataStore,
>> -    LocalChunkReader, StoreProgress,
>> +    check_backup_owner, ensure_datastore_is_mounted, maintenance_mode_lock, task_tracking,
>> +    BackupDir, DataStore, LocalChunkReader, StoreProgress,
>>   };
>>   use pbs_tools::json::required_string_param;
>>   use proxmox_rest_server::{formatter, worker_is_active, WorkerTask};
>> @@ -2705,9 +2705,13 @@ fn expect_maintenance_type(
>>   }
>>   
>>   fn unset_maintenance(
>> -    _lock: pbs_config::BackupLockGuard,
>> +    lock: Option<pbs_config::BackupLockGuard>,
> 
> I think this should take Option<(lock, lock)>, and
> expect_maintenance_type should acquire and return both locks, but it
> needs to be adapted further, because atm it is broken:
> 
> 1. start s3-refresh while active operations are running
> 2. s3-refresh will be in loop waiting for operations to disappear
> 3. roughly every second, it will call expect_maintenance_type
> 4. if somebody else is holding the datastore config lock for more than
>     ten seconds (let's say you started a re-use existing datastore
>     creation with S3 ;)), expect_maintenance_type will error out!

Yeah, point 4 is broken in any case... :/ But will fix according to the 
suggestions above.

> 
> we currently call expect_maintenance_type in three places:
> 1. to check whether the maintenance mode changed under us while waiting
> for operations to stop (no need for locking here)
> 2. to check whether the maintenance mode changed under us if we abort
> waiting - we don't really need to lock here if we move this call to
> unset_maintenance
> 3. to do a final check before we do the actual work - this one uses the
> locks then, but we could just as well lock first, then check?
> 
> what do you think about the following (untested!):
> 
> ----8<----
> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
> index 883091e6f..1acb0baf3 100644
> --- a/src/api2/admin/datastore.rs
> +++ b/src/api2/admin/datastore.rs
> @@ -49,7 +49,7 @@ use pbs_api_types::{
>       VERIFY_JOB_READ_THREADS_SCHEMA, VERIFY_JOB_VERIFY_THREADS_SCHEMA,
>   };
>   use pbs_client::pxar::{create_tar, create_zip};
> -use pbs_config::CachedUserInfo;
> +use pbs_config::{BackupLockGuard, CachedUserInfo};
>   use pbs_datastore::backup_info::BackupInfo;
>   use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>   use pbs_datastore::catalog::{ArchiveEntry, CatalogReader};
> @@ -2686,11 +2686,7 @@ pub fn mount(store: String, rpcenv: &mut dyn RpcEnvironment) -> Result<Value, Er
>       Ok(json!(upid))
>   }
>   
> -fn expect_maintenance_type(
> -    store: &str,
> -    maintenance_type: MaintenanceType,
> -) -> Result<(pbs_config::BackupLockGuard, DataStoreConfig), Error> {
> -    let lock = pbs_config::datastore::lock_config()?;
> +fn expect_maintenance_type(store: &str, maintenance_type: MaintenanceType) -> Result<(), Error> {
>       let (section_config, _digest) = pbs_config::datastore::config()?;
>       let store_config: DataStoreConfig = section_config.lookup("datastore", store)?;
>   
> @@ -2701,20 +2697,24 @@ fn expect_maintenance_type(
>           bail!("maintenance mode is not '{maintenance_type}'");
>       }
>   
> -    Ok((lock, store_config))
> +    Ok(())
>   }
>   
>   fn unset_maintenance(
> -    lock: Option<pbs_config::BackupLockGuard>,
> -    mut config: DataStoreConfig,
> +    maintenance_lock: Option<BackupLockGuard>,
> +    store: &str,
> +    expected: MaintenanceType,
>   ) -> Result<(), Error> {
> -    let _lock = match lock {
> +    let _lock = pbs_config::datastore::lock_config()?;
> +    let _maintenance_mode_lock = match maintenance_lock {
>           Some(lock) => lock,
> -        None => pbs_config::datastore::lock_config()?,
> +        None => maintenance_mode_lock(store)?,
>       };
> +    expect_maintenance_type(store, expected)?;
>       let (mut section_config, _digest) = pbs_config::datastore::config()?;
> -    config.maintenance_mode = None;
> -    section_config.set_data(&config.name, "datastore", &config)?;
> +    let mut store_config: DataStoreConfig = section_config.lookup("datastore", store)?;
> +    store_config.maintenance_mode = None;
> +    section_config.set_data(store, "datastore", &store_config)?;
>       pbs_config::datastore::save_config(&section_config)?;
>       Ok(())
>   }
> @@ -2914,23 +2914,20 @@ fn run_maintenance_locked(
>       }
>   
>       if aborted || worker.abort_requested() {
> -        let _ = expect_maintenance_type(store, maintenance_expected)
> -            .inspect_err(|e| warn!("maintenance mode was not as expected: {e}"))
> -            .and_then(|(lock, config)| {
> -                unset_maintenance(Some(lock), config)
> -                    .inspect_err(|e| warn!("could not reset maintenance mode: {e}"))
> -            });
> +        let _ = unset_maintenance(None, store, maintenance_expected)
> +            .inspect_err(|e| warn!("could not reset maintenance mode: {e}"));
>           bail!("aborted, due to user request");
>       }
>   
> -    let (lock, config) = expect_maintenance_type(store, maintenance_expected)?;
> -    let _maintenance_mode_lock = maintenance_mode_lock(&config.name)?;
> +    let lock = pbs_config::datastore::lock_config()?;
> +    let maintenance_mode_lock = maintenance_mode_lock(store)?;
> +    expect_maintenance_type(store, maintenance_expected)?;
>       // avoid blocking config access/updates, but keep maintenance-mode locked
>       // until done.
>       drop(lock);
>   
>       callback()?;
> -    unset_maintenance(None, config)
> +    unset_maintenance(Some(maintenance_mode_lock), store, maintenance_expected)
>           .map_err(|e| format_err!("could not reset maintenance mode: {e}"))?;
>   
>       Ok(())
> ---->8----
> 
> the tiny bit of double work of reparsing the config is IMHO not that
> bad, since this is not on any hot path..

Hmm, yes, can incorporate this; moving the locking and the expected-mode 
check into the unset helper indeed makes sense.

> 
> 
> 
> 
>>       mut config: DataStoreConfig,
>>   ) -> Result<(), Error> {
>> +    let _lock = match lock {
>> +        Some(lock) => lock,
>> +        None => pbs_config::datastore::lock_config()?,
>> +    };
>>       let (mut section_config, _digest) = pbs_config::datastore::config()?;
>>       config.maintenance_mode = None;
>>       section_config.set_data(&config.name, "datastore", &config)?;
>> @@ -2763,6 +2767,7 @@ async fn do_unmount(store: String, auth_id: Authid, to_stdout: bool) -> Result<V
>>       let (mut section_config, _digest) = pbs_config::datastore::config()?;
>>       let mut datastore: DataStoreConfig = section_config.lookup("datastore", &store)?;
>>   
>> +    let maintenance_mode_lock = maintenance_mode_lock(&store)?;
>>       datastore.set_maintenance_mode(Some(MaintenanceMode {
>>           ty: MaintenanceType::Unmount,
>>           message: None,
>> @@ -2770,6 +2775,7 @@ async fn do_unmount(store: String, auth_id: Authid, to_stdout: bool) -> Result<V
>>       section_config.set_data(&store, "datastore", &datastore)?;
>>       pbs_config::datastore::save_config(&section_config)?;
>>   
>> +    drop(maintenance_mode_lock);
>>       drop(_lock);
>>   
>>       if let Ok(proxy_pid) = proxmox_rest_server::read_pid(pbs_buildcfg::PROXMOX_BACKUP_PROXY_PID_FN)
>> @@ -2845,6 +2851,7 @@ pub fn s3_refresh(store: String, rpcenv: &mut dyn RpcEnvironment) -> Result<Valu
>>       let (mut section_config, _digest) = pbs_config::datastore::config()?;
>>       let mut datastore_config: DataStoreConfig = section_config.lookup("datastore", &store)?;
>>   
>> +    let maintenance_mode_lock = maintenance_mode_lock(&store)?;
>>       datastore_config.set_maintenance_mode(Some(MaintenanceMode {
>>           ty: MaintenanceType::S3Refresh,
>>           message: None,
>> @@ -2852,6 +2859,7 @@ pub fn s3_refresh(store: String, rpcenv: &mut dyn RpcEnvironment) -> Result<Valu
>>   
>>       section_config.set_data(&store, "datastore", &datastore_config)?;
>>       pbs_config::datastore::save_config(&section_config)?;
>> +    drop(maintenance_mode_lock);
>>       drop(_lock);
>>   
>>       let upid = WorkerTask::new_thread(
>> @@ -2875,7 +2883,8 @@ pub(crate) fn do_s3_refresh(store: &str, worker: &dyn WorkerTaskContext) -> Resu
>>   }
>>   
>>   /// Wait for no more active operations on the given datastore and run the provided callback in
>> -/// a datastore locked context, protecting against maintenance mode changes.
>> +/// a maintenance mode locked context, protecting against leaving the mode for the functions
>> +/// lifetime.
>>   fn run_maintenance_locked(
>>       store: &str,
>>       maintenance_expected: MaintenanceType,
>> @@ -2908,15 +2917,20 @@ fn run_maintenance_locked(
>>           let _ = expect_maintenance_type(store, maintenance_expected)
>>               .inspect_err(|e| warn!("maintenance mode was not as expected: {e}"))
>>               .and_then(|(lock, config)| {
>> -                unset_maintenance(lock, config)
>> +                unset_maintenance(Some(lock), config)
>>                       .inspect_err(|e| warn!("could not reset maintenance mode: {e}"))
>>               });
>>           bail!("aborted, due to user request");
>>       }
>>   
>>       let (lock, config) = expect_maintenance_type(store, maintenance_expected)?;
>> +    let _maintenance_mode_lock = maintenance_mode_lock(&config.name)?;
>> +    // avoid blocking config access/updates, but keep maintenance-mode locked
>> +    // until done.
>> +    drop(lock);
>> +
>>       callback()?;
>> -    unset_maintenance(lock, config)
>> +    unset_maintenance(None, config)
>>           .map_err(|e| format_err!("could not reset maintenance mode: {e}"))?;
>>   
>>       Ok(())
>> diff --git a/src/api2/config/datastore.rs b/src/api2/config/datastore.rs
>> index 16e85a636..b07e7fc7e 100644
>> --- a/src/api2/config/datastore.rs
>> +++ b/src/api2/config/datastore.rs
>> @@ -31,7 +31,9 @@ use crate::api2::config::tape_backup_job::{delete_tape_backup_job, list_tape_bac
>>   use crate::api2::config::verify::delete_verification_job;
>>   use pbs_config::CachedUserInfo;
>>   
>> -use pbs_datastore::{get_datastore_mount_status, DataStore, S3_DATASTORE_IN_USE_MARKER};
>> +use pbs_datastore::{
>> +    get_datastore_mount_status, maintenance_mode_lock, DataStore, S3_DATASTORE_IN_USE_MARKER,
>> +};
>>   use proxmox_rest_server::WorkerTask;
>>   use proxmox_s3_client::{S3ObjectKey, S3_HTTP_REQUEST_TIMEOUT};
>>   
>> @@ -540,6 +542,7 @@ pub fn update_datastore(
>>                       data.tuning = None;
>>                   }
>>                   DeletableProperty::MaintenanceMode => {
>> +                    let _maintenance_mode_lock = maintenance_mode_lock(&name)?;
>>                       data.set_maintenance_mode(None)?;
>>                   }
>>                   DeletableProperty::NotificationThresholds => {
>> @@ -638,6 +641,7 @@ pub fn update_datastore(
>>               )?),
>>               None => None,
>>           };
>> +        let _maintenance_mode_lock = maintenance_mode_lock(&name)?;
>>           data.set_maintenance_mode(maintenance_mode)?;
>>       }
>>   
>> -- 
>> 2.47.3
Thread overview: 9+ messages
2026-05-05  8:11 [PATCH proxmox-backup 0/4] don't hold datastore config lock during s3-refresh Christian Ebner
2026-05-05  8:11 ` [PATCH proxmox-backup 1/4] pbs-config: reorganize crate use statements Christian Ebner
2026-05-05 12:15   ` applied: " Fabian Grünbichler
2026-05-05  8:11 ` [PATCH proxmox-backup 2/4] datastore: move lock files base path constant to central location Christian Ebner
2026-05-05  8:11 ` [PATCH proxmox-backup 3/4] datastore: move file lock helper to centralized place Christian Ebner
2026-05-05 12:15   ` Fabian Grünbichler
2026-05-05  8:11 ` [PATCH proxmox-backup 4/4] api/datastore: use maintenance-mode lock to protect against changes Christian Ebner
2026-05-05 12:14   ` Fabian Grünbichler
2026-05-05 13:02     ` Christian Ebner [this message]
