From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 06 May 2026 08:30:23 +0200
From: Fabian Grünbichler
Subject: Re: [PATCH proxmox-backup 4/4] api/datastore: use maintenance-mode lock to protect against changes
To: Christian Ebner , pbs-devel@lists.proxmox.com
References: <20260505081137.227901-1-c.ebner@proxmox.com> <20260505081137.227901-5-c.ebner@proxmox.com> <1777980053.m1uthlu73r.astroid@yuna.none> <189d0714-90c1-4607-95d0-3ae74febef59@proxmox.com>
In-Reply-To: <189d0714-90c1-4607-95d0-3ae74febef59@proxmox.com>
Message-Id: <1778048819.s4fbqpt0x2.astroid@yuna.none>
List-Id: Proxmox Backup Server development discussion

On May 5, 2026 3:02 pm, Christian Ebner wrote:
> On 5/5/26 2:12 PM, Fabian Grünbichler wrote:
>> On May 5, 2026 10:11 am, Christian Ebner wrote:
> [..]
>>> diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
>>> index 48acba4c8..823f3bf2f 100644
>>> --- a/pbs-datastore/src/lib.rs
>>> +++ b/pbs-datastore/src/lib.rs
>>> @@ -159,8 +159,9 @@
>>>
>>>  use std::os::unix::io::AsRawFd;
>>>  use std::path::Path;
>>> +use std::time::Duration;
>>>
>>> -use anyhow::{bail, Error};
>>> +use anyhow::{bail, Context, Error};
>>>
>>>  use pbs_config::BackupLockGuard;
>>>
>>> @@ -271,3 +272,11 @@ where
>>>
>>>      Ok(lock)
>>>  }
>>> +
>>> +/// Acquire an exclusive lock for the datastore's maintenance-mode
>>> +pub fn maintenance_mode_lock(store: &str) -> Result<BackupLockGuard, Error> {
>>> +    lock_helper(store, Path::new("maintenance-mode.lck"), |p| {
>>> +        pbs_config::open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
>>
>> this might warrant a comment: usually we open config-related locks with
>> a timeout to prevent flaky locking in case of concurrency, since waiting
>> a few seconds is normally fine.
>>
>> but here, we must always first acquire the datastore config lock (should
>> we encode that in the signature??) anyway, which makes the 0
>> timeout/non-blocking behaviour okay, if I understand it correctly?
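(for illustration only: a plain-std sketch of what "encode that in the signature" could look like, with hypothetical stand-in types instead of the real pbs_config guards, so the borrow checker enforces the lock ordering rather than a convention)

```rust
/// Hypothetical stand-in for the guard returned by locking datastore.cfg.
pub struct ConfigLockGuard;

/// Hypothetical guard for the per-datastore maintenance-mode lock.
pub struct MaintenanceLockGuard {
    store: String,
}

/// Taking `&ConfigLockGuard` means a caller can only reach this function
/// while it still holds the datastore config lock, which is what makes
/// the non-blocking (zero-timeout) acquisition below safe.
pub fn maintenance_mode_lock(
    _config_lock: &ConfigLockGuard,
    store: &str,
) -> Result<MaintenanceLockGuard, String> {
    // the real code would call pbs_config::open_backup_lockfile with a
    // zero timeout here; we just construct the guard for illustration
    Ok(MaintenanceLockGuard {
        store: store.to_string(),
    })
}

fn main() {
    let config_lock = ConfigLockGuard;
    // without a live ConfigLockGuard this call simply does not compile
    let guard = maintenance_mode_lock(&config_lock, "store1").unwrap();
    println!("acquired maintenance-mode lock for {}", guard.store);
}
```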
>
> Right, it makes sense to pass in the config lock to "prove" that it was
> acquired first.
>
> But it could make sense to add the short delay here nevertheless. After
> all, another operation might have dropped the config lock already, while
> still holding onto the maintenance-mode lock. OTOH, the operations which
> do hold it for longer will probably outlive the timeout, so I'm a bit
> torn, as I'm not too sure there is much benefit.

but the only time that happens is if the operation holding *only* the
maintenance-mode lock is doing a long-running maintenance operation on
that datastore (the code modified below). and we don't want a long
timeout, because that would mean holding up the much more important
global datastore.cfg lock (which we must be holding to reach this fn).

I'd keep this non-blocking, but add a comment explaining *why*.

>>> +            .context("unable to acquire exclusive datastore's maintenance-mode lock")
>>> +    })
>>> +}
>>> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
>>> index 1e0a78f3a..883091e6f 100644
>>> --- a/src/api2/admin/datastore.rs
>>> +++ b/src/api2/admin/datastore.rs
>>> @@ -61,8 +61,8 @@ use pbs_datastore::index::IndexFile;
>>>  use pbs_datastore::manifest::BackupManifest;
>>>  use pbs_datastore::prune::compute_prune_info;
>>>  use pbs_datastore::{
>>> -    check_backup_owner, ensure_datastore_is_mounted, task_tracking, BackupDir, DataStore,
>>> -    LocalChunkReader, StoreProgress,
>>> +    check_backup_owner, ensure_datastore_is_mounted, maintenance_mode_lock, task_tracking,
>>> +    BackupDir, DataStore, LocalChunkReader, StoreProgress,
>>>  };
>>>  use pbs_tools::json::required_string_param;
>>>  use proxmox_rest_server::{formatter, worker_is_active, WorkerTask};
>>> @@ -2705,9 +2705,13 @@ fn expect_maintenance_type(
>>>  }
>>>
>>>  fn unset_maintenance(
>>> -    _lock: pbs_config::BackupLockGuard,
>>> +    lock: Option<pbs_config::BackupLockGuard>,
>>
>> I think this should take Option<(lock, lock)>, and
>> expect_maintenance_type should acquire
and return both locks, but it
>> needs to be adapted further, because atm it is broken:
>>
>> 1. start s3-refresh while active operations are running
>> 2. s3-refresh will be in a loop waiting for operations to disappear
>> 3. roughly every second, it will call expect_maintenance_type
>> 4. if somebody else is holding the datastore config lock for more than
>>    ten seconds (let's say you started a re-use-existing-datastore
>>    creation with S3 ;)), expect_maintenance_type will error out!
>
> Yeah, point 4 is broken in any case... :/ But will fix according to the
> suggestions above.

it could also happen because of lock contention, or something else
blocking for more than a few seconds, or a combination of those factors.
the datastore creation was just the case I stumbled upon, and it seemed
like an obvious way to trigger it :)
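(again just a sketch, not a patch: since the wait loop already retries about once a second, one option is to make each individual lock attempt short and non-fatal, retrying until an overall deadline instead of erroring out on the first contended attempt. `retry_until` is a hypothetical helper name, plain std only:)

```rust
use std::time::{Duration, Instant};

/// Retry `attempt` until it succeeds or `deadline` has elapsed, sleeping
/// `backoff` between tries. A briefly contended lock (e.g. someone holding
/// datastore.cfg for a few seconds) then just delays the loop instead of
/// aborting it; only contention outlasting the deadline becomes an error.
fn retry_until<T, E>(
    deadline: Duration,
    backoff: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    loop {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) if start.elapsed() >= deadline => return Err(e),
            Err(_) => std::thread::sleep(backoff),
        }
    }
}

fn main() {
    // simulate a lock that only becomes free on the third attempt
    let mut tries = 0;
    let result = retry_until(Duration::from_secs(1), Duration::from_millis(10), || {
        tries += 1;
        if tries < 3 {
            Err("lock contended")
        } else {
            Ok(tries)
        }
    });
    assert_eq!(result, Ok(3));
    println!("acquired after {tries} attempts");
}
```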