From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 06 May 2026 08:30:23 +0200
From: Fabian Grünbichler
Subject: Re: [PATCH proxmox-backup 4/4] api/datastore: use maintenance-mode lock to protect against changes
To: Christian Ebner , pbs-devel@lists.proxmox.com
References: <20260505081137.227901-1-c.ebner@proxmox.com> <20260505081137.227901-5-c.ebner@proxmox.com> <1777980053.m1uthlu73r.astroid@yuna.none> <189d0714-90c1-4607-95d0-3ae74febef59@proxmox.com>
In-Reply-To: <189d0714-90c1-4607-95d0-3ae74febef59@proxmox.com>
Message-Id: <1778048819.s4fbqpt0x2.astroid@yuna.none>
List-Id: Proxmox Backup Server development discussion

On May 5, 2026 3:02 pm, Christian Ebner wrote:
> On 5/5/26 2:12 PM, Fabian Grünbichler wrote:
>> On May 5, 2026 10:11 am, Christian Ebner wrote:
> [..]
>>> diff --git a/pbs-datastore/src/lib.rs b/pbs-datastore/src/lib.rs
>>> index 48acba4c8..823f3bf2f 100644
>>> --- a/pbs-datastore/src/lib.rs
>>> +++ b/pbs-datastore/src/lib.rs
>>> @@ -159,8 +159,9 @@
>>>
>>>  use std::os::unix::io::AsRawFd;
>>>  use std::path::Path;
>>> +use std::time::Duration;
>>>
>>> -use anyhow::{bail, Error};
>>> +use anyhow::{bail, Context, Error};
>>>
>>>  use pbs_config::BackupLockGuard;
>>>
>>> @@ -271,3 +272,11 @@ where
>>>
>>>      Ok(lock)
>>>  }
>>> +
>>> +/// Acquire an exclusive lock for the datastore's maintenance-mode
>>> +pub fn maintenance_mode_lock(store: &str) -> Result<BackupLockGuard, Error> {
>>> +    lock_helper(store, Path::new("maintenance-mode.lck"), |p| {
>>> +        pbs_config::open_backup_lockfile(p, Some(Duration::from_secs(0)), true)
>>
>> this might warrant a comment: usually we open config-related locks with
>> a timeout to prevent flaky locking in case of concurrency, since waiting
>> a few seconds is normally fine.
>>
>> but here, we must always first acquire the datastore config lock (should
>> we encode that in the signature??) anyway, which makes the 0
>> timeout/non-blocking behaviour okay, if I understand it correctly?
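(for illustration only: a plain-std sketch of what "encode that in the signature" could look like, with hypothetical stand-in types instead of the real pbs_config guards, so the borrow checker enforces the lock ordering rather than a convention)

```rust
/// Hypothetical stand-in for the guard returned by locking datastore.cfg.
pub struct ConfigLockGuard;

/// Hypothetical guard for the per-datastore maintenance-mode lock.
pub struct MaintenanceLockGuard {
    store: String,
}

/// Taking `&ConfigLockGuard` means a caller can only reach this function
/// while it still holds the datastore config lock, which is what makes
/// the non-blocking (zero-timeout) acquisition below safe.
pub fn maintenance_mode_lock(
    _config_lock: &ConfigLockGuard,
    store: &str,
) -> Result<MaintenanceLockGuard, String> {
    // the real code would call pbs_config::open_backup_lockfile with a
    // zero timeout here; we just construct the guard for illustration
    Ok(MaintenanceLockGuard {
        store: store.to_string(),
    })
}

fn main() {
    let config_lock = ConfigLockGuard;
    // without a live ConfigLockGuard this call simply does not compile
    let guard = maintenance_mode_lock(&config_lock, "store1").unwrap();
    println!("acquired maintenance-mode lock for {}", guard.store);
}
```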
>
> Right, it makes sense to pass in the config lock to "prove" that it was
> acquired first.
>
> But it could make sense to add the short delay here nevertheless. After
> all, another operation might have dropped the config lock already, while
> still holding onto the maintenance-mode lock. OTOH, the operations which
> do hold it for longer will probably outlive the timeout, so I'm a bit
> torn, as I'm not too sure there is much benefit.

but the only time that happens is if the operation holding *only* the
maintenance-mode lock is doing a long-running maintenance operation on
that datastore (the code modified below). and we don't want a long
timeout, because that would mean holding up the much more important
global datastore.cfg lock (which we must be holding to reach this fn).

I'd keep this non-blocking, but add a comment explaining *why*.

>>> +            .context("unable to acquire exclusive datastore's maintenance-mode lock")
>>> +    })
>>> +}
>>> diff --git a/src/api2/admin/datastore.rs b/src/api2/admin/datastore.rs
>>> index 1e0a78f3a..883091e6f 100644
>>> --- a/src/api2/admin/datastore.rs
>>> +++ b/src/api2/admin/datastore.rs
>>> @@ -61,8 +61,8 @@ use pbs_datastore::index::IndexFile;
>>>  use pbs_datastore::manifest::BackupManifest;
>>>  use pbs_datastore::prune::compute_prune_info;
>>>  use pbs_datastore::{
>>> -    check_backup_owner, ensure_datastore_is_mounted, task_tracking, BackupDir, DataStore,
>>> -    LocalChunkReader, StoreProgress,
>>> +    check_backup_owner, ensure_datastore_is_mounted, maintenance_mode_lock, task_tracking,
>>> +    BackupDir, DataStore, LocalChunkReader, StoreProgress,
>>>  };
>>>  use pbs_tools::json::required_string_param;
>>>  use proxmox_rest_server::{formatter, worker_is_active, WorkerTask};
>>> @@ -2705,9 +2705,13 @@ fn expect_maintenance_type(
>>>  }
>>>
>>>  fn unset_maintenance(
>>> -    _lock: pbs_config::BackupLockGuard,
>>> +    lock: Option<pbs_config::BackupLockGuard>,
>>
>> I think this should take Option<(lock, lock)>, and
>> expect_maintenance_type should acquire
and return both locks, but it
>> needs to be adapted further, because atm it is broken:
>>
>> 1. start s3-refresh while active operations are running
>> 2. s3-refresh will be in a loop waiting for operations to disappear
>> 3. roughly every second, it will call expect_maintenance_type
>> 4. if somebody else is holding the datastore config lock for more than
>>    ten seconds (let's say you started a re-use-existing-datastore
>>    creation with S3 ;)), expect_maintenance_type will error out!
>
> Yeah, point 4 is broken in any case... :/ But will fix according to the
> suggestions above.

it could also happen because of lock contention, or something else
blocking for more than a few seconds, or a combination of those factors.
the datastore creation was just the case I stumbled upon, and it seemed
like an obvious way to trigger it :)
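(again just a sketch, not a patch: since the wait loop already retries about once a second, one option is to make each individual lock attempt short and non-fatal, retrying until an overall deadline instead of erroring out on the first contended attempt. `retry_until` is a hypothetical helper name, plain std only:)

```rust
use std::time::{Duration, Instant};

/// Retry `attempt` until it succeeds or `deadline` has elapsed, sleeping
/// `backoff` between tries. A briefly contended lock (e.g. someone holding
/// datastore.cfg for a few seconds) then just delays the loop instead of
/// aborting it; only contention outlasting the deadline becomes an error.
fn retry_until<T, E>(
    deadline: Duration,
    backoff: Duration,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    loop {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) if start.elapsed() >= deadline => return Err(e),
            Err(_) => std::thread::sleep(backoff),
        }
    }
}

fn main() {
    // simulate a lock that only becomes free on the third attempt
    let mut tries = 0;
    let result = retry_until(Duration::from_secs(1), Duration::from_millis(10), || {
        tries += 1;
        if tries < 3 {
            Err("lock contended")
        } else {
            Ok(tries)
        }
    });
    assert_eq!(result, Ok(3));
    println!("acquired after {tries} attempts");
}
```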