From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pbs-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 9A1811FF142
	for <inbox@lore.proxmox.com>; Fri, 05 Jun 2026 17:06:51 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id DF1F51D40C;
	Fri,  5 Jun 2026 17:06:50 +0200 (CEST)
Message-ID: <d035db1f-8344-454e-bf9f-90f5fb790c22@proxmox.com>
Date: Fri, 5 Jun 2026 17:06:44 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for
 per-chunk file lock cleanup
To: =?UTF-8?Q?Fabian_Gr=C3=BCnbichler?= <f.gruenbichler@proxmox.com>,
 pbs-devel@lists.proxmox.com
References: <20260604140919.97686-1-c.ebner@proxmox.com>
 <178066679899.993050.14588796821627975859@yuna.proxmox.com>
Content-Language: en-US, de-DE
From: Christian Ebner <c.ebner@proxmox.com>
In-Reply-To: <178066679899.993050.14588796821627975859@yuna.proxmox.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2
X-Bm-Transport-Timestamp: 1780671965600
X-SPAM-LEVEL: Spam detection results:  0
	AWL                     0.069 Adjusted score from AWL reputation of From:
 address
	BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
	DMARC_MISSING             0.1 Missing DMARC policy
	KAM_DMARC_STATUS         0.01 Test Rule for DKIM or SPF Failure with Strict
 Alignment
	SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
	SPF_PASS               -0.001 SPF: sender matches SPF record
Message-ID-Hash: DR5KYOGKMA7JM3ZFIKHISKKDXJXBGV4P
X-Message-ID-Hash: DR5KYOGKMA7JM3ZFIKHISKKDXJXBGV4P
X-MailFrom: c.ebner@proxmox.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; emergency; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox Backup Server development discussion
 <pbs-devel.lists.proxmox.com>
List-Help: <mailto:pbs-devel-request@lists.proxmox.com?subject=help>
List-Owner: <mailto:pbs-devel-owner@lists.proxmox.com>
List-Post: <mailto:pbs-devel@lists.proxmox.com>
List-Subscribe: <mailto:pbs-devel-join@lists.proxmox.com>
List-Unsubscribe: <mailto:pbs-devel-leave@lists.proxmox.com>

On 6/5/26 3:39 PM, Fabian Grünbichler wrote:
> Quoting Christian Ebner (2026-06-04 16:09:19)
>> Per-chunk file locks are located on a tmpfs but never cleaned up to
>> avoid TOCTOU race conditions. Therefore lock files can accumulate
>> over time, the memory required to store the inodes finally lead to
>> OOM conditions if the system is not rebooted for a long time or a
>> high number of different chunks is written to the s3 backed
>> datastore.
>>
>> To fix this, use the double stating strategy already implemented by
>> pbs_datastore::backup_info::lock_helper(), but adapt it so that the
>> lock file is cleaned up before unlocking. Since after file removal
>> the lock can be acquired by a different thread/process, the file lock
>> must also be dropped immediately without performing any other
>> critical operation. To assure this, a ChunkLockGuard is implemented
>> which removes the file and drops the file descriptor by implementing
>> the Drop trait.
>>
>> After each flock() call, which is performed as part of
>> proxmox_sys::fs::open_file_locked(), stating the file and comparing
>> locked files inode from the open file handle to the one currently
>> present on the filesystem is performed. By this possible races are
>> detected, resulting in missing or newly create lock files. In that
>> case locking must be retried.
>>
>> Note that this locking mechanism is not fair, the first caller might
>> end up being the last to actually acquire the lock. This is however
>> not problematic for the intended use case of per-chunk file locking,
>> with limited lock contention.
>>
>> Fixes: https://bugzilla.proxmox.com/show_bug.cgi?id=7670
>> Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
>> ---
>> Sending this as RFC in case there are ideas for a different solution
>> to the problem at hand which might be preferable.
> 
> to recap - when we originally introduced this, we checked for memory usage by
> querying tmpfs file system usage. the 1kb per file overhead in slab was missed.
> 
> one solution to this would be to move the locks back to a regular filesystem,
> and either
> 
> - clean them up when removing chunks (with corresponding retry logic
>    similar to this patch, but less frequent?)

 From the two options for this solutions, this one sounds preferable. 
Ideally the locking would be performed on the chunk file itself... But 
that has issues with non-local filesystems AFAIR, so not an option 
unfortunately.

But probably we want to go with moving to a filesystem backed lock-files 
AND reducing the granularity as suggested below.

> - clean them up by dropping the lock dir on reboot (e.g., via
>    systemd-tmpfiles.d), which would be the equivalent of the current tmpfs
>    approach but trading disk usage vs. memory usage, and probably slightly slower
>    lock allocation
> 
> for ext4, 100k empty lock files take up 2.2M (roughly 2% of the tmpfs slab
> usage), 500k take up 11.2M, for ZFS it's 56M for 100k (roughly half the slab
> usage), though in both cases we of course would need a hierarchy of prefix dirs
> again to keep access times okayish? for 23M chunk locks we'd still end up with
> (estimated) 5G of usage on ext4..
> 
> another approach would be to switch to a different kind of locking mechanism
> entirely (though with the cross-process and multi-threaded requirements we
> have, this might not be easy either ;)) or to reduce the lock granularity.

Yeah, reducing the lock granularity and moving to something with 
deterministic count could be a way out, and given the usage estimates 
from above, using the 4 hex-digit prefix used also for chunk store 
directory hierarchy might be a viable candidate for defining such 
lock-files.
> given that a mistake in the handling of retries below can cause dataloss, doing
> it for every lock/unlock pair sounds a bit dangerous. there's also the
> additional overhead if the lock is actually contended to account for - we need
> at least two loop iterations if it is, potentially a lot more?

This is however not so frequently happening though? But yes, there is 
overhead by the then unavoidable retry/retries.

> or we go back to square one, and revisit the whole interaction here to see if
> we can get rid of the per-chunk locks again in favor of something that scales
> with the number of pending uploads.. the last time we tried this turned out to
> be quite tricky though.
The major downside for all solutions is unfortunately that moving 
forward can only happen with the current locking mechanism still in 
place, so a transitional step for cleanup unavoidable. But at least it 
is fine to remove the per-chunk lock files once they are being held in 
the process instance implementing the new locking logic.

Thanks for your inputs!