From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id 4D6851FF146 for ; Tue, 09 Jun 2026 09:07:22 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 1C5206790; Tue, 9 Jun 2026 09:07:22 +0200 (CEST) Date: Tue, 09 Jun 2026 09:06:44 +0200 From: Fabian =?iso-8859-1?q?Gr=FCnbichler?= Subject: Re: [RFC PATCH proxmox-backup] fix #7670: datastore: s3: allow for per-chunk file lock cleanup To: Christian Ebner , pbs-devel@lists.proxmox.com References: <20260604140919.97686-1-c.ebner@proxmox.com> <178066679899.993050.14588796821627975859@yuna.proxmox.com> In-Reply-To: MIME-Version: 1.0 User-Agent: astroid/0.17.0 (https://github.com/astroidmail/astroid) Message-Id: <1780988475.h44gt9timk.astroid@yuna.none> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1780988763126 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.054 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: YRGRHBF4F75JUJK4LYZA23FKA3NMO5ZS X-Message-ID-Hash: YRGRHBF4F75JUJK4LYZA23FKA3NMO5ZS X-MailFrom: f.gruenbichler@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox Backup Server development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On June 5, 2026 5:17 pm, Christian Ebner wrote: > On 6/5/26 5:06 PM, Christian Ebner wrote: >> On 6/5/26 3:39 PM, Fabian Gr=C3=BCnbichler wrote: >>> Quoting Christian Ebner (2026-06-04 16:09:19) [..] >>> another approach would be to switch to a different kind of locking=20 >>> mechanism >>> entirely (though with the cross-process and multi-threaded=20 >>> requirements we >>> have, this might not be easy either ;)) or to reduce the lock=20 >>> granularity. >>=20 >> Yeah, reducing the lock granularity and moving to something with=20 >> deterministic count could be a way out, and given the usage estimates=20 >> from above, using the 4 hex-digit prefix used also for chunk store=20 >> directory hierarchy might be a viable candidate for defining such lock-=20 >> files. that would give us 64M at most if we keep it on tmpfs, but might be a bit too coarse (the lock is held for the duration of an upload after all, which can take a bit..). >>> given that a mistake in the handling of retries below can cause=20 >>> dataloss, doing >>> it for every lock/unlock pair sounds a bit dangerous. there's also the >>> additional overhead if the lock is actually contended to account for -=20 >>> we need >>> at least two loop iterations if it is, potentially a lot more? >>=20 >> This is however not so frequently happening though? But yes, there is=20 >> overhead by the then unavoidable retry/retries. >>=20 >>> or we go back to square one, and revisit the whole interaction here to=20 >>> see if >>> we can get rid of the per-chunk locks again in favor of something that=20 >>> scales >>> with the number of pending uploads.. the last time we tried this=20 >>> turned out to >>> be quite tricky though. >> The major downside for all solutions is unfortunately that moving=20 >> forward can only happen with the current locking mechanism still in=20 >> place, so a transitional step for cleanup unavoidable. But at least it=20 >> is fine to remove the per-chunk lock files once they are being held in=20 >> the process instance implementing the new locking logic. yeah, it requires the usual "set flag-file, keep old behaviour until reboot/while it exists, switch over after" migration. > Oh, and another thing to consider as well: The backup snapshot=20 > lock-files are also not cleaned up AFAIKT? While way less problematic,=20 > they will also accumulate and add unneeded memory overhead on systems=20 > with high frequency backups which are not rebooted frequently. >=20 > There removing the lock-file on prune and retry for vanished lock-files=20 > after flock calls might be acceptable? Also performance wise. snapshot counts are in a different ball park though.. but still the next-most-problematic one ;) so we might need to reconsider other tmpfs-located locks as well.