Date: Mon, 06 Oct 2025 15:18:35 +0200
From: Fabian Grünbichler
To: Proxmox Backup Server development discussion <pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 0/7] s3 store: fix issues with chunk s3 backend upload and cache eviction
Message-Id: <1759754692.4isfk73089.astroid@yuna.none>
In-Reply-To: <20251006104151.487202-1-c.ebner@proxmox.com>

On October 6, 2025 12:41 pm, Christian Ebner wrote:
> These patches fix 2 issues with the current s3 backend implementation
> and reduce code duplication.
>
> Patches 1 to 3 rework the garbage collection, deduplicating the common
> atime check and garbage collection status update logic, and refactor the
> chunk removal from the local datastore cache on s3 backends.
>
> Patch 4 fixes an issue which could lead to backup restores failing during
> concurrent backups: if the local datastore cache was small, chunks could
> be evicted from the cache by truncating their contents while a chunk
> reader was still accessing them. This is now circumvented by replacing
> the chunk file instead, so the contents remain accessible to the reader
> via its already opened file handle.

these parts here are not 100% there yet (see comments), but I think the
approach taken is fine in principle!

> The remaining patches fix a possible race condition between s3 backend
> upload and garbage collection, which can result in chunk loss. If the
> chunk upload finished and garbage collection listed and checked the
> chunk's in-use marker just before it was written by the cache insert
> following the upload, garbage collection could incorrectly delete the
> chunk. This is circumvented by setting and checking an additional chunk
> marker file, which is created before starting the upload and removed
> after the cache insert, ensuring that these chunks are not removed.

see detailed comments on individual patches, but I think the current
approach of using marker files guarded by the existing chunk store mutex
has a lot of subtle pitfalls surrounding concurrent insertions and error
handling..

one potential alternative would be to use a flock instead, but that might
lead to scalability issues if there are lots of concurrent chunk uploads..
maybe we could instead do the following (rough sketch of the GC side below
the list):

- lock the mutex & touch the pending marker file before uploading the
  chunk (but allow it to already exist, as that might be a concurrent or
  older failed upload)
- clean it up when inserting into the cache (but treat it not existing as
  benign?)
- in GC, lock and
  - check whether the chunk was properly inserted in the meantime (then we
    know it's fresh and mustn't be cleaned up)
  - if not, check the atime of the pending marker (if it is before the
    cutoff then it must be from an already aborted backup that we don't
    care about, and we can remove the pending marker and the chunk on S3)
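something like this is what I have in mind for the GC side - just a rough
sketch, all names in here (gc_check_pending_chunk, the pending marker path,
the remove_from_s3 callback) are made up and don't map 1:1 to the actual
code:

use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// GC-side check for one pending chunk; meant to run while holding the
/// chunk store mutex so it cannot race with a concurrent cache insert.
fn gc_check_pending_chunk(
    chunk_path: &Path,        // final chunk file in the local cache
    pending_path: &Path,      // marker touched before the S3 upload started
    atime_cutoff: SystemTime, // same cutoff used for the regular atime checks
    remove_from_s3: impl FnOnce() -> io::Result<()>, // placeholder backend call
) -> io::Result<()> {
    if chunk_path.exists() {
        // chunk was properly inserted in the meantime -> it is fresh and
        // must not be cleaned up; just drop a leftover marker if any
        let _ = fs::remove_file(pending_path);
        return Ok(());
    }

    match fs::metadata(pending_path) {
        Ok(meta) => {
            if meta.accessed()? < atime_cutoff {
                // marker predates the cutoff -> the upload it belonged to
                // was aborted, remove the marker and the chunk on S3
                remove_from_s3()?;
                fs::remove_file(pending_path)?;
            }
            // otherwise an upload might still be in flight -> leave it alone
        }
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            // neither chunk nor marker -> nothing pending for this digest
        }
        Err(err) => return Err(err),
    }

    Ok(())
}

the important part being that both the existence check and the marker
handling happen under the same lock that guards the cache insert.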
and I think at this point we should have a single helper for inserting an
in-memory chunk into an S3-backed datastore, including the upload handling..
while atm we don't support restoring from tape to S3 directly, once we add
that it would be a third place where we'd need the same/similar code and
potentially introduce bugs related to handling all this (the backup API and
pull sync being the existing two). a very rough sketch of the shape such a
helper could take is at the bottom, below the quoted diffstat.

> proxmox-backup:
>
> Christian Ebner (7):
>   datastore: gc: inline single callsite method
>   gc: chunk store: rework atime check and gc status into common helper
>   chunk store: add and use method to remove chunks
>   chunk store: fix: replace evicted cache chunks instead of truncate
>   api: chunk upload: fix race between chunk backend upload and insert
>   api: chunk upload: fix race with garbage collection for no-cache on s3
>   pull: guard chunk upload and only insert into cache after upload
>
>  pbs-datastore/src/chunk_store.rs              | 211 +++++++++++++++---
>  pbs-datastore/src/datastore.rs                | 155 +++++++------
>  .../src/local_datastore_lru_cache.rs          |  32 +--
>  src/api2/backup/upload_chunk.rs               |  29 ++-
>  src/api2/config/datastore.rs                  |   2 +
>  src/server/pull.rs                            |  14 +-
>  6 files changed, 299 insertions(+), 144 deletions(-)
>
>
> Summary over all repositories:
>   6 files changed, 299 insertions(+), 144 deletions(-)
>
> --
> Generated by git-murpp 0.8.1
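for the helper, something along these lines - again just a sketch, none of
these names exist like that in the code base, it's only meant to show the
consolidated flow:

use std::io;

// stand-in type only - the real thing would hang off the existing
// datastore / chunk store types
pub struct S3BackedStore;

impl S3BackedStore {
    /// single entry point for inserting an in-memory chunk into an
    /// S3-backed datastore, shared by the backup API, pull sync and a
    /// future tape restore path
    pub fn insert_chunk(
        &self,
        digest: &[u8; 32],
        raw_chunk: &[u8],
        upload: impl FnOnce(&[u8; 32], &[u8]) -> io::Result<()>, // S3 PUT
    ) -> io::Result<()> {
        self.touch_pending_marker(digest)?; // under the chunk store mutex
        upload(digest, raw_chunk)?; // upload to the backend
        self.insert_into_cache(digest, raw_chunk)?; // local cache insert
        self.remove_pending_marker(digest)?; // a missing marker is benign
        Ok(())
    }

    // placeholders for the steps discussed above
    fn touch_pending_marker(&self, _digest: &[u8; 32]) -> io::Result<()> {
        Ok(())
    }
    fn insert_into_cache(&self, _digest: &[u8; 32], _data: &[u8]) -> io::Result<()> {
        Ok(())
    }
    fn remove_pending_marker(&self, _digest: &[u8; 32]) -> io::Result<()> {
        Ok(())
    }
}

that way the pending marker handling only lives in one place and the
callers don't have to get the ordering right themselves.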