Subject: Re: [pbs-devel] [PATCH proxmox{,-backup} v9 00/49] fix #2943: S3 storage backend for datastores
Date: Mon, 21 Jul 2025 17:37:30 +0200
From: Christian Ebner
To: Lukas Wagner, Proxmox Backup Server development discussion

On 7/21/25 5:05 PM, Lukas Wagner wrote:
> Retested these patches on the latest master branch(es).
>
> Retested basic backups, sync jobs, verification, GC, pruning, etc.
>
> This time I tried to focus more on different failure scenarios, e.g. a
> failing connection to the S3 server during different operations.
>
> Here's what I found; most of these issues I already discussed and
> debugged off-list with @Chris:
>
> 1.)
>
> When doing an S3 Refresh and PBS cannot connect to S3, a `tmp_xxxxxxx`
> directory is left over in the local datastore directory. After clearing
> the S3 Refresh maintenance mode (or doing a successful S3 Refresh), GC
> jobs will fail because they cannot access this left-over directory (it
> is owned by root:root).
> AFAIK Chris has already prepared a fix for this.

Will be fixed in the next version of the patch series, thanks!
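Until that fix is packaged, affected setups can probably just remove the
left-over directory by hand. A rough, untested sketch - the /s3-store
path is taken from the sessions further down and the tmp_ prefix from
your report, so adjust both to the actual setup:

  # list left-over temporary directories from aborted S3 Refresh runs
  # directly in the datastore root (here /s3-store)
  find /s3-store -maxdepth 1 -type d -name 'tmp_*' -ls
  # once no refresh or GC task is running anymore, remove them as root:
  # find /s3-store -maxdepth 1 -type d -name 'tmp_*' -exec rm -rf {} +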
>
> 2.)
>
> I backed up some VMs to my local MinIO server, which ran out of disk
> space during backup. Since even delete operations failed in this
> scenario, PBS could not clean up the snapshot directory, which was
> left over after this failed backup. In some instances the snapshot
> directory was completely empty; in another case two blobs were
> written, but the fidx files were missing:
>
> root@pbs-s3:/s3-store/ns/pali/vm# ls 160/2025-07-21T12\:51\:44Z/
> fw.conf.blob  qemu-server.conf.blob
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/
> 2025-07-21T12:52:42Z/  owner
> root@pbs-s3:/s3-store/ns/pali/vm# ls 165/2025-07-21T12\:52\:42Z/
> root@pbs-s3:/s3-store/ns/pali/vm#
>
> I could fix this by doing an "S3 Refresh" and then manually deleting the
> affected snapshot under the "Content" view - something that could be
> very annoying if one has hundreds or thousands of snapshots, so I think
> we need some form of automatic cleanup for fragments from
> incomplete/failed backups. After all, I'm pretty sure that one could end
> up in a similar situation by just cutting the network connection to the
> S3 server at the right moment in time.

As discussed a bit off-list already, this would indeed be nice to have,
but at the moment I see no way of doing it consistently without manual
user interaction. In your tests the cleanup of objects on the S3 backend
failed because the server had run out of disk space, so the user needs to
fix that first anyway. Automatic cleanup of fragments in the S3 store
after a connection loss might be doable during garbage collection or
verification, but I will have to think this through in detail, so best
left for a follow-up. (A rough sketch of how such leftovers could at
least be listed manually in the meantime is appended at the end of this
mail.)

>
> 3.)
>
> Cut the connection to my MinIO server during a verification job.
> The task log was spammed with the following messages:
>
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/7478/747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0.0.bad"
> 2025-07-21T16:06:51+02:00: "can't verify chunk, load failed - client error (Connect)"
> 2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf
> 2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/5680/5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf.0.bad"
>
> While not really catastrophic, since these chunks would then just be
> refetched from S3 on the next access, this should probably be handled
> better/more gracefully.

Fixed this as well already for the upcoming v10 of the patches, thanks!

>
> One thing that I spotted in the documentation was the following:
>
> proxmox-backup-manager s3 client create my-s3-client --secrets-id my-s3-client ...
>
> The user has to specify the client ID twice, once for the regular config
> and once for the secrets config. This was implemented this way due to
> how parameter flattening for API type structs works. I discussed this
> with @Chris and suggested another approach, one that works without
> duplicating the ID, to hopefully make the UX a bit nicer.

Same, this will be fixed with the next iteration.

> Apart from these issues everything seemed to work fine.
>
> Tested-by: Lukas Wagner
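As mentioned above regarding 2.), here is a rough sketch of how left-over
snapshot directories could at least be listed manually until an automatic
cleanup exists. It is untested, assumes the VM snapshot layout from the
sessions above, and relies on a completed snapshot always containing its
manifest (index.json.blob):

  # list VM snapshot directories in the 'pali' namespace that have no
  # backup manifest - likely fragments of failed/interrupted backups,
  # to be double-checked and then removed via the Content view or CLI
  for snap in /s3-store/ns/pali/vm/*/*Z/; do
      [ -d "$snap" ] || continue  # glob did not match anything
      [ -e "${snap}index.json.blob" ] || echo "incomplete snapshot: $snap"
  done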