From: "Lukas Wagner"
To: "Proxmox Backup Server development discussion" <pbs-devel@lists.proxmox.com>, "Christian Ebner" <c.ebner@proxmox.com>
Date: Mon, 21 Jul 2025 17:05:10 +0200
Subject: Re: [pbs-devel] [PATCH proxmox{, -backup} v9 00/49] fix #2943: S3 storage backend for datastores
In-Reply-To: <20250719125035.9926-1-c.ebner@proxmox.com>

Retested these patches on the latest master branch(es): basic backups, sync jobs, verification, GC, pruning, etc. This time I tried to focus more on failure scenarios, e.g. a failing connection to the S3 server during different operations. Here's what I found; most of these issues I already discussed and debugged off-list with @Chris:

1.) When doing an S3 Refresh and PBS cannot connect to S3, a `tmp_xxxxxxx` directory is left over in the local datastore directory. After clearing the S3 Refresh maintenance mode (or doing a successful S3 Refresh), GC jobs will fail because they cannot access this left-over directory (it is owned by root:root). AFAIK Chris has already prepared a fix for this.
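In case anyone else runs into this before the fix lands, the leftover directory can be removed by hand. A minimal sketch, assuming the datastore root is /s3-store as in my test setup and the directory kept its tmp_ prefix:

# locate the leftover root-owned tmp_* directory in the datastore root
find /s3-store -maxdepth 1 -type d -name 'tmp_*' -user root
# remove it once no S3 Refresh task is running anymore
rm -rf /s3-store/tmp_xxxxxxx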
2.) I backed up some VMs to my local MinIO server, which ran out of disk space during the backup. Since even delete operations failed in this scenario, PBS could not clean up the snapshot directory, which was left over after the failed backup. In some instances the snapshot directory was completely empty; in another case two blobs were written, but the fidx files were missing:

root@pbs-s3:/s3-store/ns/pali/vm# ls 160/2025-07-21T12\:51\:44Z/
fw.conf.blob  qemu-server.conf.blob
root@pbs-s3:/s3-store/ns/pali/vm# ls 165/
2025-07-21T12:52:42Z/  owner
root@pbs-s3:/s3-store/ns/pali/vm# ls 165/2025-07-21T12\:52\:42Z/
root@pbs-s3:/s3-store/ns/pali/vm#

I could fix this by doing an "S3 Refresh" and then manually deleting the affected snapshot under the "Content" view - something that could be very annoying if one has hundreds or thousands of snapshots. So I think we need some form of automatic cleanup for fragments from incomplete/failed backups (a rough detection sketch is in the P.S. below). After all, I'm pretty sure one could end up in a similar situation by just cutting the network connection to the S3 server at the right moment.

3.) I cut the connection to my MinIO server during a verification job. The task log was spammed with the following messages:

2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0
2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/7478/747835eb948591da7c4ebe892a9eb28c0daa8978bb80b70350f5b07225a1b9b0.0.bad"
2025-07-21T16:06:51+02:00: "can't verify chunk, load failed - client error (Connect)"
2025-07-21T16:06:51+02:00: failed to copy corrupt chunk on s3 backend: 5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf
2025-07-21T16:06:51+02:00: corrupted chunk renamed to "/s3-store/.chunks/5680/5680458c0dba35dd1b528b5e38d32d410aee285f4d0328bbd8814fb5eb129aaf.0.bad"

While not really catastrophic, since the renamed chunks would just be refetched from S3 on the next access, this should probably be handled more gracefully; in particular, a failed connection does not mean the chunk on S3 is actually corrupt.

One thing that I spotted in the documentation was the following:

proxmox-backup-manager s3 client create my-s3-client --secrets-id my-s3-client ...

The user has to specify the client ID twice, once for the regular config and once for the secrets config. This was implemented this way due to how parameter flattening for API type structs works. I discussed this with @Chris and suggested another approach that works without duplicating the ID, which will hopefully make the UX a bit nicer.

Apart from these issues, everything seemed to work fine.

Tested-by: Lukas Wagner
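P.S. Regarding 2.): until some automatic cleanup exists, leftover fragments could at least be spotted from the shell. A rough sketch, assuming the default datastore layout under /s3-store and that snapshot directories are named after their backup timestamp; an illustration only, not a vetted tool:

# flag snapshot directories that contain neither .fidx nor .didx index
# files, i.e. likely fragments of failed/incomplete backups
# (note: a backup that is still running would show up here as well)
find /s3-store -type d -name '*T*Z' | while read -r snap; do
    if ! ls "$snap"/*.fidx >/dev/null 2>&1 && ! ls "$snap"/*.didx >/dev/null 2>&1; then
        echo "possibly incomplete: $snap"
    fi
done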
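P.P.S. Regarding 3.): to reproduce, the connection cut can be simulated with a firewall rule while the verification job is running. A sketch, assuming MinIO listens on its default port 9000 (adjust host/port for your setup):

# drop outgoing traffic to the MinIO port while the verification job runs
iptables -A OUTPUT -p tcp --dport 9000 -j DROP
# ...watch the task log, then restore the connection...
iptables -D OUTPUT -p tcp --dport 9000 -j DROP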