Date: Wed, 09 Sep 2020 08:01:40 +0200
From: Fabian Grünbichler
To: Proxmox Backup Server development discussion, Stefan Reiter
Subject: Re: [pbs-devel] [PATCH proxmox-backup] gc: attach context to index reader errors and ignore NotFound
References: <20200908091804.27685-1-s.reiter@proxmox.com> <1599563352.ezakbc52qx.astroid@nora.none> <2be25648-c858-e5eb-a447-76d8ab662bc2@proxmox.com>
In-Reply-To: <2be25648-c858-e5eb-a447-76d8ab662bc2@proxmox.com>
Message-Id: <1599630655.6wa1bcvxiy.astroid@nora.none>

On September 8, 2020 1:18 pm, Stefan Reiter wrote:
> On 9/8/20 1:12 PM, Fabian Grünbichler wrote:
>> On September 8, 2020 11:18 am, Stefan Reiter wrote:
>>> Ignore NotFound errors during phase 1, this just means that a snapshot
>>> was forgotten or pruned between scanning for .fidx/.didx files and
>>> actually opening the index to touch the chunks.
>>
>> I originally had a similar patch already lying around, but I am not sure
>> whether this is not too dangerous in the face of transient errors?
>>
>> I'd much rather get to a point where we are sure that no concurrent
>> prune/forget operation can happen, and treat all errors as errors,
>> instead of treating all not found errors as benign 'must have happened
>> cause of concurrent actions'.
>>
>
> So no forget/prune during phase 1 of GC? That sounds like it would cause
> quite some congestion.

or locking and touching group-wise, to reduce granularity and
contention? or let prune/forget wait until GC phase 1 is over, by having
a higher lock timeout?

phase 1 does not take too long here, but it probably depends a lot on
datastore setup and size (special vdevs and enough RAM for caching
probably help a lot here..)

we could also just mark them as deleted (touch $snapshot/.deleted) and
let GC do the actual deletion of metadata as well, but that would be a
much more involved change.
added benefit that GC is now the only thing
that deletes stuff (except for cleanup of aborted backup tasks, but that
could also switch to that mechanism I guess).

>
>> this is not pull, or download/restore, where we can just retry later -
>> if we skip the index here, all the chunks it referenced are up for
>> garbage collection unless they are saved by another index!
>>
>
> I do see where you're coming from, but what alternative is there? If the
> index file is not found, we can't touch any referenced chunks anyway -
> there are none for us to see.

the alternatives are

A) treat index files which we expected to read that have vanished as
'must be benign', and continue GC
B) try to not have a scenario where that can happen benignly (e.g.,
because of a mutex between operations that delete indices and this phase
of GC), so that we can know that it is an error and treat it as such

I'd like to choose B since it is the safe alternative, and this is the
one path where having a bug could wipe out whole datastores, but if it's
too involved then we have to go with A
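
to make B a bit more concrete, here is a rough sketch of the idea. it is
not existing proxmox-backup code: IndexLock, remove_index and mark_phase
are made-up names, and it assumes an in-process std::sync::RwLock, while
the real thing would probably need a lock that also works across
processes. deleters take the lock shared, GC phase 1 takes it
exclusively, so an index that vanishes between listing and reading can
only be a real error:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::RwLock;

/// Hypothetical per-datastore lock (an assumption for illustration).
/// prune/forget take it shared, GC phase 1 takes it exclusively.
pub struct IndexLock(RwLock<()>);

impl IndexLock {
    pub fn new() -> Self {
        IndexLock(RwLock::new(()))
    }

    /// prune/forget path: delete an index file while holding the shared lock.
    pub fn remove_index(&self, path: &Path) -> io::Result<()> {
        let _guard = self.0.read().unwrap();
        fs::remove_file(path)
    }

    /// GC phase 1: list all .fidx/.didx files in `dir`, then read each one
    /// and hand its contents to `touch_chunks`. Returns the number of
    /// indices processed.
    pub fn mark_phase<F>(&self, dir: &Path, mut touch_chunks: F) -> io::Result<usize>
    where
        F: FnMut(&[u8]) -> io::Result<()>,
    {
        // Exclusive lock: no deleter can run until phase 1 is done.
        let _guard = self.0.write().unwrap();

        let mut indices: Vec<PathBuf> = Vec::new();
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            let is_index = matches!(
                path.extension().and_then(|ext| ext.to_str()),
                Some("fidx") | Some("didx")
            );
            if is_index {
                indices.push(path);
            }
        }

        for path in &indices {
            // Every error is fatal here, NotFound included, and carries
            // the path of the index as context.
            let data = fs::read(path).map_err(|err| {
                io::Error::new(
                    err.kind(),
                    format!("reading index {:?} failed: {}", path, err),
                )
            })?;
            touch_chunks(&data)?;
        }

        Ok(indices.len())
    }
}
```

the obvious downside is the one Stefan mentioned: prune/forget block for
as long as phase 1 runs, so this only makes sense if phase 1 stays short
or the lock is taken per group instead of per datastore.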