From: Fabian Grünbichler
To: Proxmox VE development discussion
Date: Fri, 05 Jul 2024 11:41:07 +0200
Message-ID: <172017246796.125161.25996190556527554@yuna.proxmox.com>
In-Reply-To: <20240610125942.116985-1-f.ebner@proxmox.com>
References: <20240610125942.116985-1-f.ebner@proxmox.com>
Subject: Re: [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow

Quoting Fiona Ebner (2024-06-10 14:59:35)
> A long-standing issue with VM backups in Proxmox VE is that a slow or
> unreachable target would lead to a copy-before-write (cbw) operation
> to break the guest write rather than abort the backup.
> This is unexpected to users and they will end up without a successful
> backup and without a working guest in such cases. This series prevents
> the latter by changing the behavior to fail the backup instead of the
> guest write.
>
> This is done by re-using the already existing 'on-cbw-error' and
> 'cbw-timeout' options that are already used for fleecing and having
> regular backup also check for the cbw's snapshot_error (unfortunately
> this becomes a bit of a misnomer). If a given copy-before-write
> operation cannot complete within 45 seconds, it's extremely likely
> that aborting the backup is the better choice than keeping the guest
> IO blocked.
>
> Just checking for the error already makes it work (i.e. without the
> last two patches), but backup can only check the error at the end. To
> abort backup immediately, an error callback for the copy-before-write
> node is introduced. A potential alternative would be to give the
> block-copy operation a pointer to the snapshot_error and have it check
> it during its operation, but my initial attempt failed. Likely I
> missed adapting certain logic that checks for whether the block-copy
> operation failed and it's questionable if this approach would be
> cleaner. An error callback is nice and explicit.
>
> Note for testers: if e.g. the PBS is completely unreachable, the
> backup job still will need to wait until the in-flight request is
> aborted after 15 minutes. But the guest writes should be fast again.
>
> Should it really be required to make the option more flexible, i.e.
> allow users to specify a custom timeout or go back to the old behavior
> then the 'backup' QMP call can be extended with those parameters.
>
> Unfortunately, this is a non-trivial amount of code to make it work,
> but there is quite a bit of boilerplate and some comments, so
> hopefully the logic is straight-forward enough.

this sentence here made me expect a lot worse ;) code seems very
straight-forward and clean, two small comments inline.
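
The mechanism described in the cover letter above (a timeout on each
copy-before-write operation, a recorded snapshot_error, and an error
callback so the backup job can abort right away) can be sketched roughly
as follows. This is a simplified, self-contained C illustration and not
the code from the series: CbwState, BackupJob, cbw_guest_write(),
backup_cbw_error() and backup_finish() are hypothetical names, and the
copy duration is passed in as a number instead of being measured.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not QEMU's actual types. */
typedef void (*CbwErrorCallback)(void *opaque, int error);

typedef struct CbwState {
    int snapshot_error;        /* first cbw error seen, 0 if none */
    int timeout_secs;          /* time budget for one cbw operation */
    CbwErrorCallback error_cb; /* optional, notifies the backup job */
    void *error_cb_opaque;
} CbwState;

typedef struct BackupJob {
    bool cancelled;
    int ret;
} BackupJob;

/* Callback registered by the backup job: abort as soon as cbw fails. */
static void backup_cbw_error(void *opaque, int error)
{
    BackupJob *job = opaque;

    if (!job->cancelled) {
        job->cancelled = true;
        job->ret = error;
    }
}

/*
 * Handle one guest write. copy_secs simulates how long copying the old
 * data to the backup target would take. If that exceeds the budget, the
 * error is recorded and reported, and the guest write still succeeds.
 * Once an error is recorded, later writes skip the copy entirely.
 */
static int cbw_guest_write(CbwState *s, int copy_secs)
{
    if (s->snapshot_error == 0 && copy_secs > s->timeout_secs) {
        s->snapshot_error = -ETIMEDOUT;
        if (s->error_cb) {
            s->error_cb(s->error_cb_opaque, s->snapshot_error);
        }
    }
    return 0; /* the guest write is never failed */
}

/* Fallback check at the end of the job, for when no callback is set. */
static int backup_finish(const BackupJob *job, const CbwState *s)
{
    return job->cancelled ? job->ret : s->snapshot_error;
}
```

With a budget of 45 seconds, a write whose copy would take 900 seconds
still returns 0 to the guest, while the job ends up cancelled with
-ETIMEDOUT. Without a registered callback, the same error would only
surface through the check in backup_finish().
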
not sure whether we want to entangle this with 9.0, but I think this
could be applied soonish after some more in-depth testing, since it
should solve a pretty big pain point users consistently run into.. I am
sure we will have users clamoring for a configurable timeout soon after
though ;)

> The first patch can be applied regardless of whether we want to go
> with this or not.
>
>
> Fiona Ebner (7):
>   PVE backup: fleecing: properly set minimum cluster size
>   block/copy-before-write: allow passing additional options for
>     bdrv_cbw_append()
>   block/backup: allow passing additional options for copy-before-write
>     upon job creation
>   block/backup: make cbw error also fail backup that does not use
>     fleecing
>   fix #3231+#3631: PVE backup: add timeout for copy-before-write
>     operations and fail backup instead of guest writes
>   block/copy-before-write: allow specifying error callback
>   block/backup: set callback for cbw errors
>
>  block/backup.c                         | 57 +++++++++++++++++++++++++-
>  block/copy-before-write.c              | 41 +++++++++++++++---
>  block/copy-before-write.h              |  9 +++-
>  block/replication.c                    |  2 +-
>  blockdev.c                             |  2 +-
>  include/block/block_int-global-state.h |  2 +
>  pve-backup.c                           | 13 +++++-
>  7 files changed, 115 insertions(+), 11 deletions(-)
>
> --
> 2.39.2
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel