From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id 163F31FF39E for ; Mon, 10 Jun 2024 14:59:21 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 47FA717389; Mon, 10 Jun 2024 14:59:50 +0200 (CEST) From: Fiona Ebner To: pve-devel@lists.proxmox.com Date: Mon, 10 Jun 2024 14:59:35 +0200 Message-Id: <20240610125942.116985-1-f.ebner@proxmox.com> X-Mailer: git-send-email 2.39.2 MIME-Version: 1.0 X-SPAM-LEVEL: Spam detection results: 0 AWL -0.059 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record T_SCC_BODY_TEXT_LINE -0.01 - Subject: [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: Proxmox VE development discussion Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: pve-devel-bounces@lists.proxmox.com Sender: "pve-devel" A long-standing issue with VM backups in Proxmox VE is that a slow or unreachable target would lead to a copy-before-write (cbw) operation to break the guest write rather than abort the backup. This is unexpected to users and the will end up without a successful backup and without a working guest in such cases. This series prevents the latter by changing the behavior to fail the backup instead of the guest write. This is done by re-using the already existing 'on-cbw-error' and 'cbw-timeout' options that are already used for fleecing and having regular backup also check for the cbw's snapshot_error (unfortunately this becomes a bit of a misnomer). If a given copy-before-write operation cannot complete within 45 seconds, it's extremely likely that aborting the backup is the better choice than keeping the guest IO blocked. Just checking for the error already makes it work (i.e. without the last two patches), but backup can only check the error at the end. To abort backup immediately, an error callback for the copy-before-write node is introduced. A potential alternative would be give the block-copy operation a pointer to the snapshot_error and have it check it during its operation, but my initial attempt failed. Likely I missed adapting certain logic that checks for whether the block-copy operation failed and it's questionable if this approach would be cleaner. An error callback is nice and explicit. Note for testers: if e.g. the PBS is compeletly unreachable, the backup job still will need to wait until the in-flight request is aborted after 15 minutes. But the guest writes should be fast again. Should it really be required to make the option more flexible, i.e. allow users to specify a custom timeout or go back to the old behavior then the 'backup' QMP call can be extended with those parameters. Unfortunately, this is a non-trivial amount of code to make it work, but there is quite a bit of boilerplate and some comments, so hopefully the logic is straight-forward enough. The first patch can be applied regardless of whether we want to go with this or not. Fiona Ebner (7): PVE backup: fleecing: properly set minimum cluster size block/copy-before-write: allow passing additional options for bdrv_cbw_append() block/backup: allow passing additional options for copy-before-write upon job creation block/backup: make cbw error also fail backup that does not use fleecing fix #3231+#3631: PVE backup: add timeout for copy-before-write operations and fail backup instead of guest writes block/copy-before-write: allow specifying error callback block/backup: set callback for cbw errors block/backup.c | 57 +++++++++++++++++++++++++- block/copy-before-write.c | 41 +++++++++++++++--- block/copy-before-write.h | 9 +++- block/replication.c | 2 +- blockdev.c | 2 +- include/block/block_int-global-state.h | 2 + pve-backup.c | 13 +++++- 7 files changed, 115 insertions(+), 11 deletions(-) -- 2.39.2 _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel