From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 163F31FF39E
	for <inbox@lore.proxmox.com>; Mon, 10 Jun 2024 14:59:21 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 47FA717389;
	Mon, 10 Jun 2024 14:59:50 +0200 (CEST)
From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Mon, 10 Jun 2024 14:59:35 +0200
Message-Id: <20240610125942.116985-1-f.ebner@proxmox.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
X-SPAM-LEVEL: Spam detection results:  0
 AWL -0.059 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
Subject: [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup
 rather than guest write when backup target cannot be reached or is too slow
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel" <pve-devel-bounces@lists.proxmox.com>

A long-standing issue with VM backups in Proxmox VE is that a slow or
unreachable target would lead to a copy-before-write (cbw) operation
to break the guest write rather than abort the backup. This is
unexpected to users and the will end up without a successful backup
and without a working guest in such cases. This series prevents the
latter by changing the behavior to fail the backup instead of the
guest write.

This is done by re-using the already existing 'on-cbw-error' and
'cbw-timeout' options that are already used for fleecing and having
regular backup also check for the cbw's snapshot_error (unfortunately
this becomes a bit of a misnomer). If a given copy-before-write
operation cannot complete within 45 seconds, it's extremely likely
that aborting the backup is the better choice than keeping the guest
IO blocked.

Just checking for the error already makes it work (i.e. without the
last two patches), but backup can only check the error at the end. To
abort backup immediately, an error callback for the copy-before-write
node is introduced. A potential alternative would be give the
block-copy operation a pointer to the snapshot_error and have it check
it during its operation, but my initial attempt failed. Likely I
missed adapting certain logic that checks for whether the block-copy
operation failed and it's questionable if this approach would be
cleaner. An error callback is nice and explicit.

Note for testers: if e.g. the PBS is compeletly unreachable, the
backup job still will need to wait until the in-flight request is
aborted after 15 minutes. But the guest writes should be fast again.

Should it really be required to make the option more flexible, i.e.
allow users to specify a custom timeout or go back to the old behavior
then the 'backup' QMP call can be extended with those parameters.

Unfortunately, this is a non-trivial amount of code to make it work,
but there is quite a bit of boilerplate and some comments, so
hopefully the logic is straight-forward enough.


The first patch can be applied regardless of whether we want to go
with this or not.


Fiona Ebner (7):
  PVE backup: fleecing: properly set minimum cluster size
  block/copy-before-write: allow passing additional options for
    bdrv_cbw_append()
  block/backup: allow passing additional options for copy-before-write
    upon job creation
  block/backup: make cbw error also fail backup that does not use
    fleecing
  fix #3231+#3631: PVE backup: add timeout for copy-before-write
    operations and fail backup instead of guest writes
  block/copy-before-write: allow specifying error callback
  block/backup: set callback for cbw errors

 block/backup.c                         | 57 +++++++++++++++++++++++++-
 block/copy-before-write.c              | 41 +++++++++++++++---
 block/copy-before-write.h              |  9 +++-
 block/replication.c                    |  2 +-
 blockdev.c                             |  2 +-
 include/block/block_int-global-state.h |  2 +
 pve-backup.c                           | 13 +++++-
 7 files changed, 115 insertions(+), 11 deletions(-)

-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel