public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow
Date: Mon, 10 Jun 2024 14:59:35 +0200	[thread overview]
Message-ID: <20240610125942.116985-1-f.ebner@proxmox.com> (raw)

A long-standing issue with VM backups in Proxmox VE is that a slow or
unreachable target would lead to a copy-before-write (cbw) operation
to break the guest write rather than abort the backup. This is
unexpected to users and the will end up without a successful backup
and without a working guest in such cases. This series prevents the
latter by changing the behavior to fail the backup instead of the
guest write.

This is done by re-using the already existing 'on-cbw-error' and
'cbw-timeout' options that are already used for fleecing and having
regular backup also check for the cbw's snapshot_error (unfortunately
this becomes a bit of a misnomer). If a given copy-before-write
operation cannot complete within 45 seconds, it's extremely likely
that aborting the backup is the better choice than keeping the guest
IO blocked.

Just checking for the error already makes it work (i.e. without the
last two patches), but backup can only check the error at the end. To
abort backup immediately, an error callback for the copy-before-write
node is introduced. A potential alternative would be give the
block-copy operation a pointer to the snapshot_error and have it check
it during its operation, but my initial attempt failed. Likely I
missed adapting certain logic that checks for whether the block-copy
operation failed and it's questionable if this approach would be
cleaner. An error callback is nice and explicit.

Note for testers: if e.g. the PBS is compeletly unreachable, the
backup job still will need to wait until the in-flight request is
aborted after 15 minutes. But the guest writes should be fast again.

Should it really be required to make the option more flexible, i.e.
allow users to specify a custom timeout or go back to the old behavior
then the 'backup' QMP call can be extended with those parameters.

Unfortunately, this is a non-trivial amount of code to make it work,
but there is quite a bit of boilerplate and some comments, so
hopefully the logic is straight-forward enough.


The first patch can be applied regardless of whether we want to go
with this or not.


Fiona Ebner (7):
  PVE backup: fleecing: properly set minimum cluster size
  block/copy-before-write: allow passing additional options for
    bdrv_cbw_append()
  block/backup: allow passing additional options for copy-before-write
    upon job creation
  block/backup: make cbw error also fail backup that does not use
    fleecing
  fix #3231+#3631: PVE backup: add timeout for copy-before-write
    operations and fail backup instead of guest writes
  block/copy-before-write: allow specifying error callback
  block/backup: set callback for cbw errors

 block/backup.c                         | 57 +++++++++++++++++++++++++-
 block/copy-before-write.c              | 41 +++++++++++++++---
 block/copy-before-write.h              |  9 +++-
 block/replication.c                    |  2 +-
 blockdev.c                             |  2 +-
 include/block/block_int-global-state.h |  2 +
 pve-backup.c                           | 13 +++++-
 7 files changed, 115 insertions(+), 11 deletions(-)

-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


             reply	other threads:[~2024-06-10 12:59 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-10 12:59 Fiona Ebner [this message]
2024-06-10 12:59 ` [pve-devel] [PATCH qemu 1/7] PVE backup: fleecing: properly set minimum cluster size Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 2/7] block/copy-before-write: allow passing additional options for bdrv_cbw_append() Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 3/7] block/backup: allow passing additional options for copy-before-write upon job creation Fiona Ebner
2024-07-05  9:30   ` Fabian Grünbichler
2024-06-10 12:59 ` [pve-devel] [RFC qemu 4/7] block/backup: make cbw error also fail backup that does not use fleecing Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 5/7] fix #3231+#3631: PVE backup: add timeout for copy-before-write operations and fail backup instead of guest writes Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 6/7] block/copy-before-write: allow specifying error callback Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 7/7] block/backup: set callback for cbw errors Fiona Ebner
2024-07-05  9:37   ` Fabian Grünbichler
2024-07-05  9:41 ` [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow Fabian Grünbichler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240610125942.116985-1-f.ebner@proxmox.com \
    --to=f.ebner@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal