public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow
@ 2024-06-10 12:59 Fiona Ebner
  2024-06-10 12:59 ` [pve-devel] [PATCH qemu 1/7] PVE backup: fleecing: properly set minimum cluster size Fiona Ebner
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Fiona Ebner @ 2024-06-10 12:59 UTC (permalink / raw)
  To: pve-devel

A long-standing issue with VM backups in Proxmox VE is that a slow or
unreachable target would lead to a copy-before-write (cbw) operation
to break the guest write rather than abort the backup. This is
unexpected to users and the will end up without a successful backup
and without a working guest in such cases. This series prevents the
latter by changing the behavior to fail the backup instead of the
guest write.

This is done by re-using the already existing 'on-cbw-error' and
'cbw-timeout' options that are already used for fleecing and having
regular backup also check for the cbw's snapshot_error (unfortunately
this becomes a bit of a misnomer). If a given copy-before-write
operation cannot complete within 45 seconds, it's extremely likely
that aborting the backup is the better choice than keeping the guest
IO blocked.

Just checking for the error already makes it work (i.e. without the
last two patches), but backup can only check the error at the end. To
abort backup immediately, an error callback for the copy-before-write
node is introduced. A potential alternative would be give the
block-copy operation a pointer to the snapshot_error and have it check
it during its operation, but my initial attempt failed. Likely I
missed adapting certain logic that checks for whether the block-copy
operation failed and it's questionable if this approach would be
cleaner. An error callback is nice and explicit.

Note for testers: if e.g. the PBS is compeletly unreachable, the
backup job still will need to wait until the in-flight request is
aborted after 15 minutes. But the guest writes should be fast again.

Should it really be required to make the option more flexible, i.e.
allow users to specify a custom timeout or go back to the old behavior
then the 'backup' QMP call can be extended with those parameters.

Unfortunately, this is a non-trivial amount of code to make it work,
but there is quite a bit of boilerplate and some comments, so
hopefully the logic is straight-forward enough.


The first patch can be applied regardless of whether we want to go
with this or not.


Fiona Ebner (7):
  PVE backup: fleecing: properly set minimum cluster size
  block/copy-before-write: allow passing additional options for
    bdrv_cbw_append()
  block/backup: allow passing additional options for copy-before-write
    upon job creation
  block/backup: make cbw error also fail backup that does not use
    fleecing
  fix #3231+#3631: PVE backup: add timeout for copy-before-write
    operations and fail backup instead of guest writes
  block/copy-before-write: allow specifying error callback
  block/backup: set callback for cbw errors

 block/backup.c                         | 57 +++++++++++++++++++++++++-
 block/copy-before-write.c              | 41 +++++++++++++++---
 block/copy-before-write.h              |  9 +++-
 block/replication.c                    |  2 +-
 blockdev.c                             |  2 +-
 include/block/block_int-global-state.h |  2 +
 pve-backup.c                           | 13 +++++-
 7 files changed, 115 insertions(+), 11 deletions(-)

-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-07-05  9:41 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-10 12:59 [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [PATCH qemu 1/7] PVE backup: fleecing: properly set minimum cluster size Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 2/7] block/copy-before-write: allow passing additional options for bdrv_cbw_append() Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 3/7] block/backup: allow passing additional options for copy-before-write upon job creation Fiona Ebner
2024-07-05  9:30   ` Fabian Grünbichler
2024-06-10 12:59 ` [pve-devel] [RFC qemu 4/7] block/backup: make cbw error also fail backup that does not use fleecing Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 5/7] fix #3231+#3631: PVE backup: add timeout for copy-before-write operations and fail backup instead of guest writes Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 6/7] block/copy-before-write: allow specifying error callback Fiona Ebner
2024-06-10 12:59 ` [pve-devel] [RFC qemu 7/7] block/backup: set callback for cbw errors Fiona Ebner
2024-07-05  9:37   ` Fabian Grünbichler
2024-07-05  9:41 ` [pve-devel] [RFC qemu] fix #3231+#3631: PVE backup: fail backup rather than guest write when backup target cannot be reached or is too slow Fabian Grünbichler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal