public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing
@ 2024-01-25 14:41 Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 01/13] backup: factor out gathering device info into helper Fiona Ebner
                   ` (13 more replies)
  0 siblings, 14 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

When a backup for a VM is started, QEMU will install a
"copy-before-write" filter in its block layer. This filter ensures
that upon new guest writes, old data still needed for the backup is
sent to the backup target first. The guest write blocks until this
operation is finished, so guest IO to not-yet-backed-up sectors will
be limited by the speed of the backup target.

With backup fleecing, such old data is cached in a fleecing image
rather than sent directly to the backup target. This can help guest IO
performance and even prevent hangs in certain scenarios, at the cost
of requiring more storage space.

With this series it will be possible to enable backup-fleecing via
e.g. `vzdump 123 --fleecing enabled=1,storage=local-zfs` with fleecing
images created on the storage `local-zfs`. If no storage is specified,
the fleecing image will be created on the same storage as the original
image.
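
For illustration, a few invocation variants (only the first is the
verbatim example from above; the short form and the vzdump.conf line
are assumptions derived from the 'backup-fleecing' format in the
guest-common patch, where 'enabled' is the default key):

# fleecing images on a dedicated, fast storage
vzdump 123 --fleecing enabled=1,storage=local-zfs

# 'enabled' is the property string's default key, so this should be
# equivalent to enabled=1, with fleecing images on the source storages
vzdump 123 --fleecing 1

# presumably also usable as a node-wide default in /etc/vzdump.conf:
# fleecing: enabled=1,storage=local-zfs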


Fleecing images are created by qemu-server via pve-storage and
attached to QEMU before the backup starts, and cleaned up after the
backup has finished or failed. Currently, just a "-fleecing(.raw)"
suffix is added and there is no special handling yet for e.g. qm
rescan. Previous left-overs are also not cleaned up automatically,
because, while unlikely, images with such a name might have been
created by a user too. Happy to discuss alternatives!
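
A hedged sketch of how left-overs could be inspected and removed by
hand with standard pve-storage tooling (the volume name is
hypothetical and only assumes the "-fleecing" suffix from above):

# list candidate left-overs on the fleecing storage
pvesm list local-zfs | grep -- '-fleecing'

# free one after double-checking it is not a user-created volume
pvesm free local-zfs:vm-123-disk-0-fleecing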

The fleecing image needs to be exactly the same size as the source,
but luckily, an explicit size can be specified when attaching a raw
image to QEMU, so there are no size issues with storages that have a
coarser allocation granularity and round up.
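
A minimal stand-alone sketch of the raw driver's explicit 'size'
option (node name, size and zvol path are made up, and the actual
attachment done by qemu-server may look different): an over-allocated
fleecing volume is clamped to the exact source size.

qemu-system-x86_64 --qmp stdio \
 --blockdev raw,node-name=fleecing0,size=10737418240,file.driver=host_device,file.filename=/dev/zvol/rpool/data/vm-123-disk-0-fleecing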


While initial tests seem fine, bitmap handling needs to be carefully
checked for correctness. More eyeballs can't hurt there.

QEMU patches are for the submodule for better reviewability. There are
unfortunately a few prerequisites which are also still being worked on
upstream. These are:

A fix for qcow2 block status querying when a qcow2 image is used as
the source [0]. Already reviewed and in the process of being pulled.

To be able to discard the fleecing image, the addition of a
discard-source parameter [1]. This series was adapted for downstream
and I tried to address the two remaining issues:

1. Permission issue when the backup source node is read-only (e.g.
TPM state): Made the permissions conditional on discard-source being
set, via a new option for the copy-before-write block driver.
Currently, it's part of QAPI; it would be nicer to make it
internal-only.

2. Cluster size issue when the fleecing image has a larger cluster
size than the backup target: Worked around it by also considering the
source image when calculating the cluster size for block copy, and
had to hack .bdrv_co_get_info implementations for snapshot-access and
copy-before-write. Not super confident in this, so better to wait for
an answer from upstream.

Upstream reports/discussions for these can also be found at [1].


No hard dependencies AFAICS, but of course pve-manager should depend
on both the new pve-guest-common and qemu-server to actually be able
to use the option.


[0]: https://lore.kernel.org/qemu-devel/20240116154839.401030-1-f.ebner@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/20240117160737.1057513-1-vsementsov@yandex-team.ru/

qemu:

Fiona Ebner (6):
  backup: factor out gathering device info into helper
  backup: get device info: code cleanup
  block/io: clear BDRV_BLOCK_RECURSE flag after recursing in
    bdrv_co_block_status
  block/{copy-before-write,snapshot-access}: implement bdrv_co_get_info
    driver callback
  block/block-copy: always consider source cluster size too
  PVE backup: add fleecing option

Vladimir Sementsov-Ogievskiy (2):
  block/copy-before-write: create block_copy bitmap in filter node
  qapi: blockdev-backup: add discard-source parameter

 block/backup.c                         |  15 +-
 block/block-copy.c                     |  36 ++--
 block/copy-before-write.c              |  46 ++++-
 block/copy-before-write.h              |   1 +
 block/io.c                             |  10 ++
 block/monitor/block-hmp-cmds.c         |   1 +
 block/replication.c                    |   4 +-
 block/snapshot-access.c                |   7 +
 blockdev.c                             |   2 +-
 include/block/block-copy.h             |   3 +-
 include/block/block_int-global-state.h |   2 +-
 pve-backup.c                           | 234 +++++++++++++++++++------
 qapi/block-core.json                   |  18 +-
 13 files changed, 300 insertions(+), 79 deletions(-)


guest-common:

Fiona Ebner (1):
  vzdump: schema: add fleecing property string

 src/PVE/VZDump/Common.pm | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)


manager:

Fiona Ebner (1):
  vzdump: handle new 'fleecing' property string

 PVE/VZDump.pm | 12 ++++++++++++
 1 file changed, 12 insertions(+)


qemu-server:

Fiona Ebner (2):
  backup: disk info: also keep track of size
  backup: implement fleecing option

 PVE/VZDump/QemuServer.pm | 141 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 139 insertions(+), 2 deletions(-)


docs:

Fiona Ebner (1):
  vzdump: add section about backup fleecing

 vzdump.adoc | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)


Summary over all repositories:
  17 files changed, 504 insertions(+), 81 deletions(-)

-- 
Generated by git-murpp 0.5.0




^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [PATCH qemu 01/13] backup: factor out gathering device info into helper
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 02/13] backup: get device info: code cleanup Fiona Ebner
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

No functional change intended. This should make it easier to read and
to add more logic on top (for backup fleecing).

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

git diff --patience to produce a better diff

Can be squashed into "PVE-Backup: Proxmox backup patches for QEMU".

 pve-backup.c | 112 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 67 insertions(+), 45 deletions(-)

diff --git a/pve-backup.c b/pve-backup.c
index 9c8b88d075..75af865437 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -576,54 +576,19 @@ static void create_backup_jobs_bh(void *opaque) {
     aio_co_enter(data->ctx, data->co);
 }
 
-UuidInfo coroutine_fn *qmp_backup(
-    const char *backup_file,
-    const char *password,
-    const char *keyfile,
-    const char *key_password,
-    const char *master_keyfile,
-    const char *fingerprint,
-    const char *backup_ns,
-    const char *backup_id,
-    bool has_backup_time, int64_t backup_time,
-    bool has_use_dirty_bitmap, bool use_dirty_bitmap,
-    bool has_compress, bool compress,
-    bool has_encrypt, bool encrypt,
-    bool has_format, BackupFormat format,
-    const char *config_file,
-    const char *firewall_file,
+/*
+ * Returns a list of device infos, which needs to be freed by the caller. In
+ * case of an error, errp will be set, but the returned value might still be a
+ * list.
+ */
+static GList coroutine_fn *get_device_info(
     const char *devlist,
-    bool has_speed, int64_t speed,
-    bool has_max_workers, int64_t max_workers,
     Error **errp)
 {
-    assert(qemu_in_coroutine());
-
-    qemu_co_mutex_lock(&backup_state.backup_mutex);
-
     BlockBackend *blk;
     BlockDriverState *bs = NULL;
-    Error *local_err = NULL;
-    uuid_t uuid;
-    VmaWriter *vmaw = NULL;
-    ProxmoxBackupHandle *pbs = NULL;
     gchar **devs = NULL;
     GList *di_list = NULL;
-    GList *l;
-    UuidInfo *uuid_info;
-
-    const char *config_name = "qemu-server.conf";
-    const char *firewall_name = "qemu-server.fw";
-
-    if (backup_state.di_list) {
-        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
-                  "previous backup not finished");
-        qemu_co_mutex_unlock(&backup_state.backup_mutex);
-        return NULL;
-    }
-
-    /* Todo: try to auto-detect format based on file name */
-    format = has_format ? format : BACKUP_FORMAT_VMA;
 
     if (devlist) {
         devs = g_strsplit_set(devlist, ",;:", -1);
@@ -668,6 +633,67 @@ UuidInfo coroutine_fn *qmp_backup(
         goto err;
     }
 
+err:
+    if (devs) {
+        g_strfreev(devs);
+    }
+
+    return di_list;
+}
+
+UuidInfo coroutine_fn *qmp_backup(
+    const char *backup_file,
+    const char *password,
+    const char *keyfile,
+    const char *key_password,
+    const char *master_keyfile,
+    const char *fingerprint,
+    const char *backup_ns,
+    const char *backup_id,
+    bool has_backup_time, int64_t backup_time,
+    bool has_use_dirty_bitmap, bool use_dirty_bitmap,
+    bool has_compress, bool compress,
+    bool has_encrypt, bool encrypt,
+    bool has_format, BackupFormat format,
+    const char *config_file,
+    const char *firewall_file,
+    const char *devlist,
+    bool has_speed, int64_t speed,
+    bool has_max_workers, int64_t max_workers,
+    Error **errp)
+{
+    assert(qemu_in_coroutine());
+
+    qemu_co_mutex_lock(&backup_state.backup_mutex);
+
+    Error *local_err = NULL;
+    uuid_t uuid;
+    VmaWriter *vmaw = NULL;
+    ProxmoxBackupHandle *pbs = NULL;
+    GList *di_list = NULL;
+    GList *l;
+    UuidInfo *uuid_info;
+
+    const char *config_name = "qemu-server.conf";
+    const char *firewall_name = "qemu-server.fw";
+
+    if (backup_state.di_list) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "previous backup not finished");
+        qemu_co_mutex_unlock(&backup_state.backup_mutex);
+        return NULL;
+    }
+
+    /* Todo: try to auto-detect format based on file name */
+    format = has_format ? format : BACKUP_FORMAT_VMA;
+
+    di_list = get_device_info(devlist, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto err;
+    }
+    assert(di_list);
+
     size_t total = 0;
 
     l = di_list;
@@ -954,10 +980,6 @@ err:
     g_list_free(di_list);
     backup_state.di_list = NULL;
 
-    if (devs) {
-        g_strfreev(devs);
-    }
-
     if (vmaw) {
         Error *err = NULL;
         vma_writer_close(vmaw, &err);
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [PATCH qemu 02/13] backup: get device info: code cleanup
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 01/13] backup: factor out gathering device info into helper Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 03/13] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status Fiona Ebner
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

Make variables more local. Put the failure case for !blk first to
avoid an additional, indented else block.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Can be squashed into "PVE-Backup: Proxmox backup patches for QEMU".

 pve-backup.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/pve-backup.c b/pve-backup.c
index 75af865437..4b0dcca246 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -585,8 +585,6 @@ static GList coroutine_fn *get_device_info(
     const char *devlist,
     Error **errp)
 {
-    BlockBackend *blk;
-    BlockDriverState *bs = NULL;
     gchar **devs = NULL;
     GList *di_list = NULL;
 
@@ -595,29 +593,26 @@ static GList coroutine_fn *get_device_info(
 
         gchar **d = devs;
         while (d && *d) {
-            blk = blk_by_name(*d);
-            if (blk) {
-                bs = blk_bs(blk);
-                if (!bdrv_co_is_inserted(bs)) {
-                    error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, *d);
-                    goto err;
-                }
-                PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
-                di->bs = bs;
-                di_list = g_list_append(di_list, di);
-            } else {
+            BlockBackend *blk = blk_by_name(*d);
+            if (!blk) {
                 error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
                           "Device '%s' not found", *d);
                 goto err;
             }
+            BlockDriverState *bs = blk_bs(blk);
+            if (!bdrv_co_is_inserted(bs)) {
+                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, *d);
+                goto err;
+            }
+            PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
+            di->bs = bs;
+            di_list = g_list_append(di_list, di);
             d++;
         }
-
     } else {
         BdrvNextIterator it;
 
-        bs = NULL;
-        for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
+        for (BlockDriverState *bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
             if (!bdrv_co_is_inserted(bs) || bdrv_is_read_only(bs)) {
                 continue;
             }
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [PATCH qemu 03/13] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 01/13] backup: factor out gathering device info into helper Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 02/13] backup: get device info: code cleanup Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 04/13] block/copy-before-write: create block_copy bitmap in filter node Fiona Ebner
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

Using fleecing backup like in [0] on a qcow2 image (with metadata
preallocation) can lead to the following assertion failure:

> bdrv_co_do_block_status: Assertion `!(ret & BDRV_BLOCK_ZERO)' failed.

In the reproducer [0], it happens because the BDRV_BLOCK_RECURSE flag
will be set by the qcow2 driver, so the caller will recursively check
the file child. Then the BDRV_BLOCK_ZERO flag gets set too. Later up
the call chain, in bdrv_co_do_block_status() for the snapshot-access
driver, the assertion failure will happen, because both flags are set.

To fix it, clear the recurse flag after the recursive check was done.

In detail:

> #0  qcow2_co_block_status

Returns 0x45 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA |
BDRV_BLOCK_OFFSET_VALID.

> #1  bdrv_co_do_block_status

Because of the data flag, bdrv_co_do_block_status() will now also set
BDRV_BLOCK_ALLOCATED. Because of the recurse flag,
bdrv_co_do_block_status() for the bdrv_file child will be called,
which returns 0x16 = BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID |
BDRV_BLOCK_ZERO. Now the return value inherits the zero flag.

Returns 0x57 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA |
BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

> #2  bdrv_co_common_block_status_above
> #3  bdrv_co_block_status_above
> #4  bdrv_co_block_status
> #5  cbw_co_snapshot_block_status
> #6  bdrv_co_snapshot_block_status
> #7  snapshot_access_co_block_status
> #8  bdrv_co_do_block_status

Return value is propagated all the way up to here, where the assertion
failure happens, because BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO are
both set.

> #9  bdrv_co_common_block_status_above
> #10 bdrv_co_block_status_above
> #11 block_copy_block_status
> #12 block_copy_dirty_clusters
> #13 block_copy_common
> #14 block_copy_async_co_entry
> #15 coroutine_trampoline

[0]:

> #!/bin/bash
> rm /tmp/disk.qcow2
> ./qemu-img create /tmp/disk.qcow2 -o preallocation=metadata -f qcow2 1G
> ./qemu-img create /tmp/fleecing.qcow2 -f qcow2 1G
> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
> --blockdev qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2 \
> --blockdev qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---

Can be added as an extra/ patch.

 block/io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/io.c b/block/io.c
index 63f7b3ad3e..24a3c84c93 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2589,6 +2589,16 @@ bdrv_co_block_status(BlockDriverState *bs, bool want_zero,
                 ret |= (ret2 & BDRV_BLOCK_ZERO);
             }
         }
+
+        /*
+         * Now that the recursive search was done, clear the flag. Otherwise,
+         * with more complicated block graphs like snapshot-access ->
+         * copy-before-write -> qcow2, where the return value will be propagated
+         * further up to a parent bdrv_co_do_block_status() call, both the
+         * BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO flags would be set, which is
+         * not allowed.
+         */
+        ret &= ~BDRV_BLOCK_RECURSE;
     }
 
 out:
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC qemu 04/13] block/copy-before-write: create block_copy bitmap in filter node
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (2 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [PATCH qemu 03/13] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 05/13] qapi: blockdev-backup: add discard-source parameter Fiona Ebner
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

From: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

Currently block_copy creates copy_bitmap in the source node. But that
interacts badly with .independent_close=true of the copy-before-write
filter: the source node may be detached and removed before the
.bdrv_close() handler is called, which should call
block_copy_state_free(), which in turn should remove copy_bitmap.

That's all not ideal: it would be better if internal bitmap of
block-copy object is not attached to any node. But that is not possible
now.

The simplest solution is just create copy_bitmap in filter node, where
anyway two other bitmaps are created.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
 block/block-copy.c         | 3 ++-
 block/copy-before-write.c  | 2 +-
 include/block/block-copy.h | 1 +
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index e13d7bc6b6..b61685f1a2 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -346,6 +346,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
+                                     BlockDriverState *copy_bitmap_bs,
                                      const BdrvDirtyBitmap *bitmap,
                                      Error **errp)
 {
@@ -360,7 +361,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
         return NULL;
     }
 
-    copy_bitmap = bdrv_create_dirty_bitmap(source->bs, cluster_size, NULL,
+    copy_bitmap = bdrv_create_dirty_bitmap(copy_bitmap_bs, cluster_size, NULL,
                                            errp);
     if (!copy_bitmap) {
         return NULL;
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index b866e42271..41cb0955c9 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -456,7 +456,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
             ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
              bs->file->bs->supported_zero_flags);
 
-    s->bcs = block_copy_state_new(bs->file, s->target, bitmap, errp);
+    s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap, errp);
     if (!s->bcs) {
         error_prepend(errp, "Cannot create block-copy-state: ");
         ret = -EINVAL;
diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 0700953ab8..8b41643bfa 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -25,6 +25,7 @@ typedef struct BlockCopyState BlockCopyState;
 typedef struct BlockCopyCallState BlockCopyCallState;
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
+                                     BlockDriverState *copy_bitmap_bs,
                                      const BdrvDirtyBitmap *bitmap,
                                      Error **errp);
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC qemu 05/13] qapi: blockdev-backup: add discard-source parameter
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (3 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 04/13] block/copy-before-write: create block_copy bitmap in filter node Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback Fiona Ebner
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

From: Vladimir Sementsov-Ogievskiy <vladimir.sementsov-ogievskiy@openvz.org>

Add a parameter that enables discard-after-copy. That is mostly useful
in the "push backup with fleecing" scheme, where the source is a
snapshot-access format driver node based on the copy-before-write
filter's snapshot-access API:

[guest]      [snapshot-access] ~~ blockdev-backup ~~> [backup target]
   |            |
   | root       | file
   v            v
[copy-before-write]
   |             |
   | file        | target
   v             v
[active disk]   [temp.img]

In this case discard-after-copy does two things:

 - discard data in temp.img to save disk space
 - avoid further copy-before-write operation in discarded area

Note that we have to declare WRITE permission on the source in the
copy-before-write filter for discard to work. This is done via a new
source-write-perm open option for copy-before-write, because it needs
to happen conditionally, only when discard_source is set. It cannot be
set for the usual backup case, because there, the associated virtual
hardware device already requires exclusive write.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
[FE: adapt for Proxmox downstream 8.1
     fix discard at end of image when not aligned to cluster size
     add an option for setting the write permission on the cbw
       source child conditionally during open and adapt commit message]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

It'd be better if the source-write-perm option for copy-before-write
were not exposed via QAPI, but kept internal only.
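
For reviewers, a rough stand-alone sketch of the scheme from the
commit message, adapted from the reproducer in patch 03/13. Node names
and paths are made up and only meant to show where the new
discard-source and source-write-perm knobs plug in; this is not how
qemu-server drives it:

qemu-img create -f qcow2 /tmp/disk.qcow2 1G
qemu-img create -f qcow2 /tmp/fleecing.qcow2 1G
qemu-img create -f qcow2 /tmp/backup.qcow2 1G
qemu-system-x86_64 --qmp stdio \
 --blockdev qcow2,node-name=disk0,file.driver=file,file.filename=/tmp/disk.qcow2 \
 --blockdev qcow2,node-name=fleecing0,file.driver=file,file.filename=/tmp/fleecing.qcow2 \
 --blockdev qcow2,node-name=backup0,file.driver=file,file.filename=/tmp/backup.qcow2 \
<<EOF
{"execute": "qmp_capabilities"}
{"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "node-name": "cbw0", "file": "disk0", "target": "fleecing0", "source-write-perm": true } }
{"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "node-name": "snap0", "file": "cbw0" } }
{"execute": "blockdev-backup", "arguments": { "job-id": "job0", "device": "snap0", "target": "backup0", "sync": "full", "discard-source": true } }
EOF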

 block/backup.c                         |  7 ++++---
 block/block-copy.c                     | 11 +++++++++--
 block/copy-before-write.c              | 14 ++++++++++++++
 block/copy-before-write.h              |  1 +
 block/replication.c                    |  4 ++--
 blockdev.c                             |  2 +-
 include/block/block-copy.h             |  2 +-
 include/block/block_int-global-state.h |  2 +-
 pve-backup.c                           |  2 +-
 qapi/block-core.json                   | 10 +++++++++-
 10 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index af87fa6aa9..51f9d28115 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -332,7 +332,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
                   BlockDriverState *target, int64_t speed,
                   MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
                   BitmapSyncMode bitmap_mode,
-                  bool compress,
+                  bool compress, bool discard_source,
                   const char *filter_node_name,
                   BackupPerf *perf,
                   BlockdevOnError on_source_error,
@@ -429,7 +429,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
         goto error;
     }
 
-    cbw = bdrv_cbw_append(bs, target, filter_node_name, &bcs, errp);
+    cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, &bcs, errp);
     if (!cbw) {
         goto error;
     }
@@ -471,7 +471,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     job->len = len;
     job->perf = *perf;
 
-    block_copy_set_copy_opts(bcs, perf->use_copy_range, compress);
+    block_copy_set_copy_opts(bcs, perf->use_copy_range, compress,
+                             discard_source);
     block_copy_set_progress_meter(bcs, &job->common.job.progress);
     block_copy_set_speed(bcs, speed);
 
diff --git a/block/block-copy.c b/block/block-copy.c
index b61685f1a2..05cfccfda8 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -137,6 +137,7 @@ typedef struct BlockCopyState {
     CoMutex lock;
     int64_t in_flight_bytes;
     BlockCopyMethod method;
+    bool discard_source;
     BlockReqList reqs;
     QLIST_HEAD(, BlockCopyCallState) calls;
     /*
@@ -282,11 +283,12 @@ static uint32_t block_copy_max_transfer(BdrvChild *source, BdrvChild *target)
 }
 
 void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
-                              bool compress)
+                              bool compress, bool discard_source)
 {
     /* Keep BDRV_REQ_SERIALISING set (or not set) in block_copy_state_new() */
     s->write_flags = (s->write_flags & BDRV_REQ_SERIALISING) |
         (compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
+    s->discard_source = discard_source;
 
     if (s->max_transfer < s->cluster_size) {
         /*
@@ -409,7 +411,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
                                     cluster_size),
     };
 
-    block_copy_set_copy_opts(s, false, false);
+    block_copy_set_copy_opts(s, false, false, false);
 
     ratelimit_init(&s->rate_limit);
     qemu_co_mutex_init(&s->lock);
@@ -580,6 +582,11 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
     co_put_to_shres(s->mem, t->req.bytes);
     block_copy_task_end(t, ret);
 
+    if (s->discard_source && ret == 0) {
+        int64_t nbytes = MIN(t->req.offset + t->req.bytes, s->len) - t->req.offset;
+        bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+    }
+
     return ret;
 }
 
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 41cb0955c9..87f047ad5f 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -44,6 +44,7 @@ typedef struct BDRVCopyBeforeWriteState {
     BdrvChild *target;
     OnCbwError on_cbw_error;
     uint32_t cbw_timeout_ns;
+    bool source_write_perm;
 
     /*
      * @lock: protects access to @access_bitmap, @done_bitmap and
@@ -363,6 +364,11 @@ static void cbw_child_perm(BlockDriverState *bs, BdrvChild *c,
         bdrv_default_perms(bs, c, role, reopen_queue,
                            perm, shared, nperm, nshared);
 
+        BDRVCopyBeforeWriteState *s = bs->opaque;
+        if (s->source_write_perm) {
+            *nperm = *nperm | BLK_PERM_WRITE;
+        }
+
         if (!QLIST_EMPTY(&bs->parents)) {
             if (perm & BLK_PERM_WRITE) {
                 *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
@@ -422,6 +428,12 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
     assert(full_opts->driver == BLOCKDEV_DRIVER_COPY_BEFORE_WRITE);
     opts = &full_opts->u.copy_before_write;
 
+    if (opts->source_write_perm) {
+        s->source_write_perm = true;
+    } else {
+        s->source_write_perm = false;
+    }
+
     ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
     if (ret < 0) {
         return ret;
@@ -530,6 +542,7 @@ BlockDriver bdrv_cbw_filter = {
 BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
+                                  bool source_write_perm,
                                   BlockCopyState **bcs,
                                   Error **errp)
 {
@@ -547,6 +560,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
     }
     qdict_put_str(opts, "file", bdrv_get_node_name(source));
     qdict_put_str(opts, "target", bdrv_get_node_name(target));
+    qdict_put_bool(opts, "source-write-perm", source_write_perm);
 
     top = bdrv_insert_node(source, opts, BDRV_O_RDWR, errp);
     if (!top) {
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 6e72bb25e9..03ee708d10 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -39,6 +39,7 @@
 BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
+                                  bool source_write_perm,
                                   BlockCopyState **bcs,
                                   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/block/replication.c b/block/replication.c
index ea4bf1aa80..39ad78cf98 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -579,8 +579,8 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 
         s->backup_job = backup_job_create(
                                 NULL, s->secondary_disk->bs, s->hidden_disk->bs,
-                                0, MIRROR_SYNC_MODE_NONE, NULL, 0, false, NULL,
-                                &perf,
+                                0, MIRROR_SYNC_MODE_NONE, NULL, 0, false, false,
+                                NULL, &perf,
                                 BLOCKDEV_ON_ERROR_REPORT,
                                 BLOCKDEV_ON_ERROR_REPORT, JOB_INTERNAL,
                                 backup_job_completed, bs, NULL, &local_err);
diff --git a/blockdev.c b/blockdev.c
index 7793143d76..ce3fef924c 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2802,7 +2802,7 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 
     job = backup_job_create(backup->job_id, bs, target_bs, backup->speed,
                             backup->sync, bmap, backup->bitmap_mode,
-                            backup->compress,
+                            backup->compress, backup->discard_source,
                             backup->filter_node_name,
                             &perf,
                             backup->on_source_error,
diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 8b41643bfa..9555de2562 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -31,7 +31,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
 
 /* Function should be called prior any actual copy request */
 void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
-                              bool compress);
+                              bool compress, bool discard_source);
 void block_copy_set_progress_meter(BlockCopyState *s, ProgressMeter *pm);
 
 void block_copy_state_free(BlockCopyState *s);
diff --git a/include/block/block_int-global-state.h b/include/block/block_int-global-state.h
index 32f0f9858a..546f2b5532 100644
--- a/include/block/block_int-global-state.h
+++ b/include/block/block_int-global-state.h
@@ -189,7 +189,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
                             MirrorSyncMode sync_mode,
                             BdrvDirtyBitmap *sync_bitmap,
                             BitmapSyncMode bitmap_mode,
-                            bool compress,
+                            bool compress, bool discard_source,
                             const char *filter_node_name,
                             BackupPerf *perf,
                             BlockdevOnError on_source_error,
diff --git a/pve-backup.c b/pve-backup.c
index 4b0dcca246..4bf641ce5a 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -518,7 +518,7 @@ static void create_backup_jobs_bh(void *opaque) {
 
         BlockJob *job = backup_job_create(
             NULL, di->bs, di->target, backup_state.speed, sync_mode, di->bitmap,
-            bitmap_mode, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
+            bitmap_mode, false, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
             BLOCKDEV_ON_ERROR_REPORT, JOB_DEFAULT, pvebackup_complete_cb, di, backup_state.txn,
             &local_err);
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 48eb47c6ea..51ce786bc5 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1816,6 +1816,9 @@
 #     node specified by @drive.  If this option is not given, a node
 #     name is autogenerated.  (Since: 4.2)
 #
+# @discard-source: Discard blocks on source which are already copied to the
+#     target.  (Since 8.1)
+#
 # @x-perf: Performance options.  (Since 6.0)
 #
 # Features:
@@ -1837,6 +1840,7 @@
             '*on-target-error': 'BlockdevOnError',
             '*auto-finalize': 'bool', '*auto-dismiss': 'bool',
             '*filter-node-name': 'str',
+            '*discard-source': 'bool',
             '*x-perf': { 'type': 'BackupPerf',
                          'features': [ 'unstable' ] } } }
 
@@ -4817,12 +4821,16 @@
 #     @on-cbw-error parameter will decide how this failure is handled.
 #     Default 0. (Since 7.1)
 #
+# @source-write-perm: For internal use only. Set write permission
+#     for the source child.  (Since 8.1)
+#
 # Since: 6.2
 ##
 { 'struct': 'BlockdevOptionsCbw',
   'base': 'BlockdevOptionsGenericFormat',
   'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
-            '*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } }
+            '*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32',
+            '*source-write-perm': 'bool' } }
 
 ##
 # @BlockdevOptions:
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (4 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 05/13] qapi: blockdev-backup: add discard-source parameter Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-29 14:35   ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [HACK qemu 07/13] block/block-copy: always consider source cluster size too Fiona Ebner
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

In preparation to fix an issue for backup fleecing where discarding
the source would lead to an assertion failure when the fleecing image
has larger granularity than the backup target.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Still need to wait on a response from upstream. Use this hack for now,
so that the RFC as a whole doesn't have to wait.

 block/copy-before-write.c | 30 ++++++++++++++++++++++++++++++
 block/snapshot-access.c   |  7 +++++++
 2 files changed, 37 insertions(+)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 87f047ad5f..961e7439ad 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -322,6 +322,35 @@ cbw_co_snapshot_block_status(BlockDriverState *bs,
     return ret;
 }
 
+static int coroutine_fn
+cbw_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVCopyBeforeWriteState *s = bs->opaque;
+
+    BlockDriverInfo local_bdi;
+    int64_t cluster_size = 0;
+
+    int ret, ret2;
+
+    ret = bdrv_co_get_info(bs->file->bs, &local_bdi);
+    if (ret == 0) {
+        cluster_size = MAX(cluster_size, local_bdi.cluster_size);
+    }
+
+    ret2 = bdrv_co_get_info(s->target->bs, &local_bdi);
+    if (ret2 == 0) {
+        cluster_size = MAX(cluster_size, local_bdi.cluster_size);
+    }
+
+    if (ret != 0 && ret2 != 0) {
+        return ret;
+    }
+
+    bdi->cluster_size = cluster_size;
+
+    return 0;
+}
+
 static int coroutine_fn GRAPH_RDLOCK
 cbw_co_pdiscard_snapshot(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
@@ -531,6 +560,7 @@ BlockDriver bdrv_cbw_filter = {
     .bdrv_co_preadv_snapshot       = cbw_co_preadv_snapshot,
     .bdrv_co_pdiscard_snapshot     = cbw_co_pdiscard_snapshot,
     .bdrv_co_snapshot_block_status = cbw_co_snapshot_block_status,
+    .bdrv_co_get_info              = cbw_co_get_info,
 
     .bdrv_refresh_filename      = cbw_refresh_filename,
 
diff --git a/block/snapshot-access.c b/block/snapshot-access.c
index 67ea339da9..5aa20aaa1f 100644
--- a/block/snapshot-access.c
+++ b/block/snapshot-access.c
@@ -25,6 +25,7 @@
 #include "sysemu/block-backend.h"
 #include "qemu/cutils.h"
 #include "block/block_int.h"
+#include "block/copy-before-write.h"
 
 static int coroutine_fn GRAPH_RDLOCK
 snapshot_access_co_preadv_part(BlockDriverState *bs,
@@ -72,6 +73,11 @@ snapshot_access_co_pwritev_part(BlockDriverState *bs,
     return -ENOTSUP;
 }
 
+static int coroutine_fn
+snapshot_access_co_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    return bdrv_co_get_info(bs->file->bs, bdi);
+}
 
 static void snapshot_access_refresh_filename(BlockDriverState *bs)
 {
@@ -118,6 +124,7 @@ BlockDriver bdrv_snapshot_access_drv = {
     .bdrv_co_pwrite_zeroes      = snapshot_access_co_pwrite_zeroes,
     .bdrv_co_pdiscard           = snapshot_access_co_pdiscard,
     .bdrv_co_block_status       = snapshot_access_co_block_status,
+    .bdrv_co_get_info           = snapshot_access_co_get_info,
 
     .bdrv_refresh_filename      = snapshot_access_refresh_filename,
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [HACK qemu 07/13] block/block-copy: always consider source cluster size too
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (5 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 08/13] PVE backup: add fleecing option Fiona Ebner
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

With backup fleecing, it might be necessary to discard the source.
There will be an assertion failure if bitmaps on the source side have
a bigger granularity than the block copy's cluster size, so just
consider the source side too.

This also supersedes the hunk in block/backup.c added by
"PVE-Backup: add backup-dump block driver".

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Still need to wait on a response from upstream. Use this hack for now,
so that the RFC as a whole doesn't have to wait.
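
A hedged way to produce the problematic constellation this works
around (paths follow the reproducer in patch 03/13, the cluster sizes
are arbitrary): give the fleecing image a larger qcow2 cluster size
than the backup target's default 64k and run the fleecing setup with
discard-source on top.

qemu-img create -f qcow2 -o cluster_size=2M /tmp/fleecing.qcow2 1G
qemu-img create -f qcow2 /tmp/backup.qcow2 1G   # default 64k clusters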

 block/backup.c     |  8 --------
 block/block-copy.c | 22 ++++++++++++++--------
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 51f9d28115..7950bff27e 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -435,14 +435,6 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     }
 
     cluster_size = block_copy_cluster_size(bcs);
-    if (cluster_size < 0) {
-        goto error;
-    }
-
-    BlockDriverInfo bdi;
-    if (bdrv_get_info(bs, &bdi) == 0) {
-        cluster_size = MAX(cluster_size, bdi.cluster_size);
-    }
 
     if (perf->max_chunk && perf->max_chunk < cluster_size) {
         error_setg(errp, "Required max-chunk (%" PRIi64 ") is less than backup "
diff --git a/block/block-copy.c b/block/block-copy.c
index 05cfccfda8..cf750ef670 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -310,13 +310,19 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
     }
 }
 
-static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
+static int64_t block_copy_calculate_cluster_size(BlockDriverState *source,
+                                                 BlockDriverState *target,
                                                  Error **errp)
 {
     int ret;
+    int64_t source_cluster_size = BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
     BlockDriverInfo bdi;
     bool target_does_cow = bdrv_backing_chain_next(target);
 
+    if (bdrv_get_info(source, &bdi) == 0) {
+        source_cluster_size = MAX(source_cluster_size, bdi.cluster_size);
+    }
+
     /*
      * If there is no backing file on the target, we cannot rely on COW if our
      * backup cluster size is smaller than the target cluster size. Even for
@@ -327,11 +333,11 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
         /* Cluster size is not defined */
         warn_report("The target block device doesn't provide "
                     "information about the block size and it doesn't have a "
-                    "backing file. The default block size of %u bytes is "
+                    "backing file. The source's or default block size of %ld bytes is "
                     "used. If the actual block size of the target exceeds "
-                    "this default, the backup may be unusable",
-                    BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
-        return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+                    "this value, the backup may be unusable",
+                    source_cluster_size);
+        return source_cluster_size;
     } else if (ret < 0 && !target_does_cow) {
         error_setg_errno(errp, -ret,
             "Couldn't determine the cluster size of the target image, "
@@ -341,10 +347,10 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
         return ret;
     } else if (ret < 0 && target_does_cow) {
         /* Not fatal; just trudge on ahead. */
-        return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+        return source_cluster_size;
     }
 
-    return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
+    return MAX(source_cluster_size, bdi.cluster_size);
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
@@ -358,7 +364,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
     BdrvDirtyBitmap *copy_bitmap;
     bool is_fleecing;
 
-    cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
+    cluster_size = block_copy_calculate_cluster_size(source->bs, target->bs, errp);
     if (cluster_size < 0) {
         return NULL;
     }
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC qemu 08/13] PVE backup: add fleecing option
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (6 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [HACK qemu 07/13] block/block-copy: always consider source cluster size too Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string Fiona Ebner
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

When a fleecing option is given, it is expected that each device has
a corresponding "-fleecing" block device already attached, except for
EFI disk and TPM state, where fleecing is never used.

The following graph was adapted from [0] which also contains more
details about fleecing.

[guest]
   |
   | root
   v                 file
[copy-before-write]<------[snapshot-access]
   |           |
   | file      | target
   v           v
[source] [fleecing]

For fleecing, a copy-before-write filter is inserted on top of the
source node, as well as a snapshot-access node pointing to the filter
node which allows to read the consistent state of the image at the
time it was inserted. New guest writes are passed through the
copy-before-write filter which will first copy over old data to the
fleecing image in case that old data is still needed by the
snapshot-access node.

The backup process will sequentially read from the snapshot access,
which has a bitmap and knows whether to read from the original image
or the fleecing image to get the "snapshot" state, i.e. data from the
source image at the time when the copy-before-write filter was
inserted. After reading, the copied sections are discarded from the
fleecing image to reduce space usage.

All of this can be restricted by an initial dirty bitmap to parts of
the source image that are required for an incremental backup.

Additionally, the cbw-timeout and on-cbw-error=break-snapshot options
are set when installing the copy-before-write filter and
snapshot-access. When an error or timeout occurs, the problematic (and
each further) snapshot operation will fail and thus cancel the backup
instead of breaking the guest write.

Note that job_id cannot be inferred from the snapshot-access bs because
it has no parent, so just pass the one from the original bs.

[0]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg876056.html

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

While initial tests seem fine, bitmap handling needs to be carefully
checked for correctness. More eyeballs can't hurt there.
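
For review purposes, a hedged QMP-level approximation of the node
setup this patch performs internally per drive (the real code builds
the QDicts in C and goes through qmp_backup; VMID, drive and bitmap
names below are made up):

socat - UNIX-CONNECT:/var/run/qemu-server/123.qmp <<'QMP'
{"execute": "qmp_capabilities"}
{"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "node-name": "drive-scsi0-cbw", "file": "drive-scsi0", "target": "drive-scsi0-fleecing", "bitmap": { "node": "drive-scsi0", "name": "pbs-incremental-dirty-bitmap" }, "on-cbw-error": "break-snapshot", "cbw-timeout": 45 } }
{"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "node-name": "drive-scsi0-snap", "file": "drive-scsi0-cbw" } }
QMP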

 block/monitor/block-hmp-cmds.c |   1 +
 pve-backup.c                   | 125 ++++++++++++++++++++++++++++++++-
 qapi/block-core.json           |   8 ++-
 3 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 6efe28cef5..ca29cc4281 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -1064,6 +1064,7 @@ void coroutine_fn hmp_backup(Monitor *mon, const QDict *qdict)
         NULL, NULL,
         devlist, qdict_haskey(qdict, "speed"), speed,
         false, 0, // BackupPerf max-workers
+        false, false, // fleecing
         &error);
 
     hmp_handle_error(mon, error);
diff --git a/pve-backup.c b/pve-backup.c
index 4bf641ce5a..602b857700 100644
--- a/pve-backup.c
+++ b/pve-backup.c
@@ -7,8 +7,10 @@
 #include "sysemu/blockdev.h"
 #include "block/block_int-global-state.h"
 #include "block/blockjob.h"
+#include "block/copy-before-write.h"
 #include "block/dirty-bitmap.h"
 #include "qapi/qapi-commands-block.h"
+#include "qapi/qmp/qdict.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/cutils.h"
 
@@ -80,8 +82,15 @@ static void pvebackup_init(void)
 // initialize PVEBackupState at startup
 opts_init(pvebackup_init);
 
+typedef struct PVEBackupFleecingInfo {
+    BlockDriverState *bs;
+    BlockDriverState *cbw;
+    BlockDriverState *snapshot_access;
+} PVEBackupFleecingInfo;
+
 typedef struct PVEBackupDevInfo {
     BlockDriverState *bs;
+    PVEBackupFleecingInfo fleecing;
     size_t size;
     uint64_t block_size;
     uint8_t dev_id;
@@ -356,6 +365,25 @@ static void pvebackup_complete_cb(void *opaque, int ret)
     PVEBackupDevInfo *di = opaque;
     di->completed_ret = ret;
 
+    /*
+     * Handle block-graph specific cleanup (for fleecing) outside of the coroutine, because the work
+     * won't be done as a coroutine anyways:
+     * - For snapshot_access, allows doing bdrv_unref() directly. Doing it via bdrv_co_unref() would
+     *   just spawn a BH calling bdrv_unref().
+     * - For cbw, draining would need to spawn a BH.
+     *
+     * Note that the AioContext lock is already acquired by our caller, i.e.
+     * job_finalize_single_locked()
+     */
+    if (di->fleecing.snapshot_access) {
+        bdrv_unref(di->fleecing.snapshot_access);
+        di->fleecing.snapshot_access = NULL;
+    }
+    if (di->fleecing.cbw) {
+        bdrv_cbw_drop(di->fleecing.cbw);
+        di->fleecing.cbw = NULL;
+    }
+
     /*
      * Schedule stream cleanup in async coroutine. close_image and finish might
      * take a while, so we can't block on them here. This way it also doesn't
@@ -516,9 +544,64 @@ static void create_backup_jobs_bh(void *opaque) {
 
         bdrv_drained_begin(di->bs);
 
+        BlockDriverState *source_bs = di->bs;
+        bool discard_source = false;
+        const char *job_id = bdrv_get_device_name(di->bs);
+        if (di->fleecing.bs) {
+            QDict *cbw_opts = qdict_new();
+            qdict_put_str(cbw_opts, "driver", "copy-before-write");
+            qdict_put_str(cbw_opts, "file", bdrv_get_node_name(di->bs));
+            qdict_put_str(cbw_opts, "target", bdrv_get_node_name(di->fleecing.bs));
+
+            if (di->bitmap) {
+                /*
+                 * Only guest writes to parts relevant for the backup need to be intercepted with
+                 * old data being copied to the fleecing image.
+                 */
+                qdict_put_str(cbw_opts, "bitmap.node", bdrv_get_node_name(di->bs));
+                qdict_put_str(cbw_opts, "bitmap.name", bdrv_dirty_bitmap_name(di->bitmap));
+            }
+            /*
+             * Fleecing storage is supposed to be fast and it's better to break backup than guest
+             * writes. Certain guest drivers like VirtIO-win have 60 seconds timeout by default, so
+             * abort a bit before that.
+             */
+            qdict_put_str(cbw_opts, "on-cbw-error", "break-snapshot");
+            qdict_put_int(cbw_opts, "cbw-timeout", 45);
+
+            di->fleecing.cbw = bdrv_insert_node(di->bs, cbw_opts, BDRV_O_RDWR, &local_err);
+
+            if (!di->fleecing.cbw) {
+                error_setg(errp, "appending cbw node for fleecing failed: %s",
+                           local_err ? error_get_pretty(local_err) : "unknown error");
+                break;
+            }
+
+            QDict *snapshot_access_opts = qdict_new();
+            qdict_put_str(snapshot_access_opts, "driver", "snapshot-access");
+            qdict_put_str(snapshot_access_opts, "file", bdrv_get_node_name(di->fleecing.cbw));
+
+            /*
+             * Holding the AioContext lock here would cause a deadlock, because bdrv_open_driver()
+             * will aquire it a second time. But it's allowed to be held exactly once when polling
+             * and that happens when the bdrv_refresh_total_sectors() call is made there.
+             */
+            aio_context_release(aio_context);
+            di->fleecing.snapshot_access =
+                bdrv_open(NULL, NULL, snapshot_access_opts, BDRV_O_RDWR | BDRV_O_UNMAP, &local_err);
+            aio_context_acquire(aio_context);
+            if (!di->fleecing.snapshot_access) {
+                error_setg(errp, "setting up snapshot access for fleecing failed: %s",
+                           local_err ? error_get_pretty(local_err) : "unknown error");
+                break;
+            }
+            source_bs = di->fleecing.snapshot_access;
+            discard_source = true;
+        }
+
         BlockJob *job = backup_job_create(
-            NULL, di->bs, di->target, backup_state.speed, sync_mode, di->bitmap,
-            bitmap_mode, false, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
+            job_id, source_bs, di->target, backup_state.speed, sync_mode, di->bitmap,
+            bitmap_mode, false, discard_source, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
             BLOCKDEV_ON_ERROR_REPORT, JOB_DEFAULT, pvebackup_complete_cb, di, backup_state.txn,
             &local_err);
 
@@ -576,6 +659,14 @@ static void create_backup_jobs_bh(void *opaque) {
     aio_co_enter(data->ctx, data->co);
 }
 
+/*
+ * EFI disk and TPM state are small and it's just not worth setting up fleecing for them.
+ */
+static bool device_uses_fleecing(const char *device_id)
+{
+    return strncmp(device_id, "drive-efidisk", 13) && strncmp(device_id, "drive-tpmstate", 14);
+}
+
 /*
  * Returns a list of device infos, which needs to be freed by the caller. In
  * case of an error, errp will be set, but the returned value might still be a
@@ -583,6 +674,7 @@ static void create_backup_jobs_bh(void *opaque) {
  */
 static GList coroutine_fn *get_device_info(
     const char *devlist,
+    bool fleecing,
     Error **errp)
 {
     gchar **devs = NULL;
@@ -606,6 +698,31 @@ static GList coroutine_fn *get_device_info(
             }
             PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
             di->bs = bs;
+
+            if (fleecing && device_uses_fleecing(*d)) {
+                g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
+                BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
+                if (!fleecing_blk) {
+                    error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+                              "Device '%s' not found", fleecing_devid);
+                    goto err;
+                }
+                BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
+                if (!bdrv_co_is_inserted(fleecing_bs)) {
+                    error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
+                    goto err;
+                }
+                /*
+                 * Fleecing image needs to be the same size to act as a cbw target.
+                 */
+                if (bs->total_sectors != fleecing_bs->total_sectors) {
+                    error_setg(errp, "Size mismatch for '%s' - sector count %ld != %ld",
+                               fleecing_devid, fleecing_bs->total_sectors, bs->total_sectors);
+                    goto err;
+                }
+                di->fleecing.bs = fleecing_bs;
+            }
+
             di_list = g_list_append(di_list, di);
             d++;
         }
@@ -655,6 +772,7 @@ UuidInfo coroutine_fn *qmp_backup(
     const char *devlist,
     bool has_speed, int64_t speed,
     bool has_max_workers, int64_t max_workers,
+    bool has_fleecing, bool fleecing,
     Error **errp)
 {
     assert(qemu_in_coroutine());
@@ -682,7 +800,7 @@ UuidInfo coroutine_fn *qmp_backup(
     /* Todo: try to auto-detect format based on file name */
     format = has_format ? format : BACKUP_FORMAT_VMA;
 
-    di_list = get_device_info(devlist, &local_err);
+    di_list = get_device_info(devlist, has_fleecing && fleecing, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         goto err;
@@ -1081,5 +1199,6 @@ ProxmoxSupportStatus *qmp_query_proxmox_support(Error **errp)
     ret->query_bitmap_info = true;
     ret->pbs_masterkey = true;
     ret->backup_max_workers = true;
+    ret->backup_fleecing = true;
     return ret;
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 51ce786bc5..114145d8c0 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -933,6 +933,10 @@
 #
 # @max-workers: see @BackupPerf for details. Default 16.
 #
+# @fleecing: perform a backup with fleecing. For each device in @devlist, a
+#            corresponing '-fleecing' device with the same size already needs to
+#            be present.
+#
 # Returns: the uuid of the backup job
 #
 ##
@@ -953,7 +957,8 @@
                                     '*firewall-file': 'str',
                                     '*devlist': 'str',
                                     '*speed': 'int',
-                                    '*max-workers': 'int' },
+                                    '*max-workers': 'int',
+                                    '*fleecing': 'bool' },
   'returns': 'UuidInfo', 'coroutine': true }
 
 ##
@@ -1009,6 +1014,7 @@
             'pbs-dirty-bitmap-migration': 'bool',
             'pbs-masterkey': 'bool',
             'pbs-library-version': 'str',
+            'backup-fleecing': 'bool',
             'backup-max-workers': 'bool' } }
 
 ##
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (7 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC qemu 08/13] PVE backup: add fleecing option Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-29 15:41   ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC manager 10/13] vzdump: handle new 'fleecing' " Fiona Ebner
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

It's a property string, because that avoids having an implicit
"enabled" as part of a 'fleecing-storage' property. And there likely
will be more options in the future, e.g. threshold/limit for the
fleecing image size.
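
For illustration, a minimal sketch of how such a value would be parsed
(assuming the 'backup-fleecing' format registered below):

    use PVE::JSONSchema;

    # property string as it would arrive from the CLI/API, e.g.
    # vzdump ... --fleecing enabled=1,storage=local-zfs
    my $raw = 'enabled=1,storage=local-zfs';

    # parses into a hash like { enabled => 1, storage => 'local-zfs' }
    my $fleecing = PVE::JSONSchema::parse_property_string('backup-fleecing', $raw);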

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 src/PVE/VZDump/Common.pm | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/PVE/VZDump/Common.pm b/src/PVE/VZDump/Common.pm
index 5be87a0..60574e4 100644
--- a/src/PVE/VZDump/Common.pm
+++ b/src/PVE/VZDump/Common.pm
@@ -33,6 +33,7 @@ my $dowhash_to_dow = sub {
 };
 
 our $PROPERTY_STRINGS = {
+    'fleecing' => 'backup-fleecing',
     'performance' => 'backup-performance',
     'prune-backups' => 'prune-backups',
 };
@@ -79,6 +80,24 @@ sub parse_dow {
     return $res;
 };
 
+PVE::JSONSchema::register_format('backup-fleecing', {
+    enabled => {
+	description => "Enable backup fleecing. Cache backup data from blocks where new guest "
+	    ."writes happen on the specified storage instead of copying them directly to the backup "
+	    ."target. This can help guest IO performance and even prevent hangs, at the cost of "
+	    ."requiring more storage space.",
+	type => 'boolean',
+	default => 0,
+	optional => 1,
+	default_key => 1,
+    },
+    storage => get_standard_option('pve-storage-id', {
+	description => "Use this storage to store fleecing images. Default is to use the same "
+	    ."storage as the VM disk itself.",
+	optional => 1,
+    }),
+});
+
 PVE::JSONSchema::register_format('backup-performance', {
     'max-workers' => {
 	description => "Applies to VMs. Allow up to this many IO workers at the same time.",
@@ -262,6 +281,12 @@ my $confdesc = {
 	format => 'backup-performance',
 	optional => 1,
     },
+    fleecing => {
+	type => 'string',
+	description => "Options for backup fleecing (VM only).",
+	format => 'backup-fleecing',
+	optional => 1,
+    },
     lockwait => {
 	type => 'integer',
 	description => "Maximal time to wait for the global lock (minutes).",
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC manager 10/13] vzdump: handle new 'fleecing' property string
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (8 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 11/13] backup: disk info: also keep track of size Fiona Ebner
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 PVE/VZDump.pm | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/PVE/VZDump.pm b/PVE/VZDump.pm
index 4185ed62..bdf48fb2 100644
--- a/PVE/VZDump.pm
+++ b/PVE/VZDump.pm
@@ -130,6 +130,15 @@ my $generate_notes = sub {
     return $notes_template;
 };
 
+my sub parse_fleecing {
+    my ($param) = @_;
+
+    if (defined(my $fleecing = $param->{fleecing})) {
+	return if ref($fleecing) eq 'HASH'; # already parsed
+	$param->{fleecing} = PVE::JSONSchema::parse_property_string('backup-fleecing', $fleecing);
+    }
+}
+
 my sub parse_performance {
     my ($param) = @_;
 
@@ -278,6 +287,7 @@ sub read_vzdump_defaults {
 	} keys %$confdesc_for_defaults
     };
     $parse_prune_backups_maxfiles->($defaults, "defaults in VZDump schema");
+    parse_fleecing($defaults);
     parse_performance($defaults);
 
     my $raw;
@@ -304,6 +314,7 @@ sub read_vzdump_defaults {
 	$res->{mailto} = [ @mailto ];
     }
     $parse_prune_backups_maxfiles->($res, "options in '$fn'");
+    parse_fleecing($res);
     parse_performance($res);
 
     foreach my $key (keys %$defaults) {
@@ -1449,6 +1460,7 @@ sub verify_vzdump_parameters {
 	if defined($param->{'prune-backups'}) && defined($param->{maxfiles});
 
     $parse_prune_backups_maxfiles->($param, 'CLI parameters');
+    parse_fleecing($param);
     parse_performance($param);
 
     if (my $template = $param->{'notes-template'}) {
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC qemu-server 11/13] backup: disk info: also keep track of size
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (9 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC manager 10/13] vzdump: handle new 'fleecing' " Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option Fiona Ebner
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

which will be needed to allocate fleecing images.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 PVE/VZDump/QemuServer.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/PVE/VZDump/QemuServer.pm b/PVE/VZDump/QemuServer.pm
index be7d8e1e..51498dbc 100644
--- a/PVE/VZDump/QemuServer.pm
+++ b/PVE/VZDump/QemuServer.pm
@@ -140,6 +140,7 @@ sub prepare {
 	    path => $path,
 	    volid => $volid,
 	    storeid => $storeid,
+	    size => $size,
 	    format => $format,
 	    virtdev => $ds,
 	    qmdevice => "drive-$ds",
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (10 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 11/13] backup: disk info: also keep track of size Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-29 15:28   ` Fiona Ebner
  2024-01-25 14:41 ` [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing Fiona Ebner
  2024-01-25 16:02 ` [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement " DERUMIER, Alexandre
  13 siblings, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

Management for fleecing images is implemented here. If the fleecing
option is set, for each disk (except EFI disk and TPM state) a new raw
fleecing image is allocated on the configured fleecing storage (same
storage as original disk by default). The disk is attached to QEMU
with the 'size' parameter, because the block node in QEMU has to be
the exact same size and the newly allocated image might be bigger if
the storage has a coarser allocation or rounds up. After backup, the
disks are detached and removed from the storage.

Partially inspired by the existing handling of the TPM state image
during backup.
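
For illustration, for a hypothetical 32 GiB disk 'drive-scsi0' with its
fleecing image allocated on a ZFS storage, the resulting monitor command
would look roughly like this (path, VMID and size are made-up example values):

    drive_add auto "file=/dev/zvol/rpool/data/vm-123-disk-0-fleecing,if=none,id=drive-scsi0-fleecing,format=raw,discard=unmap,size=34359738368"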

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

STILL OPEN (see also the TODOs):
    * Naming for fleecing images and also filtering/prohibiting them
      with other operations. Currently using -fleecing suffix but
      those can conflict with user-created ones.
    * Should fleecing be used if the VM was started specifically for
      backup? In theory it makes sense, because the VM could be resumed
      during backup, but in most cases it won't.

 PVE/VZDump/QemuServer.pm | 140 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 2 deletions(-)

diff --git a/PVE/VZDump/QemuServer.pm b/PVE/VZDump/QemuServer.pm
index 51498dbc..d642ae46 100644
--- a/PVE/VZDump/QemuServer.pm
+++ b/PVE/VZDump/QemuServer.pm
@@ -26,6 +26,7 @@ use PVE::Format qw(render_duration render_bytes);
 
 use PVE::QemuConfig;
 use PVE::QemuServer;
+use PVE::QemuServer::Helpers;
 use PVE::QemuServer::Machine;
 use PVE::QemuServer::Monitor qw(mon_cmd);
 
@@ -525,6 +526,117 @@ sub get_and_check_pbs_encryption_config {
     die "internal error - unhandled case for getting & checking PBS encryption ($keyfile, $master_keyfile)!";
 }
 
+my sub cleanup_fleecing_images {
+    my ($self, $disks) = @_;
+
+    for my $di ($disks->@*) {
+	if (my $volid = $di->{'fleece-volid'}) {
+	    eval { PVE::Storage::vdisk_free($self->{storecfg}, $volid); };
+	    $self->log('warn', "error removing fleecing image '$volid' - $@") if $@;
+	}
+    }
+}
+
+my sub allocate_fleecing_images {
+    my ($self, $disks, $vmid, $fleecing_storeid) = @_;
+
+    # TODO what about left-over images from previous attempts? Just
+    # auto-remove? While unlikely, could conflict with manually created image from user...
+
+    eval {
+	for my $di ($disks->@*) {
+	    next if $di->{virtdev} =~ m/^(?:tpmstate|efidisk)\d$/; # too small to be worth it
+	    if ($di->{type} eq 'block' || $di->{type} eq 'file') {
+		# TODO better named like fleecing-VMID-disk-N (needs storage plugin support...) or
+		# vm-VMID-fleecing-N?
+		# TODO filter/disallow these for qm rescan/qm set/etc. like for -state- volumes
+		my $storage = $fleecing_storeid || $di->{storeid};
+		my $scfg = PVE::Storage::storage_config($self->{storecfg}, $storage);
+		my (undef, $name) = PVE::Storage::parse_volname($self->{storecfg}, $di->{volid});
+		$name =~ s/\.(.*?)$//;
+		# TODO checking for path is only a heuristic...
+		if ($scfg->{path}) {
+		    $name .= '-fleecing.raw';
+		} else {
+		    $name .= '-fleecing';
+		}
+		my $size = PVE::Tools::convert_size($di->{size}, 'b' => 'kb');
+		$di->{'fleece-volid'} = PVE::Storage::vdisk_alloc(
+		    $self->{storecfg}, $storage, $vmid, 'raw', $name, $size);
+	    } else {
+		die "implement me (type '$di->{type}')";
+	    }
+	}
+    };
+    if (my $err = $@) {
+	cleanup_fleecing_images($self, $disks);
+	die $err;
+    }
+}
+
+my sub detach_fleecing_images {
+    my ($disks, $vmid) = @_;
+
+    return if !PVE::QemuServer::Helpers::vm_running_locally($vmid);
+
+    for my $di ($disks->@*) {
+	if (my $volid = $di->{'fleece-volid'}) {
+	    my $devid = "$di->{qmdevice}-fleecing";
+	    $devid =~ s/^drive-//; # re-added by qemu_drivedel()
+	    eval { PVE::QemuServer::qemu_drivedel($vmid, $devid) };
+	}
+    }
+}
+
+my sub attach_fleecing_images {
+    my ($self, $disks, $vmid) = @_;
+
+    # unconditionally try to remove potential left-overs from a previous backup
+    detach_fleecing_images($disks, $vmid);
+
+    my $vollist = [ map { $_->{'fleece-volid'} } grep { $_->{'fleece-volid'} } $disks->@* ];
+    PVE::Storage::activate_volumes($self->{storecfg}, $vollist);
+
+    for my $di ($disks->@*) {
+	if (my $volid = $di->{'fleece-volid'}) {
+	    $self->loginfo("$di->{qmdevice}: attaching fleecing image $volid to QEMU");
+
+	    my $path = PVE::Storage::path($self->{storecfg}, $volid);
+	    my $devid = "$di->{qmdevice}-fleecing";
+	    # Specify size explicitly, to make it work if storage backend rounded up size for
+	    # fleecing image when allocating.
+	    my $drive = "file=$path,if=none,id=$devid,format=raw,discard=unmap,size=$di->{size}";
+	    $drive =~ s/\\/\\\\/g;
+	    my $ret = PVE::QemuServer::Monitor::hmp_cmd($vmid, "drive_add auto \"$drive\"");
+	    die "attaching fleecing image $volid failed - $ret\n" if $ret !~ m/OK/s;
+	}
+    }
+}
+
+my sub check_and_prepare_fleecing {
+    my ($self, $vmid, $fleecing_opts, $disks, $is_template, $qemu_support) = @_;
+
+    # TODO what if the VM was started specifically for backup? The VM could be resumed during
+    # backup, so fleecing can in theory make sense then, but in most cases it won't...
+
+    my $use_fleecing = $fleecing_opts && $fleecing_opts->{enabled} && !$is_template;
+
+    if ($use_fleecing && !defined($qemu_support->{'backup-fleecing'})) {
+	$self->log(
+	    'warn',
+	    "running QEMU version does not support backup fleecing - continuing without",
+	);
+	$use_fleecing = 0;
+    }
+
+    if ($use_fleecing) {
+	allocate_fleecing_images($self, $disks, $vmid, $fleecing_opts->{storage});
+	attach_fleecing_images($self, $disks, $vmid);
+    }
+
+    return $use_fleecing;
+}
+
 sub archive_pbs {
     my ($self, $task, $vmid) = @_;
 
@@ -578,6 +690,7 @@ sub archive_pbs {
 
     # get list early so we die on unkown drive types before doing anything
     my $devlist = _get_task_devlist($task);
+    my $use_fleecing;
 
     $self->enforce_vm_running_for_backup($vmid);
     $self->{qmeventd_fh} = PVE::QemuServer::register_qmeventd_handle($vmid);
@@ -606,6 +719,11 @@ sub archive_pbs {
 
 	$attach_tpmstate_drive->($self, $task, $vmid);
 
+	my $is_template = PVE::QemuConfig->is_template($self->{vmlist}->{$vmid});
+
+	$use_fleecing = check_and_prepare_fleecing(
+	    $self, $vmid, $opts->{fleecing}, $task->{disks}, $is_template, $qemu_support);
+
 	my $fs_frozen = $self->qga_fs_freeze($task, $vmid);
 
 	my $params = {
@@ -617,6 +735,8 @@ sub archive_pbs {
 	    devlist => $devlist,
 	    'config-file' => $conffile,
 	};
+	$params->{fleecing} = JSON::true if $use_fleecing;
+
 	if (defined(my $ns = $scfg->{namespace})) {
 	    $params->{'backup-ns'} = $ns;
 	}
@@ -633,7 +753,6 @@ sub archive_pbs {
 	    $params->{"master-keyfile"} = $master_keyfile if defined($master_keyfile);
 	}
 
-	my $is_template = PVE::QemuConfig->is_template($self->{vmlist}->{$vmid});
 	$params->{'use-dirty-bitmap'} = JSON::true
 	    if $qemu_support->{'pbs-dirty-bitmap'} && !$is_template;
 
@@ -665,6 +784,11 @@ sub archive_pbs {
     }
     $self->restore_vm_power_state($vmid);
 
+    if ($use_fleecing) {
+	detach_fleecing_images($task->{disks}, $vmid);
+	cleanup_fleecing_images($self, $task->{disks});
+    }
+
     die $err if $err;
 }
 
@@ -724,8 +848,10 @@ sub archive_vma {
 	$speed = $opts->{bwlimit}*1024;
     }
 
+    my $is_template = PVE::QemuConfig->is_template($self->{vmlist}->{$vmid});
+
     my $diskcount = scalar(@{$task->{disks}});
-    if (PVE::QemuConfig->is_template($self->{vmlist}->{$vmid}) || !$diskcount) {
+    if ($is_template || !$diskcount) {
 	my @pathlist;
 	foreach my $di (@{$task->{disks}}) {
 	    if ($di->{type} eq 'block' || $di->{type} eq 'file') {
@@ -765,6 +891,7 @@ sub archive_vma {
     }
 
     my $devlist = _get_task_devlist($task);
+    my $use_fleecing;
 
     $self->enforce_vm_running_for_backup($vmid);
     $self->{qmeventd_fh} = PVE::QemuServer::register_qmeventd_handle($vmid);
@@ -784,6 +911,9 @@ sub archive_vma {
 
 	$attach_tpmstate_drive->($self, $task, $vmid);
 
+	$use_fleecing = check_and_prepare_fleecing(
+	    $self, $vmid, $opts->{fleecing}, $task->{disks}, $is_template, $qemu_support);
+
 	my $outfh;
 	if ($opts->{stdout}) {
 	    $outfh = $opts->{stdout};
@@ -812,6 +942,7 @@ sub archive_vma {
 		devlist => $devlist
 	    };
 	    $params->{'firewall-file'} = $firewall if -e $firewall;
+	    $params->{fleecing} = JSON::true if $use_fleecing;
 	    add_backup_performance_options($params, $opts->{performance}, $qemu_support);
 
 	    $qmpclient->queue_cmd($vmid, $backup_cb, 'backup', %$params);
@@ -853,6 +984,11 @@ sub archive_vma {
 
     $self->restore_vm_power_state($vmid);
 
+    if ($use_fleecing) {
+	detach_fleecing_images($task->{disks}, $vmid);
+	cleanup_fleecing_images($self, $task->{disks});
+    }
+
     if ($err) {
 	if ($cpid) {
 	    kill(9, $cpid);
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (11 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option Fiona Ebner
@ 2024-01-25 14:41 ` Fiona Ebner
  2024-01-25 16:13   ` Dietmar Maurer
  2024-01-25 16:02 ` [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement " DERUMIER, Alexandre
  13 siblings, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-01-25 14:41 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 vzdump.adoc | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/vzdump.adoc b/vzdump.adoc
index 24a3e80..eb67141 100644
--- a/vzdump.adoc
+++ b/vzdump.adoc
@@ -136,6 +136,34 @@ not included in backups. For volume mount points you can set the *Backup* option
 to include the mount point in the backup. Device and bind mounts are never
 backed up as their content is managed outside the {pve} storage library.
 
+VM Backup Fleecing
+~~~~~~~~~~~~~~~~~~
+
+WARNING: Backup fleecing is still being worked on (also in upstream QEMU) and is
+currently only a technology preview.
+
+When a backup for a VM is started, QEMU will install a "copy-before-write"
+filter in its block layer. This filter ensures that upon new guest writes, old
+data still needed for the backup is sent to the backup target first. The guest
+write blocks until this operation is finished so guest IO to not-yet-backed-up
+sectors will be limited by the speed of the backup target.
+
+With backup fleecing, such old data is cached in a fleecing image rather than
+sent directly to the backup target. This can help guest IO performance and even
+prevent hangs in certain scenarios, at the cost of requiring more storage space.
+Use e.g. `vzdump 123 --fleecing enabled=1,storage=local-zfs` to enable backup
+fleecing, with fleecing images created on the storage `local-zfs`. If no storage
+is specified, the fleecing image will be created on the same storage as the
+original image.
+
+WARNING: Theoretically, the fleecing image can grow to the same size as the
+original image, e.g. if the guest re-writes a whole disk while the backup is
+busy with another disk.
+
+Parts of the fleecing image that have been backed up will be discarded to try
+and keep the space usage low.
+
+
 Backup File Names
 -----------------
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing
  2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
                   ` (12 preceding siblings ...)
  2024-01-25 14:41 ` [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing Fiona Ebner
@ 2024-01-25 16:02 ` DERUMIER, Alexandre
  13 siblings, 0 replies; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-01-25 16:02 UTC (permalink / raw)
  To: pve-devel

Oh!!! Thank you very much Fiona!!!

This is really the blocking feature for me; I'm still not using PBS because
of this.

I'll try to build a lab for testing as soon as possible
(I'm a bit busy with FOSDEM preparation).

I'll also test VM crash/host crash while a backup is running, to see
how it's handled.


-------- Message initial --------
De: Fiona Ebner <f.ebner@proxmox.com>
Répondre à: Proxmox VE development discussion <pve-
devel@lists.proxmox.com>
À: pve-devel@lists.proxmox.com
Objet: [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs
00/13] fix #4136: implement backup fleecing
Date: 25/01/2024 15:41:36

When a backup for a VM is started, QEMU will install a
"copy-before-write" filter in its block layer. This filter ensures
that upon new guest writes, old data still needed for the backup is
sent to the backup target first. The guest write blocks until this
operation is finished so guest IO to not-yet-backed-up sectors will be
limited by the speed of the backup target.

With backup fleecing, such old data is cached in a fleecing image
rather than sent directly to the backup target. This can help guest IO
performance and even prevent hangs in certain scenarios, at the cost
of requiring more storage space.

With this series it will be possible to enable backup-fleecing via
e.g. `vzdump 123 --fleecing enabled=1,storage=local-zfs` with fleecing
images created on the storage `local-zfs`. If no storage is specified,
the fleecing image will be created on the same storage as the original
image.


Fleecing images are created by qemu-server via pve-storage and
attached to QEMU before the backup starts, and cleaned up after the
backup finished or failed. Currently, just a "-fleecing(.raw)" suffix
is added and there is no special handling yet for e.g. qm rescan/etc..
And previous left-overs are not automatically cleaned up, because
while unlikely, images with this name might've been created by a user
too. Happy to discuss alternatives!

The fleecing image needs to be the exact same size as the source, but
luckily, an explicit size can be specified when attaching a raw image
to QEMU so there are no size issues when using storages that have
coarser allocation/round up.


While initial tests seem fine, bitmap handling needs to be carefully
checked for correctness. More eyeballs can't hurt there.

QEMU patches are for the submodule for better reviewability. There are
unfortunately a few prerequisites which are also still being worked on
upstream. These are:

Fix for qcow2 block status querying when used as a source image [0].
Already reviewed and being pulled.

For being able to discard the fleecing image, addition of a
discard-source parameter[1]. This series was adapted for downstream
and I tried to address the two remaining issues:

1. Permission issue when backup source node is read-only (e.g. TMP
state): Made permissions conditional for when discard-source is set
with a new option for the copy-before-write block driver. Currently,
it's part of QAPI, nicer would be to make it internal-only.

2. Cluster size issue when fleecing image has a larger cluster size
than backup target: Made a workaround by also considering source image
when calculating cluster size for block copy and had to hack
.bdrv_co_get_info implementations for snapshot-access and
copy-before-write. Not super confident and better to wait for an
answer from upstream.

Upstream reports/discussions for these can also be found at [1].


No hard dependencies AFAICS, but of course pve-manager should depend
on both new pve-guest-common and qemu-server to actually be able to
use the option.


[0]: https://lore.kernel.org/qemu-devel/20240116154839.401030-1-f.ebner@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/20240117160737.1057513-1-vsementsov@yandex-team.ru/

qemu:

Fiona Ebner (6):
  backup: factor out gathering device info into helper
  backup: get device info: code cleanup
  block/io: clear BDRV_BLOCK_RECURSE flag after recursing in
    bdrv_co_block_status
  block/{copy-before-write,snapshot-access}: implement bdrv_co_get_info
    driver callback
  block/block-copy: always consider source cluster size too
  PVE backup: add fleecing option

Vladimir Sementsov-Ogievskiy (2):
  block/copy-before-write: create block_copy bitmap in filter node
  qapi: blockdev-backup: add discard-source parameter

 block/backup.c                         |  15 +-
 block/block-copy.c                     |  36 ++--
 block/copy-before-write.c              |  46 ++++-
 block/copy-before-write.h              |   1 +
 block/io.c                             |  10 ++
 block/monitor/block-hmp-cmds.c         |   1 +
 block/replication.c                    |   4 +-
 block/snapshot-access.c                |   7 +
 blockdev.c                             |   2 +-
 include/block/block-copy.h             |   3 +-
 include/block/block_int-global-state.h |   2 +-
 pve-backup.c                           | 234 +++++++++++++++++++------
 qapi/block-core.json                   |  18 +-
 13 files changed, 300 insertions(+), 79 deletions(-)


guest-common:

Fiona Ebner (1):
  vzdump: schema: add fleecing property string

 src/PVE/VZDump/Common.pm | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)


manager:

Fiona Ebner (1):
  vzdump: handle new 'fleecing' property string

 PVE/VZDump.pm | 12 ++++++++++++
 1 file changed, 12 insertions(+)


qemu-server:

Fiona Ebner (2):
  backup: disk info: also keep track of size
  backup: implement fleecing option

 PVE/VZDump/QemuServer.pm | 141 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 139 insertions(+), 2 deletions(-)


docs:

Fiona Ebner (1):
  vzdump: add section about backup fleecing

 vzdump.adoc | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)


Summary over all repositories:
  17 files changed, 504 insertions(+), 0 deletions(-)



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing
  2024-01-25 14:41 ` [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing Fiona Ebner
@ 2024-01-25 16:13   ` Dietmar Maurer
  2024-01-25 16:41     ` DERUMIER, Alexandre
  0 siblings, 1 reply; 31+ messages in thread
From: Dietmar Maurer @ 2024-01-25 16:13 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fiona Ebner

Stupid question: Wouldn't it be much easier to add a simple IO buffer
with limited capacity, implemented inside the Rust backup code?


> +WARNING: Theoretically, the fleecing image can grow to the same size as the
> +original image, e.g. if the guest re-writes a whole disk while the backup is
> +busy with another disk.
_______________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing
  2024-01-25 16:13   ` Dietmar Maurer
@ 2024-01-25 16:41     ` DERUMIER, Alexandre
  2024-01-25 18:18       ` Dietmar Maurer
  0 siblings, 1 reply; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-01-25 16:41 UTC (permalink / raw)
  To: pve-devel, f.ebner

Hi Dietmar !

>>Stupid question: Wouldn't It be much easier to add a simple IO-buffer
>>with limited capacity, implemented inside the RUST backup code?

At work, we are running a backup cluster in a remote location with HDDs,
and a production cluster with super fast NVMe,
and sometimes I have really big write spikes (in GB/s), so it's
impossible for the backup storage or network to handle them without
increasing latency or saturating the link.

So with limited capacity (how much? in memory?), I don't think it
solves the problem. If the buffer is full, the VM write will hang.




> +WARNING: Theoretically, the fleecing image can grow to the same size
> as the
> +original image, e.g. if the guest re-writes a whole disk while the
> backup is
> +busy with another disk.
_______________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing
  2024-01-25 16:41     ` DERUMIER, Alexandre
@ 2024-01-25 18:18       ` Dietmar Maurer
  2024-01-26  8:39         ` Fiona Ebner
  0 siblings, 1 reply; 31+ messages in thread
From: Dietmar Maurer @ 2024-01-25 18:18 UTC (permalink / raw)
  To: Proxmox VE development discussion, DERUMIER, Alexandre, f.ebner

> >>Stupid question: Wouldn't It be much easier to add a simple IO-buffer
> >>with limited capacity, implemented inside the RUST backup code?
> 
> At work, we are running a backup cluster on remote location with hdd , 
> and a production cluster with super fast nvme,
> and sometimes I have really big write spikes (in GB/s), so it's
> impossible for the backup storage or network to handle it without
> increase latency or saturate link.
>  
> So with limited capacity (how much ? in memory ?), I don't think it
> solve the problem. If the buffer is full, the vm write will hang.

Ok, I can see the problem...




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing
  2024-01-25 18:18       ` Dietmar Maurer
@ 2024-01-26  8:39         ` Fiona Ebner
  0 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-26  8:39 UTC (permalink / raw)
  To: Dietmar Maurer, Proxmox VE development discussion, DERUMIER, Alexandre

Am 25.01.24 um 19:18 schrieb Dietmar Maurer:
>>>> Stupid question: Wouldn't It be much easier to add a simple IO-buffer
>>>> with limited capacity, implemented inside the RUST backup code?
>>
>> At work, we are running a backup cluster on remote location with hdd , 
>> and a production cluster with super fast nvme,
>> and sometimes I have really big write spikes (in GB/s), so it's
>> impossible for the backup storage or network to handle it without
>> increase latency or saturate link.
>>  
>> So with limited capacity (how much ? in memory ?), I don't think it
>> solve the problem. If the buffer is full, the vm write will hang.
> 
> Ok, I can see the problem...

Yes, exactly. Sure, it could be done with an in-memory buffer. And while
it would help some people, it would help in far fewer scenarios
compared to fleecing, because RAM is almost always more expensive/more
limited than storage space.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback
  2024-01-25 14:41 ` [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback Fiona Ebner
@ 2024-01-29 14:35   ` Fiona Ebner
  0 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-29 14:35 UTC (permalink / raw)
  To: pve-devel

Am 25.01.24 um 15:41 schrieb Fiona Ebner:
> In preparation to fix an issue for backup fleecing where discarding
> the source would lead to an assertion failure when the fleecing image
> has larger granularity than the backup target.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> Still need to wait on a response from upstream. For now this hack, so
> that the RFC as a whole doesn't have to wait.
> 

From [0]:

> 2. if we need another logic for block_copy_calculate_cluster_size() it should be an option. May be 
> explicit "copy-cluster-size" or "granularity" option for CBW driver and for backup 
> job. And we'll just check that given cluster-size is power of two >= target_size.

I implemented a new option based on this suggestion and it also resolves
the issue. So the hacks will be gone in the next version :) Will also
try to upstream the option after v3 of the patch series from [0] is out.

[0]: https://lists.nongnu.org/archive/html/qemu-devel/2024-01/msg05131.html




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option
  2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option Fiona Ebner
@ 2024-01-29 15:28   ` Fiona Ebner
  0 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-29 15:28 UTC (permalink / raw)
  To: pve-devel

Am 25.01.24 um 15:41 schrieb Fiona Ebner:
> Management for fleecing images is implemented here. If the fleecing
> option is set, for each disk (except EFI disk and TPM state) a new raw
> fleecing image is allocated on the configured fleecing storage (same
> storage as original disk by default). The disk is attached to QEMU
> with the 'size' parameter, because the block node in QEMU has to be
> the exact same size and the newly allocated image might be bigger if
> the storage has a coarser allocation or rounded up. After backup, the
> disks are detached and removed from the storage.
> 
> Partially inspired by the existing handling of the TPM state image
> during backup.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> STILL OPEN (see also the TODOs):
>     * Naming for fleecing images and also filtering/prohibiting them
>       with other operations. Currently using -fleecing suffix but
>       those can conflict with user-created ones.

With the current draft, there is another issue when the VM has
A:vm-123-disk-0 and B:vm-123-disk-0, because the names for the fleecing
images will clash.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-01-25 14:41 ` [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string Fiona Ebner
@ 2024-01-29 15:41   ` Fiona Ebner
  2024-01-30 14:03     ` DERUMIER, Alexandre
  2024-02-01 12:39     ` DERUMIER, Alexandre
  0 siblings, 2 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-01-29 15:41 UTC (permalink / raw)
  To: pve-devel

Am 25.01.24 um 15:41 schrieb Fiona Ebner:
> +    storage => get_standard_option('pve-storage-id', {
> +	description => "Use this storage to storage fleecing images. Default is to use the same "
> +	    ."storage as the VM disk itself.",
> +	optional => 1,
> +    }),
> +});
> +

LVM and non-sparse ZFS need enough space for a copy of the full disk
up-front, so they are not suitable as fleecing storages in many cases. iSCSI
doesn't allow disk allocation. Should such storages be outright
forbidden as fleecing storages or should it just be documented? Should
the setting rather be VM-specific than backup job-specific? These issues
mostly defeat the purpose of the default here.

IIRC, older versions of NFS lack the ability to discard. While not quite
as bad as the above, it's still far from ideal. Might also be worth
trying to detect? Will add something to the docs in any case.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-01-29 15:41   ` Fiona Ebner
@ 2024-01-30 14:03     ` DERUMIER, Alexandre
  2024-02-01  8:28       ` Fiona Ebner
  2024-02-01 12:39     ` DERUMIER, Alexandre
  1 sibling, 1 reply; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-01-30 14:03 UTC (permalink / raw)
  To: pve-devel



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-01-30 14:03     ` DERUMIER, Alexandre
@ 2024-02-01  8:28       ` Fiona Ebner
  0 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-02-01  8:28 UTC (permalink / raw)
  To: Proxmox VE development discussion, DERUMIER, Alexandre

Did the contents of this mail get eaten somehow?

Am 30.01.24 um 15:03 schrieb DERUMIER, Alexandre:
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> 




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-01-29 15:41   ` Fiona Ebner
  2024-01-30 14:03     ` DERUMIER, Alexandre
@ 2024-02-01 12:39     ` DERUMIER, Alexandre
  2024-02-01 13:11       ` Fiona Ebner
  1 sibling, 1 reply; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-02-01 12:39 UTC (permalink / raw)
  To: pve-devel

>>LVM and non-sparse ZFS need enough space for a copy for the full disk
>>up-front, so are not suitable as fleecing storages in many cases.

Can't we force sparse for these fleecing volumes, even if the storage
doesn't have sparse enabled? (I can understand that it could make sense
for users to have non-sparse for production for performance or
allocation reservation, but for a fleecing image, it should be
exceptional to rewrite a full image.)

>>ISCSI doesn't allow disk allocation. Should such storages be outright
>>forbidden as fleecing storages or should it just be documented?

I think it should be forbidden if it's incompatible.

>>Should the setting rather be VM-specific than backup job-specific?
>>These issues
>>mostly defeat the purpose of the default here.

Can't we forbid it via storage plugin features? { fleecing => 1 }?

>>IIRC older version of NFS lack the ability to discard. While not
>>quite
>>as bad as the above, it's still far from ideal. Might also be worth
>>trying to detect? Will add something to the docs in any case.

I have never seen working discard with NFS. I think (never tested) it's
possible with 4.2, but 4.2 is really new on NAS appliances (NetApp, ...).
So I think that 90% of users don't have working discard with NFS.

Is it a problem if the VM's main storage supports discard, but the
fleecing storage doesn't? (I haven't looked yet at how exactly fleecing
works.)

If it's a problem, I think we should forbid using a fleecing storage
that doesn't support discard if the VM has discard on one disk.

-------- Message initial --------
De: Fiona Ebner <f.ebner@proxmox.com>
Répondre à: Proxmox VE development discussion <pve-
devel@lists.proxmox.com>
À: pve-devel@lists.proxmox.com
Objet: Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add
fleecing property string
Date: 29/01/2024 16:41:06

Am 25.01.24 um 15:41 schrieb Fiona Ebner:
> +    storage => get_standard_option('pve-storage-id', {
> + description => "Use this storage to storage fleecing images.
> Default is to use the same "
> +     ."storage as the VM disk itself.",
> + optional => 1,
> +    }),
> +});
> +

LVM and non-sparse ZFS need enough space for a copy for the full disk
up-front, so are not suitable as fleecing storages in many cases. ISCSI
doesn't allow disk allocation. Should such storages be outright
forbidden as fleecing storages or should it just be documented? Should
the setting rather be VM-specific than backup job-specific? These
issues
mostly defeat the purpose of the default here.

IIRC older version of NFS lack the ability to discard. While not quite
as bad as the above, it's still far from ideal. Might also be worth
trying to detect? Will add something to the docs in any case.


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 12:39     ` DERUMIER, Alexandre
@ 2024-02-01 13:11       ` Fiona Ebner
  2024-02-01 13:20         ` DERUMIER, Alexandre
  0 siblings, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-02-01 13:11 UTC (permalink / raw)
  To: Proxmox VE development discussion, DERUMIER, Alexandre

Am 01.02.24 um 13:39 schrieb DERUMIER, Alexandre:
>>> LVM and non-sparse ZFS need enough space for a copy for the full disk
>>> up-front, so are not suitable as fleecing storages in many cases.
> 
> can't we force sparse for theses fleecing volumes, even if the storage
> don't have sparse enabled ? (I can understand that it could make sense
> for user to have non sparse for production for performance or
> allocation reservation, but for fleecing image, it should be
> exceptionnal to rewrite a full image)
> 

For ZFS, we could always allocate fleecing images sparsely, but that would
require a change to the storage API, as you can't tell vdisk_alloc() to
do that right now. There could also be a new helper altogether,
allocate_fleecing_image(); then the storage plugin itself could decide
what the best settings are.

>>> Should the setting rather be VM-specific than backup job-specific?
>>> These issues
>>> mostly defeat the purpose of the default here.
> 
> can't we forbidden it in storage plugin features ? { fleecing => 1} ?
> 

There is no feature list for storage plugins right now, just
volume_has_feature(), and that doesn't help if you don't already have a volume.

There is storage_can_replicate() and we could either switch to a common
helper for storage features and deprecate the old one, or simply add a
storage_supports_fleecing() helper.
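
A very rough sketch of how such a helper could look (purely illustrative;
the per-plugin callback shown here is an assumption and does not exist yet):

    # hypothetical helper in PVE/Storage.pm
    sub storage_supports_fleecing {
        my ($cfg, $storeid) = @_;

        my $scfg = storage_config($cfg, $storeid);
        my $plugin = PVE::Storage::Plugin->lookup($scfg->{type});

        # assumed new plugin method; plugins that cannot allocate suitable
        # images (e.g. iSCSI) would simply not implement it
        return $plugin->can('supports_fleecing')
            ? $plugin->supports_fleecing($scfg) : 0;
    }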

But the question remains whether the setting should be VM-specific or
job-wide. Most flexible would be both, but I'd rather not overcomplicate
things. Maybe my idea for the default with "use same storage for
fleecing" is not actually a good one and having a dedicated storage for
fleecing is better. Then it needs to be a conscious decision.

>>> IIRC older version of NFS lack the ability to discard. While not
>>> quite
>>> as bad as the above, it's still far from ideal. Might also be worth
>>> trying to detect? Will add something to the docs in any case.
> 
> I never have seen working discard with nfs, I think (never tested) it's
> possible with 4.2, but 4.2 is really new on nas appliance (netapp,...).
> So I think than 90% of user don't have working discard with nfs.
> 

With NFS 4.2 discard works nicely even with raw format. But you might be
right about most users not having new enough versions. We discussed this
off-list too and an improvement would be to use qcow2, so the discards
could happen at least internally. The qcow2 image couldn't free the allocated
blocks, but it could re-use already allocated ones.

> Is it a problem if the vm main storage support discard , but not
> fleecing storage ? (I don't have looked yet how exactly fleecing is
> working)
> 

It doesn't matter if the main storage does or not. It only depends on
the fleecing storage.

> If it's a problem, I think we should forbind to use a fleecing storage
> not supporting discard, if the vm have discard on 1 disk.
> 

The problem is that the space usage can be very high. It's not a
fundamental problem; you can still use fleecing on such storages if you
have enough space.

There are already ideas to have a limit setting, monitor the space usage
and abort when the limit is hit, but nothing concrete yet.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 13:11       ` Fiona Ebner
@ 2024-02-01 13:20         ` DERUMIER, Alexandre
  2024-02-01 13:27           ` Fiona Ebner
  2024-02-01 13:30           ` Fiona Ebner
  0 siblings, 2 replies; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-02-01 13:20 UTC (permalink / raw)
  To: pve-devel, f.ebner

-------- Message initial --------
De: Fiona Ebner <f.ebner@proxmox.com>
À: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
"DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
Objet: Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add
fleecing property string
Date: 01/02/2024 14:11:20

>>But the question remains if the setting should be VM-specific or
>>job-wide. Most flexible would be both, but I'd rather not
>>overcomplicate
>>things. Maybe my idea for the default with "use same storage for
>>fleecing" is not actually a good one and having a dedicated storage
>>for
>>fleecing is better. Then it needs to be a conscious decision.

Instead, couldn't we define it in the PBS storage?
At the VM level, we can't know which storage will be used. (One could be
fast without needing fleecing, another could be slow and need fleecing.)

storage.cfg:

pbs  : mypbs
       fleecing-storage  .....

And allow overriding it at the job level?






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 13:20         ` DERUMIER, Alexandre
@ 2024-02-01 13:27           ` Fiona Ebner
  2024-02-01 21:33             ` DERUMIER, Alexandre
  2024-02-01 13:30           ` Fiona Ebner
  1 sibling, 1 reply; 31+ messages in thread
From: Fiona Ebner @ 2024-02-01 13:27 UTC (permalink / raw)
  To: DERUMIER, Alexandre, pve-devel

Am 01.02.24 um 14:20 schrieb DERUMIER, Alexandre:
> -------- Message initial --------
> De: Fiona Ebner <f.ebner@proxmox.com>
> À: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
> "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
> Objet: Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add
> fleecing property string
> Date: 01/02/2024 14:11:20
> 
>>> But the question remains if the setting should be VM-specific or
>>> job-wide. Most flexible would be both, but I'd rather not
>>> overcomplicate
>>> things. Maybe my idea for the default with "use same storage for
>>> fleecing" is not actually a good one and having a dedicated storage
>>> for
>>> fleecing is better. Then it needs to be a conscious decision.
> 
> Instead, couldn't we defined it in pbs storage ?   
> at vm level, we can't known which storage will be used. (one could be
> fast without need of fleecing, another could be slow and need fleecing)
> 
> storage.cfg:
> 
> pbs  : mypbs
>        fleecing-storage  .....
> 

Backup fleecing is not limited to PBS and it doesn't have anything to
do with PBS implementation-wise, so this is not the right place for the
setting IMHO.

> An allow to override at job level  ?

A job-level option alone is already more flexible than defining it as
part of the PBS-configuration.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 13:20         ` DERUMIER, Alexandre
  2024-02-01 13:27           ` Fiona Ebner
@ 2024-02-01 13:30           ` Fiona Ebner
  1 sibling, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-02-01 13:30 UTC (permalink / raw)
  To: DERUMIER, Alexandre, pve-devel

Am 01.02.24 um 14:20 schrieb DERUMIER, Alexandre:
> at vm level, we can't known which storage will be used. (one could be
> fast without need of fleecing, another could be slow and need fleecing)
> 

To clarify, with my "VM-specific" proposal I mean: The backup job
decides whether to use fleecing or not. The VM decides which storage to
use for fleecing.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 13:27           ` Fiona Ebner
@ 2024-02-01 21:33             ` DERUMIER, Alexandre
  2024-02-02  8:30               ` Fiona Ebner
  0 siblings, 1 reply; 31+ messages in thread
From: DERUMIER, Alexandre @ 2024-02-01 21:33 UTC (permalink / raw)
  To: pve-devel, f.ebner


> 
> storage.cfg:
> 
> pbs  : mypbs
>        fleecing-storage  .....
> 

>>Backup fleecing is not limited to PBS and it's doesn't have anything
>>to
>>do with PBS implementation-wise, so this is not the right place for
>>the
>>setting IMHO.

mmm, good point. 


> An allow to override at job level  ?
>>
>>A job-level option alone is already more flexible than defining it as
part of the PBS-configuration.


>>To clarify, with my "VM-specific" proposal I mean: The backup job
>>decides whether to use fleecing or not. The VM decides which storage
>>to
>>use for fleecing.


Personally, for my usage, I prefer to define the fleecing storage in the
backup job (as we only define 1 backup storage in 1 job, and my problem
to solve is the slow backup storage).

But maybe, indeed, some users would like to choose "use same storage
as disk", or choose a specific fleecing storage for a specific VM.
So, maybe add an extra override option in VM options or directly on
the disk? I really don't know :)

Maybe try a simple option on the backup job first, then add extra options
at the VM level if users want it?






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string
  2024-02-01 21:33             ` DERUMIER, Alexandre
@ 2024-02-02  8:30               ` Fiona Ebner
  0 siblings, 0 replies; 31+ messages in thread
From: Fiona Ebner @ 2024-02-02  8:30 UTC (permalink / raw)
  To: DERUMIER, Alexandre, pve-devel

Am 01.02.24 um 22:33 schrieb DERUMIER, Alexandre:
> 
>>> To clarify, with my "VM-specific" proposal I mean: The backup job
>>> decides whether to use fleecing or not. The VM decides which storage
>>> to
>>> use for fleecing.
> 
> 
> Personnaly, for my usage, I prefer to define fleecing storage in backup
> job. (as we only define 1 backup storage in 1 job, and my problem to
> solve is the slow backup storage).
> 
> 
> 
> But Maybe, indeed, some users would like to choose "use same storage
> than disk" , or choose a specific fleecing storage for a specific vm.
> So, maybe add an extra override option in vm options or directly on
> disk ? I really don't known :)  
> 
> Maybe try a simple option on backup job first, then add extra options
> at vm level if users want it ?
> 

Yes, I think I'll go that route for now. And require users to choose a
specific storage instead of having a default. A default can still be
added later. "Use same storage as disk" also has the issue of
duplicating writes to the same storage with copy-before-write, which is
not the best for performance either, especially for NFS/CIFS.




^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-02-02  8:31 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-25 14:41 [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement backup fleecing Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [PATCH qemu 01/13] backup: factor out gathering device info into helper Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [PATCH qemu 02/13] backup: get device info: code cleanup Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [PATCH qemu 03/13] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC qemu 04/13] block/copy-before-write: create block_copy bitmap in filter node Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC qemu 05/13] qapi: blockdev-backup: add discard-source parameter Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [HACK qemu 06/13] block/{copy-before-write, snapshot-access}: implement bdrv_co_get_info driver callback Fiona Ebner
2024-01-29 14:35   ` Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [HACK qemu 07/13] block/block-copy: always consider source cluster size too Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC qemu 08/13] PVE backup: add fleecing option Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC guest-common 09/13] vzdump: schema: add fleecing property string Fiona Ebner
2024-01-29 15:41   ` Fiona Ebner
2024-01-30 14:03     ` DERUMIER, Alexandre
2024-02-01  8:28       ` Fiona Ebner
2024-02-01 12:39     ` DERUMIER, Alexandre
2024-02-01 13:11       ` Fiona Ebner
2024-02-01 13:20         ` DERUMIER, Alexandre
2024-02-01 13:27           ` Fiona Ebner
2024-02-01 21:33             ` DERUMIER, Alexandre
2024-02-02  8:30               ` Fiona Ebner
2024-02-01 13:30           ` Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC manager 10/13] vzdump: handle new 'fleecing' " Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 11/13] backup: disk info: also keep track of size Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC qemu-server 12/13] backup: implement fleecing option Fiona Ebner
2024-01-29 15:28   ` Fiona Ebner
2024-01-25 14:41 ` [pve-devel] [RFC docs 13/13] vzdump: add section about backup fleecing Fiona Ebner
2024-01-25 16:13   ` Dietmar Maurer
2024-01-25 16:41     ` DERUMIER, Alexandre
2024-01-25 18:18       ` Dietmar Maurer
2024-01-26  8:39         ` Fiona Ebner
2024-01-25 16:02 ` [pve-devel] [RFC qemu/guest-common/manager/qemu-server/docs 00/13] fix #4136: implement " DERUMIER, Alexandre
