From: Wolfgang Bumiller
To: Fiona Ebner
Cc: pve-devel@lists.proxmox.com
Date: Mon, 8 Apr 2024 14:45:44 +0200
Subject: Re: [pve-devel] [PATCH qemu v2 07/21] PVE backup: add fleecing option
In-Reply-To: <20240315102502.84163-8-f.ebner@proxmox.com>
References: <20240315102502.84163-1-f.ebner@proxmox.com> <20240315102502.84163-8-f.ebner@proxmox.com>

On Fri, Mar 15, 2024 at 11:24:48AM +0100, Fiona Ebner wrote:
> When a fleecing option is given, it is expected that each device has
> a corresponding "-fleecing" block device already attached, except for
> EFI disk and TPM state, where fleecing is never used.
>
> The following graph was adapted from [0], which also contains more
> details about fleecing.
>
> [guest]
>    |
>    | root
>    v                 file
> [copy-before-write]<------[snapshot-access]
>    |           |
>    | file      | target
>    v           v
> [source]    [fleecing]
>
> For fleecing, a copy-before-write filter is inserted on top of the
> source node, as well as a snapshot-access node pointing to the filter
> node, which allows reading the consistent state of the image at the
> time it was inserted. New guest writes are passed through the
> copy-before-write filter, which will first copy over old data to the
> fleecing image in case that old data is still needed by the
> snapshot-access node.
>
> The backup process will sequentially read from the snapshot access,
> which has a bitmap and knows whether to read from the original image
> or the fleecing image to get the "snapshot" state, i.e. data from the
> source image at the time when the copy-before-write filter was
> inserted. After reading, the copied sections are discarded from the
> fleecing image to reduce space usage.
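
To make the two paths described above concrete, here is a minimal
conceptual model -- plain C invented purely for illustration, not the
QEMU block-layer API; the per-cluster "copied" array stands in for the
real dirty bitmap:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Simplified model: one flag per cluster instead of a real bitmap. */
    typedef struct FleecingModel {
        uint8_t *source;       /* live image, changes under guest writes */
        uint8_t *fleecing;     /* receives old data displaced by writes */
        bool *copied;          /* old data already preserved for this cluster? */
        uint64_t cluster_size;
    } FleecingModel;

    /* Guest write path: copy-before-write preserves the old data before
     * letting the new write through to the source image. */
    static void guest_write(FleecingModel *m, uint64_t cluster, const uint8_t *data)
    {
        uint64_t off = cluster * m->cluster_size;
        if (!m->copied[cluster]) {
            memcpy(m->fleecing + off, m->source + off, m->cluster_size);
            m->copied[cluster] = true;
        }
        memcpy(m->source + off, data, m->cluster_size);
    }

    /* Backup read path: snapshot-access serves the state from the moment
     * the filter was inserted, from whichever image still holds it. */
    static const uint8_t *snapshot_read(FleecingModel *m, uint64_t cluster)
    {
        uint64_t off = cluster * m->cluster_size;
        return m->copied[cluster] ? m->fleecing + off : m->source + off;
    }

Once the backup has read a cluster via the snapshot-access node, the
real implementation can discard it from the fleecing image, which is
what the cluster-size discussion below is about.
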
>
> All of this can be restricted by an initial dirty bitmap to parts of
> the source image that are required for an incremental backup.
>
> For discard to work, it is necessary that the fleecing image does not
> have a larger cluster size than the backup job granularity. Since
> querying that size does not always work (e.g. for RBD with krbd, the
> cluster size is not reported), a minimum of 4 MiB is used. A job with
> a PBS target already has at least this granularity, so it's only
> relevant for other targets, i.e. edge cases where this minimum is not
> enough should be very rare in practice. If ever necessary in the
> future, a passed-in value for the backup QMP command could still be
> added as an override.
>
> Additionally, the cbw-timeout and on-cbw-error=break-snapshot options
> are set when installing the copy-before-write filter and
> snapshot-access. When an error or timeout occurs, the problematic (and
> each subsequent) snapshot operation will fail and thus cancel the
> backup instead of breaking the guest write.
>
> Note that job_id cannot be inferred from the snapshot-access bs
> because it has no parent, so just pass the one from the original bs.
>
> [0]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg876056.html
>
> Signed-off-by: Fiona Ebner
> ---
>
> Changes in v2:
>     * specify minimum cluster size for backup job
>
>  block/monitor/block-hmp-cmds.c |   1 +
>  pve-backup.c                   | 143 ++++++++++++++++++++++++++++++++-
>  qapi/block-core.json           |   8 +-
>  3 files changed, 148 insertions(+), 4 deletions(-)
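
As an aside on the cluster-size rule from the commit message: the
clamping amounts to MAX(4 MiB, reported cluster size). A standalone
sketch of just that rule -- helper name and shape invented here for
illustration, not part of the patch:

    #include <stdint.h>

    #define FLEECING_MIN_CLUSTER_SIZE (4 * 1024 * 1024) /* 4 MiB floor */

    /* reported <= 0 models bdrv_get_info() not reporting a cluster size,
     * as happens for RBD via krbd. */
    static uint64_t backup_job_min_cluster_size(int64_t reported)
    {
        if (reported > (int64_t)FLEECING_MIN_CLUSTER_SIZE) {
            return (uint64_t)reported;
        }
        return FLEECING_MIN_CLUSTER_SIZE;
    }

So a fleecing image with 64 KiB clusters still gets the 4 MiB floor
(covering the krbd case where nothing is reported), while one with
8 MiB clusters raises the job granularity to 8 MiB so that discards
can free whole clusters.
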
>
> diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
> index 6efe28cef5..ca29cc4281 100644
> --- a/block/monitor/block-hmp-cmds.c
> +++ b/block/monitor/block-hmp-cmds.c
> @@ -1064,6 +1064,7 @@ void coroutine_fn hmp_backup(Monitor *mon, const QDict *qdict)
>          NULL, NULL,
>          devlist, qdict_haskey(qdict, "speed"), speed,
>          false, 0, // BackupPerf max-workers
> +        false, false, // fleecing
>          &error);
>
>      hmp_handle_error(mon, error);
> diff --git a/pve-backup.c b/pve-backup.c
> index e6b17b797e..00aaff6509 100644
> --- a/pve-backup.c
> +++ b/pve-backup.c
> @@ -7,8 +7,10 @@
>  #include "sysemu/blockdev.h"
>  #include "block/block_int-global-state.h"
>  #include "block/blockjob.h"
> +#include "block/copy-before-write.h"
>  #include "block/dirty-bitmap.h"
>  #include "qapi/qapi-commands-block.h"
> +#include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qerror.h"
>  #include "qemu/cutils.h"
>
> @@ -80,8 +82,15 @@ static void pvebackup_init(void)
>  // initialize PVEBackupState at startup
>  opts_init(pvebackup_init);
>
> +typedef struct PVEBackupFleecingInfo {
> +    BlockDriverState *bs;
> +    BlockDriverState *cbw;
> +    BlockDriverState *snapshot_access;
> +} PVEBackupFleecingInfo;
> +
>  typedef struct PVEBackupDevInfo {
>      BlockDriverState *bs;
> +    PVEBackupFleecingInfo fleecing;
>      size_t size;
>      uint64_t block_size;
>      uint8_t dev_id;
> @@ -361,6 +370,25 @@ static void pvebackup_complete_cb(void *opaque, int ret)
>      PVEBackupDevInfo *di = opaque;
>      di->completed_ret = ret;
>
> +    /*
> +     * Handle block-graph specific cleanup (for fleecing) outside of the coroutine, because the
> +     * work won't be done as a coroutine anyway:
> +     * - For snapshot_access, this allows doing bdrv_unref() directly. Doing it via
> +     *   bdrv_co_unref() would just spawn a BH calling bdrv_unref().
> +     * - For cbw, draining would need to spawn a BH.
> +     *
> +     * Note that the AioContext lock is already acquired by our caller, i.e.
> +     * job_finalize_single_locked().
> +     */
> +    if (di->fleecing.snapshot_access) {
> +        bdrv_unref(di->fleecing.snapshot_access);
> +        di->fleecing.snapshot_access = NULL;
> +    }
> +    if (di->fleecing.cbw) {
> +        bdrv_cbw_drop(di->fleecing.cbw);
> +        di->fleecing.cbw = NULL;
> +    }
> +
>      /*
>       * Schedule stream cleanup in async coroutine. close_image and finish might
>       * take a while, so we can't block on them here. This way it also doesn't
> @@ -521,9 +549,82 @@ static void create_backup_jobs_bh(void *opaque) {
>
>          bdrv_drained_begin(di->bs);
>
> +        BackupPerf perf = (BackupPerf){ .max_workers = backup_state.perf.max_workers };
> +
> +        BlockDriverState *source_bs = di->bs;
> +        bool discard_source = false;
> +        const char *job_id = bdrv_get_device_name(di->bs);
> +        if (di->fleecing.bs) {
> +            QDict *cbw_opts = qdict_new();
> +            qdict_put_str(cbw_opts, "driver", "copy-before-write");
> +            qdict_put_str(cbw_opts, "file", bdrv_get_node_name(di->bs));
> +            qdict_put_str(cbw_opts, "target", bdrv_get_node_name(di->fleecing.bs));
> +
> +            if (di->bitmap) {
> +                /*
> +                 * Only guest writes to parts relevant for the backup need to be intercepted with
> +                 * old data being copied to the fleecing image.
> +                 */
> +                qdict_put_str(cbw_opts, "bitmap.node", bdrv_get_node_name(di->bs));
> +                qdict_put_str(cbw_opts, "bitmap.name", bdrv_dirty_bitmap_name(di->bitmap));
> +            }
> +            /*
> +             * Fleecing storage is supposed to be fast and it's better to break the backup than
> +             * guest writes. Certain guest drivers like VirtIO-win have a 60 second timeout by
> +             * default, so abort a bit before that.
> +             */
> +            qdict_put_str(cbw_opts, "on-cbw-error", "break-snapshot");
> +            qdict_put_int(cbw_opts, "cbw-timeout", 45);
> +
> +            di->fleecing.cbw = bdrv_insert_node(di->bs, cbw_opts, BDRV_O_RDWR, &local_err);
> +
> +            if (!di->fleecing.cbw) {
> +                error_setg(errp, "appending cbw node for fleecing failed: %s",
> +                           local_err ? error_get_pretty(local_err) : "unknown error");
> +                break;
> +            }
> +
> +            QDict *snapshot_access_opts = qdict_new();
> +            qdict_put_str(snapshot_access_opts, "driver", "snapshot-access");
> +            qdict_put_str(snapshot_access_opts, "file", bdrv_get_node_name(di->fleecing.cbw));
> +
> +            /*
> +             * Holding the AioContext lock here would cause a deadlock, because bdrv_open_driver()
> +             * will acquire it a second time. But it's allowed to be held exactly once when
> +             * polling and that happens when the bdrv_refresh_total_sectors() call is made there.
> +             */
> +            aio_context_release(aio_context);
> +            di->fleecing.snapshot_access =
> +                bdrv_open(NULL, NULL, snapshot_access_opts, BDRV_O_RDWR | BDRV_O_UNMAP, &local_err);
> +            aio_context_acquire(aio_context);
> +            if (!di->fleecing.snapshot_access) {
> +                error_setg(errp, "setting up snapshot access for fleecing failed: %s",
> +                           local_err ? error_get_pretty(local_err) : "unknown error");
> +                break;
> +            }
> +            source_bs = di->fleecing.snapshot_access;
> +            discard_source = true;
> +
> +            /*
> +             * bdrv_get_info() just returns 0 (= doesn't matter) for RBD when using krbd. But
> +             * discard on the fleecing image won't work if the backup job's granularity is less
> +             * than the RBD object size (default 4 MiB), so it does matter. Always use at least
> +             * 4 MiB. With a PBS target, the backup job granularity would already be at least
> +             * this much.
> +             */
> +            perf.min_cluster_size = 4 * 1024 * 1024;
> +            /*
> +             * For discard to work, cluster size for the backup job must be at least the same as
> +             * for the fleecing image.
> +             */
> +            BlockDriverInfo bdi;
> +            if (bdrv_get_info(di->fleecing.bs, &bdi) >= 0) {
> +                perf.min_cluster_size = MAX(perf.min_cluster_size, bdi.cluster_size);
> +            }
> +        }
> +
>          BlockJob *job = backup_job_create(
> -            NULL, di->bs, di->target, backup_state.speed, sync_mode, di->bitmap,
> -            bitmap_mode, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
> +            job_id, source_bs, di->target, backup_state.speed, sync_mode, di->bitmap,
> +            bitmap_mode, false, discard_source, NULL, &perf, BLOCKDEV_ON_ERROR_REPORT,
>              BLOCKDEV_ON_ERROR_REPORT, JOB_DEFAULT, pvebackup_complete_cb, di, backup_state.txn,
>              &local_err);
>
> @@ -581,6 +682,14 @@ static void create_backup_jobs_bh(void *opaque) {
>      aio_co_enter(data->ctx, data->co);
>  }
>
> +/*
> + * EFI disk and TPM state are small and it's just not worth setting up fleecing for them.
> + */
> +static bool device_uses_fleecing(const char *device_id)

Do we really want this? IMO we already have enough code trying to
distinguish "real" disks from efidisks and tpmstate files.
AFAICT we do check whether the hmp command to *create* the fleecing
drives actually works, so... (see below)

> +{
> +    return strncmp(device_id, "drive-efidisk", 13) && strncmp(device_id, "drive-tpmstate", 14);
> +}
> +
>  /*
>   * Returns a list of device infos, which needs to be freed by the caller. In
>   * case of an error, errp will be set, but the returned value might still be a
> @@ -588,6 +697,7 @@ static void create_backup_jobs_bh(void *opaque) {
>   */
>  static GList coroutine_fn *get_device_info(
>      const char *devlist,
> +    bool fleecing,
>      Error **errp)
>  {
>      gchar **devs = NULL;
> @@ -611,6 +721,31 @@ static GList coroutine_fn *get_device_info(
>          }
>          PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
>          di->bs = bs;
> +
> +        if (fleecing && device_uses_fleecing(*d)) {
> +            g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
> +            BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
> +            if (!fleecing_blk) {

...so instead of this, we could just treat the absence of a fleecing
BlockBackend *not* as an error, but as deliberate?
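
Rough sketch of that alternative, purely to illustrate the shape (not
a tested change; the medium and size checks would stay as in the
patch):

    if (fleecing) {
        g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
        BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
        if (fleecing_blk) {
            /* A fleecing drive was attached for this device, so use it. */
            BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
            if (!bdrv_co_is_inserted(fleecing_bs)) {
                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
                goto err;
            }
            /* size-mismatch check as in the patch ... */
            di->fleecing.bs = fleecing_bs;
        }
        /* Its absence simply means fleecing was deliberately not set up
         * for this device (e.g. EFI disk/TPM state), so the
         * device_uses_fleecing() prefix check would no longer be needed. */
    }
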
> +                error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> +                          "Device '%s' not found", fleecing_devid);
> +                goto err;
> +            }
> +            BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
> +            if (!bdrv_co_is_inserted(fleecing_bs)) {
> +                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
> +                goto err;
> +            }
> +            /*
> +             * The fleecing image needs to be the same size to act as a cbw target.
> +             */
> +            if (bs->total_sectors != fleecing_bs->total_sectors) {
> +                error_setg(errp, "Size mismatch for '%s' - sector count %ld != %ld",
> +                           fleecing_devid, fleecing_bs->total_sectors, bs->total_sectors);
> +                goto err;
> +            }
> +            di->fleecing.bs = fleecing_bs;
> +        }
> +
>          di_list = g_list_append(di_list, di);
>          d++;
>      }
> @@ -660,6 +795,7 @@ UuidInfo coroutine_fn *qmp_backup(
>      const char *devlist,
>      bool has_speed, int64_t speed,
>      bool has_max_workers, int64_t max_workers,
> +    bool has_fleecing, bool fleecing,
>      Error **errp)
>  {
>      assert(qemu_in_coroutine());
> @@ -687,7 +823,7 @@ UuidInfo coroutine_fn *qmp_backup(
>      /* Todo: try to auto-detect format based on file name */
>      format = has_format ? format : BACKUP_FORMAT_VMA;
>
> -    di_list = get_device_info(devlist, &local_err);
> +    di_list = get_device_info(devlist, has_fleecing && fleecing, &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
>          goto err;
> @@ -1086,5 +1222,6 @@ ProxmoxSupportStatus *qmp_query_proxmox_support(Error **errp)
>      ret->query_bitmap_info = true;
>      ret->pbs_masterkey = true;
>      ret->backup_max_workers = true;
> +    ret->backup_fleecing = true;
>      return ret;
>  }
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 58fd637e86..0bc5f42677 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -933,6 +933,10 @@
>  #
>  # @max-workers: see @BackupPerf for details. Default 16.
>  #
> +# @fleecing: perform a backup with fleecing. For each device in @devlist, a
> +#     corresponding '-fleecing' device with the same size already needs to
> +#     be present.
> +#
>  # Returns: the uuid of the backup job
>  #
>  ##
> @@ -953,7 +957,8 @@
>              '*firewall-file': 'str',
>              '*devlist': 'str',
>              '*speed': 'int',
> -            '*max-workers': 'int' },
> +            '*max-workers': 'int',
> +            '*fleecing': 'bool' },
>    'returns': 'UuidInfo', 'coroutine': true }
>
>  ##
> @@ -1009,6 +1014,7 @@
>              'pbs-dirty-bitmap-migration': 'bool',
>              'pbs-masterkey': 'bool',
>              'pbs-library-version': 'str',
> +            'backup-fleecing': 'bool',
>              'backup-max-workers': 'bool' } }
>
>  ##
> --
> 2.39.2