From: Wolfgang Bumiller
To: Fiona Ebner
Cc: pve-devel@lists.proxmox.com
Date: Mon, 8 Apr 2024 14:45:44 +0200
Subject: Re: [pve-devel] [PATCH qemu v2 07/21] PVE backup: add fleecing option
In-Reply-To: <20240315102502.84163-8-f.ebner@proxmox.com>
References: <20240315102502.84163-1-f.ebner@proxmox.com> <20240315102502.84163-8-f.ebner@proxmox.com>

On Fri, Mar 15, 2024 at 11:24:48AM +0100, Fiona Ebner wrote:
> When a fleecing option is given, it is expected that each device has
> a corresponding "-fleecing" block device already attached, except for
> EFI disk and TPM state, where fleecing is never used.
>
> The following graph was adapted from [0], which also contains more
> details about fleecing.
>
> [guest]
>    |
>    | root
>    v                 file
> [copy-before-write]<------[snapshot-access]
>    |           |
>    | file      | target
>    v           v
> [source]    [fleecing]
>
> For fleecing, a copy-before-write filter is inserted on top of the
> source node, as well as a snapshot-access node pointing to the filter
> node, which allows reading the consistent state of the image at the
> time it was inserted. New guest writes are passed through the
> copy-before-write filter, which will first copy over old data to the
> fleecing image in case that old data is still needed by the
> snapshot-access node.
>
> The backup process will sequentially read from the snapshot access,
> which has a bitmap and knows whether to read from the original image
> or the fleecing image to get the "snapshot" state, i.e. data from the
> source image at the time when the copy-before-write filter was
> inserted. After reading, the copied sections are discarded from the
> fleecing image to reduce space usage.
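
To make the two paths described above concrete, here is a minimal
conceptual model -- plain C invented purely for illustration, not the
QEMU block-layer API; the per-cluster "copied" array stands in for the
real dirty bitmap:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Simplified model: one flag per cluster instead of a real bitmap. */
    typedef struct FleecingModel {
        uint8_t *source;       /* live image, changes under guest writes */
        uint8_t *fleecing;     /* receives old data displaced by writes */
        bool *copied;          /* old data already preserved for this cluster? */
        uint64_t cluster_size;
    } FleecingModel;

    /* Guest write path: copy-before-write preserves the old data before
     * letting the new write through to the source image. */
    static void guest_write(FleecingModel *m, uint64_t cluster, const uint8_t *data)
    {
        uint64_t off = cluster * m->cluster_size;
        if (!m->copied[cluster]) {
            memcpy(m->fleecing + off, m->source + off, m->cluster_size);
            m->copied[cluster] = true;
        }
        memcpy(m->source + off, data, m->cluster_size);
    }

    /* Backup read path: snapshot-access serves the state from the moment
     * the filter was inserted, from whichever image still holds it. */
    static const uint8_t *snapshot_read(FleecingModel *m, uint64_t cluster)
    {
        uint64_t off = cluster * m->cluster_size;
        return m->copied[cluster] ? m->fleecing + off : m->source + off;
    }

Once the backup has read a cluster via the snapshot-access node, the
real implementation can discard it from the fleecing image, which is
what the cluster-size discussion below is about.
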
>
> All of this can be restricted by an initial dirty bitmap to parts of
> the source image that are required for an incremental backup.
>
> For discard to work, it is necessary that the fleecing image does not
> have a larger cluster size than the backup job granularity. Since
> querying that size does not always work (e.g. for RBD with krbd, the
> cluster size is not reported), a minimum of 4 MiB is used. A job with
> a PBS target already has at least this granularity, so it's only
> relevant for other targets, i.e. edge cases where this minimum is not
> enough should be very rare in practice. If ever necessary in the
> future, a passed-in value for the backup QMP command could still be
> added as an override.
>
> Additionally, the cbw-timeout and on-cbw-error=break-snapshot options
> are set when installing the copy-before-write filter and
> snapshot-access. When an error or timeout occurs, the problematic (and
> each subsequent) snapshot operation will fail and thus cancel the
> backup instead of breaking the guest write.
>
> Note that job_id cannot be inferred from the snapshot-access bs
> because it has no parent, so just pass the one from the original bs.
>
> [0]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg876056.html
>
> Signed-off-by: Fiona Ebner
> ---
>
> Changes in v2:
>     * specify minimum cluster size for backup job
>
>  block/monitor/block-hmp-cmds.c |   1 +
>  pve-backup.c                   | 143 ++++++++++++++++++++++++++++++++-
>  qapi/block-core.json           |   8 +-
>  3 files changed, 148 insertions(+), 4 deletions(-)
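
As an aside on the cluster-size rule from the commit message: the
clamping amounts to MAX(4 MiB, reported cluster size). A standalone
sketch of just that rule -- helper name and shape invented here for
illustration, not part of the patch:

    #include <stdint.h>

    #define FLEECING_MIN_CLUSTER_SIZE (4 * 1024 * 1024) /* 4 MiB floor */

    /* reported <= 0 models bdrv_get_info() not reporting a cluster size,
     * as happens for RBD via krbd. */
    static uint64_t backup_job_min_cluster_size(int64_t reported)
    {
        if (reported > (int64_t)FLEECING_MIN_CLUSTER_SIZE) {
            return (uint64_t)reported;
        }
        return FLEECING_MIN_CLUSTER_SIZE;
    }

So a fleecing image with 64 KiB clusters still gets the 4 MiB floor
(covering the krbd case where nothing is reported), while one with
8 MiB clusters raises the job granularity to 8 MiB so that discards
can free whole clusters.
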
>
> diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
> index 6efe28cef5..ca29cc4281 100644
> --- a/block/monitor/block-hmp-cmds.c
> +++ b/block/monitor/block-hmp-cmds.c
> @@ -1064,6 +1064,7 @@ void coroutine_fn hmp_backup(Monitor *mon, const QDict *qdict)
>          NULL, NULL,
>          devlist, qdict_haskey(qdict, "speed"), speed,
>          false, 0, // BackupPerf max-workers
> +        false, false, // fleecing
>          &error);
>
>      hmp_handle_error(mon, error);
> diff --git a/pve-backup.c b/pve-backup.c
> index e6b17b797e..00aaff6509 100644
> --- a/pve-backup.c
> +++ b/pve-backup.c
> @@ -7,8 +7,10 @@
>  #include "sysemu/blockdev.h"
>  #include "block/block_int-global-state.h"
>  #include "block/blockjob.h"
> +#include "block/copy-before-write.h"
>  #include "block/dirty-bitmap.h"
>  #include "qapi/qapi-commands-block.h"
> +#include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qerror.h"
>  #include "qemu/cutils.h"
>
> @@ -80,8 +82,15 @@ static void pvebackup_init(void)
>  // initialize PVEBackupState at startup
>  opts_init(pvebackup_init);
>
> +typedef struct PVEBackupFleecingInfo {
> +    BlockDriverState *bs;
> +    BlockDriverState *cbw;
> +    BlockDriverState *snapshot_access;
> +} PVEBackupFleecingInfo;
> +
>  typedef struct PVEBackupDevInfo {
>      BlockDriverState *bs;
> +    PVEBackupFleecingInfo fleecing;
>      size_t size;
>      uint64_t block_size;
>      uint8_t dev_id;
> @@ -361,6 +370,25 @@ static void pvebackup_complete_cb(void *opaque, int ret)
>      PVEBackupDevInfo *di = opaque;
>      di->completed_ret = ret;
>
> +    /*
> +     * Handle block-graph specific cleanup (for fleecing) outside of the coroutine, because the
> +     * work won't be done as a coroutine anyway:
> +     * - For snapshot_access, this allows doing bdrv_unref() directly. Doing it via
> +     *   bdrv_co_unref() would just spawn a BH calling bdrv_unref().
> +     * - For cbw, draining would need to spawn a BH.
> +     *
> +     * Note that the AioContext lock is already acquired by our caller, i.e.
> +     * job_finalize_single_locked().
> +     */
> +    if (di->fleecing.snapshot_access) {
> +        bdrv_unref(di->fleecing.snapshot_access);
> +        di->fleecing.snapshot_access = NULL;
> +    }
> +    if (di->fleecing.cbw) {
> +        bdrv_cbw_drop(di->fleecing.cbw);
> +        di->fleecing.cbw = NULL;
> +    }
> +
>      /*
>       * Schedule stream cleanup in async coroutine. close_image and finish might
>       * take a while, so we can't block on them here. This way it also doesn't
> @@ -521,9 +549,82 @@ static void create_backup_jobs_bh(void *opaque) {
>
>          bdrv_drained_begin(di->bs);
>
> +        BackupPerf perf = (BackupPerf){ .max_workers = backup_state.perf.max_workers };
> +
> +        BlockDriverState *source_bs = di->bs;
> +        bool discard_source = false;
> +        const char *job_id = bdrv_get_device_name(di->bs);
> +        if (di->fleecing.bs) {
> +            QDict *cbw_opts = qdict_new();
> +            qdict_put_str(cbw_opts, "driver", "copy-before-write");
> +            qdict_put_str(cbw_opts, "file", bdrv_get_node_name(di->bs));
> +            qdict_put_str(cbw_opts, "target", bdrv_get_node_name(di->fleecing.bs));
> +
> +            if (di->bitmap) {
> +                /*
> +                 * Only guest writes to parts relevant for the backup need to be intercepted with
> +                 * old data being copied to the fleecing image.
> +                 */
> +                qdict_put_str(cbw_opts, "bitmap.node", bdrv_get_node_name(di->bs));
> +                qdict_put_str(cbw_opts, "bitmap.name", bdrv_dirty_bitmap_name(di->bitmap));
> +            }
> +            /*
> +             * Fleecing storage is supposed to be fast and it's better to break the backup than
> +             * guest writes. Certain guest drivers like VirtIO-win have a 60 second timeout by
> +             * default, so abort a bit before that.
> +             */
> +            qdict_put_str(cbw_opts, "on-cbw-error", "break-snapshot");
> +            qdict_put_int(cbw_opts, "cbw-timeout", 45);
> +
> +            di->fleecing.cbw = bdrv_insert_node(di->bs, cbw_opts, BDRV_O_RDWR, &local_err);
> +
> +            if (!di->fleecing.cbw) {
> +                error_setg(errp, "appending cbw node for fleecing failed: %s",
> +                           local_err ? error_get_pretty(local_err) : "unknown error");
> +                break;
> +            }
> +
> +            QDict *snapshot_access_opts = qdict_new();
> +            qdict_put_str(snapshot_access_opts, "driver", "snapshot-access");
> +            qdict_put_str(snapshot_access_opts, "file", bdrv_get_node_name(di->fleecing.cbw));
> +
> +            /*
> +             * Holding the AioContext lock here would cause a deadlock, because bdrv_open_driver()
> +             * will acquire it a second time. But it's allowed to be held exactly once when
> +             * polling and that happens when the bdrv_refresh_total_sectors() call is made there.
> +             */
> +            aio_context_release(aio_context);
> +            di->fleecing.snapshot_access =
> +                bdrv_open(NULL, NULL, snapshot_access_opts, BDRV_O_RDWR | BDRV_O_UNMAP, &local_err);
> +            aio_context_acquire(aio_context);
> +            if (!di->fleecing.snapshot_access) {
> +                error_setg(errp, "setting up snapshot access for fleecing failed: %s",
> +                           local_err ? error_get_pretty(local_err) : "unknown error");
> +                break;
> +            }
> +            source_bs = di->fleecing.snapshot_access;
> +            discard_source = true;
> +
> +            /*
> +             * bdrv_get_info() just returns 0 (= doesn't matter) for RBD when using krbd. But
> +             * discard on the fleecing image won't work if the backup job's granularity is less
> +             * than the RBD object size (default 4 MiB), so it does matter. Always use at least
> +             * 4 MiB. With a PBS target, the backup job granularity would already be at least
> +             * this much.
> +             */
> +            perf.min_cluster_size = 4 * 1024 * 1024;
> +            /*
> +             * For discard to work, cluster size for the backup job must be at least the same as
> +             * for the fleecing image.
> +             */
> +            BlockDriverInfo bdi;
> +            if (bdrv_get_info(di->fleecing.bs, &bdi) >= 0) {
> +                perf.min_cluster_size = MAX(perf.min_cluster_size, bdi.cluster_size);
> +            }
> +        }
> +
>          BlockJob *job = backup_job_create(
> -            NULL, di->bs, di->target, backup_state.speed, sync_mode, di->bitmap,
> -            bitmap_mode, false, NULL, &backup_state.perf, BLOCKDEV_ON_ERROR_REPORT,
> +            job_id, source_bs, di->target, backup_state.speed, sync_mode, di->bitmap,
> +            bitmap_mode, false, discard_source, NULL, &perf, BLOCKDEV_ON_ERROR_REPORT,
>              BLOCKDEV_ON_ERROR_REPORT, JOB_DEFAULT, pvebackup_complete_cb, di, backup_state.txn,
>              &local_err);
>
> @@ -581,6 +682,14 @@ static void create_backup_jobs_bh(void *opaque) {
>      aio_co_enter(data->ctx, data->co);
>  }
>
> +/*
> + * EFI disk and TPM state are small and it's just not worth setting up fleecing for them.
> + */
> +static bool device_uses_fleecing(const char *device_id)

Do we really want this? IMO we already have enough code trying to
distinguish "real" disks from efidisks and tpmstate files.
AFAICT we do check whether the hmp command to *create* the fleecing
drives actually works, so... (see below)

> +{
> +    return strncmp(device_id, "drive-efidisk", 13) && strncmp(device_id, "drive-tpmstate", 14);
> +}
> +
>  /*
>   * Returns a list of device infos, which needs to be freed by the caller. In
>   * case of an error, errp will be set, but the returned value might still be a
> @@ -588,6 +697,7 @@ static void create_backup_jobs_bh(void *opaque) {
>   */
>  static GList coroutine_fn *get_device_info(
>      const char *devlist,
> +    bool fleecing,
>      Error **errp)
>  {
>      gchar **devs = NULL;
> @@ -611,6 +721,31 @@ static GList coroutine_fn *get_device_info(
>          }
>          PVEBackupDevInfo *di = g_new0(PVEBackupDevInfo, 1);
>          di->bs = bs;
> +
> +        if (fleecing && device_uses_fleecing(*d)) {
> +            g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
> +            BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
> +            if (!fleecing_blk) {

...so instead of this, we could just treat the absence of a fleecing
BlockBackend *not* as an error, but as deliberate?
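
Rough sketch of that alternative, purely to illustrate the shape (not
a tested change; the medium and size checks would stay as in the
patch):

    if (fleecing) {
        g_autofree gchar *fleecing_devid = g_strconcat(*d, "-fleecing", NULL);
        BlockBackend *fleecing_blk = blk_by_name(fleecing_devid);
        if (fleecing_blk) {
            /* A fleecing drive was attached for this device, so use it. */
            BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
            if (!bdrv_co_is_inserted(fleecing_bs)) {
                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
                goto err;
            }
            /* size-mismatch check as in the patch ... */
            di->fleecing.bs = fleecing_bs;
        }
        /* Its absence simply means fleecing was deliberately not set up
         * for this device (e.g. EFI disk/TPM state), so the
         * device_uses_fleecing() prefix check would no longer be needed. */
    }
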
> +                error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> +                          "Device '%s' not found", fleecing_devid);
> +                goto err;
> +            }
> +            BlockDriverState *fleecing_bs = blk_bs(fleecing_blk);
> +            if (!bdrv_co_is_inserted(fleecing_bs)) {
> +                error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, fleecing_devid);
> +                goto err;
> +            }
> +            /*
> +             * The fleecing image needs to be the same size to act as a cbw target.
> +             */
> +            if (bs->total_sectors != fleecing_bs->total_sectors) {
> +                error_setg(errp, "Size mismatch for '%s' - sector count %ld != %ld",
> +                           fleecing_devid, fleecing_bs->total_sectors, bs->total_sectors);
> +                goto err;
> +            }
> +            di->fleecing.bs = fleecing_bs;
> +        }
> +
>          di_list = g_list_append(di_list, di);
>          d++;
>      }
> @@ -660,6 +795,7 @@ UuidInfo coroutine_fn *qmp_backup(
>      const char *devlist,
>      bool has_speed, int64_t speed,
>      bool has_max_workers, int64_t max_workers,
> +    bool has_fleecing, bool fleecing,
>      Error **errp)
>  {
>      assert(qemu_in_coroutine());
> @@ -687,7 +823,7 @@ UuidInfo coroutine_fn *qmp_backup(
>      /* Todo: try to auto-detect format based on file name */
>      format = has_format ? format : BACKUP_FORMAT_VMA;
>
> -    di_list = get_device_info(devlist, &local_err);
> +    di_list = get_device_info(devlist, has_fleecing && fleecing, &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
>          goto err;
> @@ -1086,5 +1222,6 @@ ProxmoxSupportStatus *qmp_query_proxmox_support(Error **errp)
>      ret->query_bitmap_info = true;
>      ret->pbs_masterkey = true;
>      ret->backup_max_workers = true;
> +    ret->backup_fleecing = true;
>      return ret;
>  }
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 58fd637e86..0bc5f42677 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -933,6 +933,10 @@
>  #
>  # @max-workers: see @BackupPerf for details. Default 16.
>  #
> +# @fleecing: perform a backup with fleecing. For each device in @devlist, a
> +#     corresponding '-fleecing' device with the same size already needs to
> +#     be present.
> +#
>  # Returns: the uuid of the backup job
>  #
>  ##
> @@ -953,7 +957,8 @@
>              '*firewall-file': 'str',
>              '*devlist': 'str',
>              '*speed': 'int',
> -            '*max-workers': 'int' },
> +            '*max-workers': 'int',
> +            '*fleecing': 'bool' },
>    'returns': 'UuidInfo', 'coroutine': true }
>
>  ##
> @@ -1009,6 +1014,7 @@
>              'pbs-dirty-bitmap-migration': 'bool',
>              'pbs-masterkey': 'bool',
>              'pbs-library-version': 'str',
> +            'backup-fleecing': 'bool',
>              'backup-max-workers': 'bool' } }
>
>  ##
> --
> 2.39.2