From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 9E7D067417; Tue, 12 Jan 2021 11:54:15 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 93B0C236DD; Tue, 12 Jan 2021 11:54:15 +0100 (CET) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS id 745D6236D0; Tue, 12 Jan 2021 11:54:13 +0100 (CET) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 3B38E45062; Tue, 12 Jan 2021 11:54:13 +0100 (CET) Date: Tue, 12 Jan 2021 11:54:11 +0100 From: Wolfgang Bumiller To: Stefan Reiter Cc: pve-devel@lists.proxmox.com, pbs-devel@lists.proxmox.com Message-ID: <20210112105411.rnlyd3wwrbav472f@wobu-vie.proxmox.com> References: <20210111111409.32385-1-s.reiter@proxmox.com> <20210111111409.32385-4-s.reiter@proxmox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210111111409.32385-4-s.reiter@proxmox.com> User-Agent: NeoMutt/20180716 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.049 Adjusted score from AWL reputation of From: address KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Subject: Re: [pve-devel] [PATCH qemu 03/11] block: add alloc-track driver X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Jan 2021 10:54:15 -0000 one inline comment, otherwise lgtm On Mon, Jan 11, 2021 at 12:14:01PM +0100, Stefan Reiter wrote: > Add a new filter node 'alloc-track', which seperates reads and writes to > different children, thus allowing to put a backing image behind any > blockdev (regardless of driver support). Since we can't detect any > pre-allocated blocks, we can only track new writes, hence the write > target ('file') for this node must always be empty. > > Intended use case is for live restoring, i.e. add a backup image as a > block device into a VM, then put an alloc-track on the restore target > and set the backup as backing. With this, one can use a regular > 'block-stream' to restore the image, while the VM can already run in the > background. Copy-on-read will help make progress as the VM reads as > well. > > This only worked if the target supports backing images, so up until now > only for qcow2, with alloc-track any driver for the target can be used. > > If 'auto-remove' is set, alloc-track will automatically detach itself > once the backing image is removed. It will be replaced by 'file'. > > Signed-off-by: Stefan Reiter > --- > > Potentially for upstream as well, but let's maybe test it a bit here first ;) > > block/Makefile.objs | 1 + > block/alloc-track.c | 319 ++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 320 insertions(+) > create mode 100644 block/alloc-track.c > > diff --git a/block/Makefile.objs b/block/Makefile.objs > index 28fb3b7f7c..04f91b8cc9 100644 > --- a/block/Makefile.objs > +++ b/block/Makefile.objs > @@ -13,6 +13,7 @@ block-obj-$(CONFIG_QED) += qed-check.o > block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o > block-obj-y += quorum.o > block-obj-y += zeroinit.o > +block-obj-y += alloc-track.o > block-obj-y += blkdebug.o blkverify.o blkreplay.o > block-obj-$(CONFIG_PARALLELS) += parallels.o > block-obj-y += blklogwrites.o > diff --git a/block/alloc-track.c b/block/alloc-track.c > new file mode 100644 > index 0000000000..f2a4080579 > --- /dev/null > +++ b/block/alloc-track.c > @@ -0,0 +1,319 @@ > +/* > + * Filter to allow backing images to be applied to any node. Assumes a blank > + * image to begin with, only new writes are tracked as allocated, thus this > + * must never be put on a node that already contains data. > + * > + * Copyright (c) 2020 Proxmox Server Solutions GmbH > + * Copyright (c) 2020 Stefan Reiter > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. > + */ > + > +#include "qemu/osdep.h" > +#include "qapi/error.h" > +#include "block/block_int.h" > +#include "qapi/qmp/qdict.h" > +#include "qapi/qmp/qstring.h" > +#include "qemu/cutils.h" > +#include "qemu/option.h" > +#include "qemu/module.h" > + > +#define TRACK_OPT_AUTO_REMOVE "auto-remove" > + > +typedef struct { > + BdrvDirtyBitmap *bitmap; > + bool dropping; > + bool auto_remove; > +} BDRVAllocTrackState; > + > +static QemuOptsList runtime_opts = { > + .name = "alloc-track", > + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), > + .desc = { > + { > + .name = TRACK_OPT_AUTO_REMOVE, > + .type = QEMU_OPT_BOOL, > + .help = "automatically replace this node with 'file' when 'backing'" > + "is detached", > + }, > + { /* end of list */ } > + }, > +}; > + > +static void track_refresh_limits(BlockDriverState *bs, Error **errp) > +{ > + BlockDriverInfo bdi; > + > + if (!bs->file) { > + return; > + } > + > + /* always use alignment from underlying write device so RMW cycle for > + * bdrv_pwritev reads data from our backing via track_co_preadv (no partial > + * cluster allocation in 'file') */ > + bdrv_get_info(bs->file->bs, &bdi); > + bs->bl.request_alignment = MAX(bs->file->bs->bl.request_alignment, > + MAX(bdi.cluster_size, BDRV_SECTOR_SIZE)); > +} > + > +static int track_open(BlockDriverState *bs, QDict *options, int flags, > + Error **errp) > +{ > + BDRVAllocTrackState *s = bs->opaque; > + QemuOpts *opts; > + Error *local_err = NULL; > + int ret = 0; > + > + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); > + qemu_opts_absorb_qdict(opts, options, &local_err); > + if (local_err) { > + error_propagate(errp, local_err); > + ret = -EINVAL; > + goto fail; > + } > + > + s->auto_remove = qemu_opt_get_bool(opts, TRACK_OPT_AUTO_REMOVE, false); > + > + /* open the target (write) node, backing will be attached by block layer */ > + bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds, > + BDRV_CHILD_DATA | BDRV_CHILD_METADATA, false, > + &local_err); > + if (local_err) { > + ret = -EINVAL; > + error_propagate(errp, local_err); > + goto fail; > + } > + > + track_refresh_limits(bs, errp); > + uint64_t gran = bs->bl.request_alignment; > + s->bitmap = bdrv_create_dirty_bitmap(bs->file->bs, gran, NULL, &local_err); > + if (local_err) { > + ret = -EIO; > + error_propagate(errp, local_err); > + goto fail; > + } > + > + s->dropping = false; > + > +fail: > + if (ret < 0) { > + bdrv_unref_child(bs, bs->file); > + if (s->bitmap) { > + bdrv_release_dirty_bitmap(s->bitmap); > + } > + } > + qemu_opts_del(opts); > + return ret; > +} > + > +static void track_close(BlockDriverState *bs) > +{ > + BDRVAllocTrackState *s = bs->opaque; > + if (s->bitmap) { > + bdrv_release_dirty_bitmap(s->bitmap); > + } > +} > + > +static int64_t track_getlength(BlockDriverState *bs) > +{ > + return bdrv_getlength(bs->file->bs); > +} > + > +static int coroutine_fn track_co_preadv(BlockDriverState *bs, > + uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) > +{ > + BDRVAllocTrackState *s = bs->opaque; > + QEMUIOVector local_qiov; > + int ret; > + > + /* 'cur_offset' is relative to 'offset', 'local_offset' to image start */ > + uint64_t cur_offset, local_offset; > + int64_t local_bytes; > + bool alloc; > + > + /* a read request can span multiple granularity-sized chunks, and can thus > + * contain blocks with different allocation status - we could just iterate > + * granularity-wise, but for better performance use bdrv_dirty_bitmap_next_X > + * to find the next flip and consider everything up to that in one go */ > + for (cur_offset = 0; cur_offset < bytes; cur_offset += local_bytes) { > + local_offset = offset + cur_offset; > + alloc = bdrv_dirty_bitmap_get(s->bitmap, local_offset); > + if (alloc) { > + local_bytes = bdrv_dirty_bitmap_next_zero(s->bitmap, local_offset, > + bytes - cur_offset); > + } else { > + local_bytes = bdrv_dirty_bitmap_next_dirty(s->bitmap, local_offset, > + bytes - cur_offset); > + } > + > + /* _bitmap_next_X return is -1 if no end found within limit, otherwise > + * offset of next flip (to start of image) */ > + local_bytes = local_bytes < 0 ? > + bytes - cur_offset : > + local_bytes - local_offset; > + > + qemu_iovec_init_slice(&local_qiov, qiov, cur_offset, local_bytes); > + > + if (alloc) { > + ret = bdrv_co_preadv(bs->file, local_offset, local_bytes, > + &local_qiov, flags); > + } else if (bs->backing) { > + ret = bdrv_co_preadv(bs->backing, local_offset, local_bytes, > + &local_qiov, flags); > + } else { Here you handle the case where there's no backing file, but other functions (eg. track_co_block_status) simply assume that `bs->backing` is always valid. Is this intentional? It makes me also wonder - since you just zero out the iov below - if this driver could maybe also replace the current zeroinit filter? Otherwise it's at least a good reminder that the zeroinit code could start making use of bitmaps as well ;-) > + ret = qemu_iovec_memset(&local_qiov, cur_offset, 0, local_bytes); > + } > + > + if (ret != 0) { > + break; > + } > + } > + > + return ret; > +} > + > +static int coroutine_fn track_co_pwritev(BlockDriverState *bs, > + uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) > +{ > + return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags); > +} > + > +static int coroutine_fn track_co_pwrite_zeroes(BlockDriverState *bs, > + int64_t offset, int count, BdrvRequestFlags flags) > +{ > + return bdrv_pwrite_zeroes(bs->file, offset, count, flags); > +} > + > +static int coroutine_fn track_co_pdiscard(BlockDriverState *bs, > + int64_t offset, int count) > +{ > + return bdrv_co_pdiscard(bs->file, offset, count); > +} > + > +static coroutine_fn int track_co_flush(BlockDriverState *bs) > +{ > + return bdrv_co_flush(bs->file->bs); > +} > + > +static int coroutine_fn track_co_block_status(BlockDriverState *bs, > + bool want_zero, > + int64_t offset, > + int64_t bytes, > + int64_t *pnum, > + int64_t *map, > + BlockDriverState **file) > +{ > + BDRVAllocTrackState *s = bs->opaque; > + > + bool alloc = bdrv_dirty_bitmap_get(s->bitmap, offset); > + int64_t next_flipped; > + if (alloc) { > + next_flipped = bdrv_dirty_bitmap_next_zero(s->bitmap, offset, bytes); > + } else { > + next_flipped = bdrv_dirty_bitmap_next_dirty(s->bitmap, offset, bytes); > + } > + > + /* in case not the entire region has the same state, we need to set pnum to > + * indicate for how many bytes our result is valid */ > + *pnum = next_flipped == -1 ? bytes : next_flipped - offset; > + *map = offset; > + > + if (alloc) { > + *file = bs->file->bs; > + return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID; > + } else { > + *file = bs->backing->bs; > + return 0; > + } > +} > + > +static void track_child_perm(BlockDriverState *bs, BdrvChild *c, > + BdrvChildRole role, BlockReopenQueue *reopen_queue, > + uint64_t perm, uint64_t shared, > + uint64_t *nperm, uint64_t *nshared) > +{ > + BDRVAllocTrackState *s = bs->opaque; > + > + *nshared = BLK_PERM_ALL; > + > + /* in case we're currently dropping ourselves, claim to not use any > + * permissions at all - which is fine, since from this point on we will > + * never issue a read or write anymore */ > + if (s->dropping) { > + *nperm = 0; > + return; > + } > + > + if (role & BDRV_CHILD_DATA) { > + *nperm = perm & DEFAULT_PERM_PASSTHROUGH; > + } else { > + /* 'backing' is also a child of our BDS, but we don't expect it to be > + * writeable, so we only forward 'consistent read' */ > + *nperm = perm & BLK_PERM_CONSISTENT_READ; > + } > +} > + > +static void track_drop(void *opaque) > +{ > + BlockDriverState *bs = (BlockDriverState*)opaque; > + BDRVAllocTrackState *s = bs->opaque; > + > + bdrv_drained_begin(bs); > + > + s->dropping = true; > + > + bdrv_child_refresh_perms(bs, bs->file, &error_abort); > + bdrv_replace_node(bs, bs->file->bs, &error_abort); > + bdrv_set_backing_hd(bs, NULL, &error_abort); > + > + bdrv_drained_end(bs); > +} > + > +static int track_change_backing_file(BlockDriverState *bs, > + const char *backing_file, > + const char *backing_fmt) > +{ > + if (backing_file == NULL && backing_fmt == NULL) { > + /* backing file has been disconnected, there's no longer any use for > + * this node, so let's remove ourselves from the block graph - we need > + * to schedule this for later however, since when this function is > + * called, the blockjob modifying us is probably not done yet and has a > + * blocker on 'bs' */ > + aio_bh_schedule_oneshot(qemu_get_aio_context(), track_drop, (void*)bs); > + } > + > + return 0; > +} > + > +static BlockDriver bdrv_alloc_track = { > + .format_name = "alloc-track", > + .instance_size = sizeof(BDRVAllocTrackState), > + > + .bdrv_file_open = track_open, > + .bdrv_close = track_close, > + .bdrv_getlength = track_getlength, > + .bdrv_child_perm = track_child_perm, > + .bdrv_refresh_limits = track_refresh_limits, > + > + .bdrv_co_pwrite_zeroes = track_co_pwrite_zeroes, > + .bdrv_co_pwritev = track_co_pwritev, > + .bdrv_co_preadv = track_co_preadv, > + .bdrv_co_pdiscard = track_co_pdiscard, > + > + .bdrv_co_flush = track_co_flush, > + .bdrv_co_flush_to_disk = track_co_flush, > + > + .is_filter = true, > + .supports_backing = true, > + > + .bdrv_co_block_status = track_co_block_status, > + .bdrv_change_backing_file = track_change_backing_file, > +}; > + > +static void bdrv_alloc_track_init(void) > +{ > + bdrv_register(&bdrv_alloc_track); > +} > + > +block_init(bdrv_alloc_track_init); > -- > 2.20.1