From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 9F2EA67612; Tue, 12 Jan 2021 14:42:52 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 9142425DDA; Tue, 12 Jan 2021 14:42:52 +0100 (CET) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS id 8D13225DD0; Tue, 12 Jan 2021 14:42:51 +0100 (CET) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 5548144BF6; Tue, 12 Jan 2021 14:42:51 +0100 (CET) Date: Tue, 12 Jan 2021 14:42:49 +0100 From: Wolfgang Bumiller To: Stefan Reiter Cc: pve-devel@lists.proxmox.com, pbs-devel@lists.proxmox.com Message-ID: <20210112134249.jkubsnxkexxt6bbh@wobu-vie.proxmox.com> References: <20210111111409.32385-1-s.reiter@proxmox.com> <20210111111409.32385-4-s.reiter@proxmox.com> <20210112105411.rnlyd3wwrbav472f@wobu-vie.proxmox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20180716 X-SPAM-LEVEL: Spam detection results: 0 AWL 0.048 Adjusted score from AWL reputation of From: address KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Subject: Re: [pve-devel] [PATCH qemu 03/11] block: add alloc-track driver X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Jan 2021 13:42:52 -0000 On Tue, Jan 12, 2021 at 12:29:05PM +0100, Stefan Reiter wrote: > On 12/01/2021 11:54, Wolfgang Bumiller wrote: > > one inline comment, otherwise lgtm > > > > On Mon, Jan 11, 2021 at 12:14:01PM +0100, Stefan Reiter wrote: > > > Add a new filter node 'alloc-track', which seperates reads and writes to > > > different children, thus allowing to put a backing image behind any > > > blockdev (regardless of driver support). Since we can't detect any > > > pre-allocated blocks, we can only track new writes, hence the write > > > target ('file') for this node must always be empty. > > > > > > Intended use case is for live restoring, i.e. add a backup image as a > > > block device into a VM, then put an alloc-track on the restore target > > > and set the backup as backing. With this, one can use a regular > > > 'block-stream' to restore the image, while the VM can already run in the > > > background. Copy-on-read will help make progress as the VM reads as > > > well. > > > > > > This only worked if the target supports backing images, so up until now > > > only for qcow2, with alloc-track any driver for the target can be used. > > > > > > If 'auto-remove' is set, alloc-track will automatically detach itself > > > once the backing image is removed. It will be replaced by 'file'. > > > > > > Signed-off-by: Stefan Reiter > > > --- > > > > > > Potentially for upstream as well, but let's maybe test it a bit here first ;) > > > > > > block/Makefile.objs | 1 + > > > block/alloc-track.c | 319 ++++++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 320 insertions(+) > > > create mode 100644 block/alloc-track.c > > > > > > diff --git a/block/Makefile.objs b/block/Makefile.objs > > > index 28fb3b7f7c..04f91b8cc9 100644 > > > --- a/block/Makefile.objs > > > +++ b/block/Makefile.objs > > > @@ -13,6 +13,7 @@ block-obj-$(CONFIG_QED) += qed-check.o > > > block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o > > > block-obj-y += quorum.o > > > block-obj-y += zeroinit.o > > > +block-obj-y += alloc-track.o > > > block-obj-y += blkdebug.o blkverify.o blkreplay.o > > > block-obj-$(CONFIG_PARALLELS) += parallels.o > > > block-obj-y += blklogwrites.o > > > diff --git a/block/alloc-track.c b/block/alloc-track.c > > > new file mode 100644 > > > index 0000000000..f2a4080579 > > > --- /dev/null > > > +++ b/block/alloc-track.c > > > @@ -0,0 +1,319 @@ > > > +/* > > > + * Filter to allow backing images to be applied to any node. Assumes a blank > > > + * image to begin with, only new writes are tracked as allocated, thus this > > > + * must never be put on a node that already contains data. > > > + * > > > + * Copyright (c) 2020 Proxmox Server Solutions GmbH > > > + * Copyright (c) 2020 Stefan Reiter > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > > > + * See the COPYING file in the top-level directory. > > > + */ > > > + > > > +#include "qemu/osdep.h" > > > +#include "qapi/error.h" > > > +#include "block/block_int.h" > > > +#include "qapi/qmp/qdict.h" > > > +#include "qapi/qmp/qstring.h" > > > +#include "qemu/cutils.h" > > > +#include "qemu/option.h" > > > +#include "qemu/module.h" > > > + > > > +#define TRACK_OPT_AUTO_REMOVE "auto-remove" > > > + > > > +typedef struct { > > > + BdrvDirtyBitmap *bitmap; > > > + bool dropping; > > > + bool auto_remove; > > > +} BDRVAllocTrackState; > > > + > > > +static QemuOptsList runtime_opts = { > > > + .name = "alloc-track", > > > + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), > > > + .desc = { > > > + { > > > + .name = TRACK_OPT_AUTO_REMOVE, > > > + .type = QEMU_OPT_BOOL, > > > + .help = "automatically replace this node with 'file' when 'backing'" > > > + "is detached", > > > + }, > > > + { /* end of list */ } > > > + }, > > > +}; > > > + > > > +static void track_refresh_limits(BlockDriverState *bs, Error **errp) > > > +{ > > > + BlockDriverInfo bdi; > > > + > > > + if (!bs->file) { > > > + return; > > > + } > > > + > > > + /* always use alignment from underlying write device so RMW cycle for > > > + * bdrv_pwritev reads data from our backing via track_co_preadv (no partial > > > + * cluster allocation in 'file') */ > > > + bdrv_get_info(bs->file->bs, &bdi); > > > + bs->bl.request_alignment = MAX(bs->file->bs->bl.request_alignment, > > > + MAX(bdi.cluster_size, BDRV_SECTOR_SIZE)); > > > +} > > > + > > > +static int track_open(BlockDriverState *bs, QDict *options, int flags, > > > + Error **errp) > > > +{ > > > + BDRVAllocTrackState *s = bs->opaque; > > > + QemuOpts *opts; > > > + Error *local_err = NULL; > > > + int ret = 0; > > > + > > > + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); > > > + qemu_opts_absorb_qdict(opts, options, &local_err); > > > + if (local_err) { > > > + error_propagate(errp, local_err); > > > + ret = -EINVAL; > > > + goto fail; > > > + } > > > + > > > + s->auto_remove = qemu_opt_get_bool(opts, TRACK_OPT_AUTO_REMOVE, false); > > > + > > > + /* open the target (write) node, backing will be attached by block layer */ > > > + bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds, > > > + BDRV_CHILD_DATA | BDRV_CHILD_METADATA, false, > > > + &local_err); > > > + if (local_err) { > > > + ret = -EINVAL; > > > + error_propagate(errp, local_err); > > > + goto fail; > > > + } > > > + > > > + track_refresh_limits(bs, errp); > > > + uint64_t gran = bs->bl.request_alignment; > > > + s->bitmap = bdrv_create_dirty_bitmap(bs->file->bs, gran, NULL, &local_err); > > > + if (local_err) { > > > + ret = -EIO; > > > + error_propagate(errp, local_err); > > > + goto fail; > > > + } > > > + > > > + s->dropping = false; > > > + > > > +fail: > > > + if (ret < 0) { > > > + bdrv_unref_child(bs, bs->file); > > > + if (s->bitmap) { > > > + bdrv_release_dirty_bitmap(s->bitmap); > > > + } > > > + } > > > + qemu_opts_del(opts); > > > + return ret; > > > +} > > > + > > > +static void track_close(BlockDriverState *bs) > > > +{ > > > + BDRVAllocTrackState *s = bs->opaque; > > > + if (s->bitmap) { > > > + bdrv_release_dirty_bitmap(s->bitmap); > > > + } > > > +} > > > + > > > +static int64_t track_getlength(BlockDriverState *bs) > > > +{ > > > + return bdrv_getlength(bs->file->bs); > > > +} > > > + > > > +static int coroutine_fn track_co_preadv(BlockDriverState *bs, > > > + uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) > > > +{ > > > + BDRVAllocTrackState *s = bs->opaque; > > > + QEMUIOVector local_qiov; > > > + int ret; > > > + > > > + /* 'cur_offset' is relative to 'offset', 'local_offset' to image start */ > > > + uint64_t cur_offset, local_offset; > > > + int64_t local_bytes; > > > + bool alloc; > > > + > > > + /* a read request can span multiple granularity-sized chunks, and can thus > > > + * contain blocks with different allocation status - we could just iterate > > > + * granularity-wise, but for better performance use bdrv_dirty_bitmap_next_X > > > + * to find the next flip and consider everything up to that in one go */ > > > + for (cur_offset = 0; cur_offset < bytes; cur_offset += local_bytes) { > > > + local_offset = offset + cur_offset; > > > + alloc = bdrv_dirty_bitmap_get(s->bitmap, local_offset); > > > + if (alloc) { > > > + local_bytes = bdrv_dirty_bitmap_next_zero(s->bitmap, local_offset, > > > + bytes - cur_offset); > > > + } else { > > > + local_bytes = bdrv_dirty_bitmap_next_dirty(s->bitmap, local_offset, > > > + bytes - cur_offset); > > > + } > > > + > > > + /* _bitmap_next_X return is -1 if no end found within limit, otherwise > > > + * offset of next flip (to start of image) */ > > > + local_bytes = local_bytes < 0 ? > > > + bytes - cur_offset : > > > + local_bytes - local_offset; > > > + > > > + qemu_iovec_init_slice(&local_qiov, qiov, cur_offset, local_bytes); > > > + > > > + if (alloc) { > > > + ret = bdrv_co_preadv(bs->file, local_offset, local_bytes, > > > + &local_qiov, flags); > > > + } else if (bs->backing) { > > > + ret = bdrv_co_preadv(bs->backing, local_offset, local_bytes, > > > + &local_qiov, flags); > > > + } else { > > > > Here you handle the case where there's no backing file, but other > > functions (eg. track_co_block_status) simply assume that `bs->backing` > > is always valid. > > > > Is this intentional? > > > > Not intentional, but the only place where bs->backing is assumed is in > track_co_block_status. Fix for this is simple enough, since according to the > comment in 'include/block/block_int.h' when you return 0, the block layer > automatically checks if a backing image is attached and assumes zero > otherwise. > > So basically just to avoid a potential NULL deref: > > ----------------- > diff --git a/block/alloc-track.c b/block/alloc-track.c > index f2a4080579..b1a41be05a 100644 > --- a/block/alloc-track.c > +++ b/block/alloc-track.c > @@ -222,10 +222,10 @@ static int coroutine_fn > track_co_block_status(BlockDriverState *bs, > if (alloc) { > *file = bs->file->bs; > return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID; > - } else { > + } else if (bs->backing) { > *file = bs->backing->bs; > - return 0; > } > + return 0; > } > > static void track_child_perm(BlockDriverState *bs, BdrvChild *c, > ----------------- > > Yeah? > > > It makes me also wonder - since you just zero out the iov below - if > > this driver could maybe also replace the current zeroinit filter? > > Otherwise it's at least a good reminder that the zeroinit code could > > start making use of bitmaps as well ;-) > > > > But zeroinit operates on writes, not reads? Not sure how this driver would > help... It tracks write offsets, this one tracks writes in general. The main task of zeroinit is to make `bdrv_co_pwrite_zeroes` a nop if the destination isn't "dirty" (under the assumption that its initial state is all zeroed out already). So the difference would just be that there's no backing file and that `pwrite_zeroes` checks the bitmap. But yeah, come to think of it, that might be conflating too many things into one driver.