From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id C5ADC6753B; Tue, 12 Jan 2021 12:29:38 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id BA60124389; Tue, 12 Jan 2021 12:29:08 +0100 (CET) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS id 678192437C; Tue, 12 Jan 2021 12:29:07 +0100 (CET) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 2F13B448F0; Tue, 12 Jan 2021 12:29:07 +0100 (CET) To: Wolfgang Bumiller Cc: pve-devel@lists.proxmox.com, pbs-devel@lists.proxmox.com References: <20210111111409.32385-1-s.reiter@proxmox.com> <20210111111409.32385-4-s.reiter@proxmox.com> <20210112105411.rnlyd3wwrbav472f@wobu-vie.proxmox.com> From: Stefan Reiter Message-ID: Date: Tue, 12 Jan 2021 12:29:05 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0 MIME-Version: 1.0 In-Reply-To: <20210112105411.rnlyd3wwrbav472f@wobu-vie.proxmox.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-SPAM-LEVEL: Spam detection results: 0 AWL -0.039 Adjusted score from AWL reputation of From: address KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment NICE_REPLY_A -0.001 Looks like a legit reply (A) RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Subject: Re: [pve-devel] [PATCH qemu 03/11] block: add alloc-track driver X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Jan 2021 11:29:38 -0000 On 12/01/2021 11:54, Wolfgang Bumiller wrote: > one inline comment, otherwise lgtm > > On Mon, Jan 11, 2021 at 12:14:01PM +0100, Stefan Reiter wrote: >> Add a new filter node 'alloc-track', which seperates reads and writes to >> different children, thus allowing to put a backing image behind any >> blockdev (regardless of driver support). Since we can't detect any >> pre-allocated blocks, we can only track new writes, hence the write >> target ('file') for this node must always be empty. >> >> Intended use case is for live restoring, i.e. add a backup image as a >> block device into a VM, then put an alloc-track on the restore target >> and set the backup as backing. With this, one can use a regular >> 'block-stream' to restore the image, while the VM can already run in the >> background. Copy-on-read will help make progress as the VM reads as >> well. >> >> This only worked if the target supports backing images, so up until now >> only for qcow2, with alloc-track any driver for the target can be used. >> >> If 'auto-remove' is set, alloc-track will automatically detach itself >> once the backing image is removed. It will be replaced by 'file'. >> >> Signed-off-by: Stefan Reiter >> --- >> >> Potentially for upstream as well, but let's maybe test it a bit here first ;) >> >> block/Makefile.objs | 1 + >> block/alloc-track.c | 319 ++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 320 insertions(+) >> create mode 100644 block/alloc-track.c >> >> diff --git a/block/Makefile.objs b/block/Makefile.objs >> index 28fb3b7f7c..04f91b8cc9 100644 >> --- a/block/Makefile.objs >> +++ b/block/Makefile.objs >> @@ -13,6 +13,7 @@ block-obj-$(CONFIG_QED) += qed-check.o >> block-obj-y += vhdx.o vhdx-endian.o vhdx-log.o >> block-obj-y += quorum.o >> block-obj-y += zeroinit.o >> +block-obj-y += alloc-track.o >> block-obj-y += blkdebug.o blkverify.o blkreplay.o >> block-obj-$(CONFIG_PARALLELS) += parallels.o >> block-obj-y += blklogwrites.o >> diff --git a/block/alloc-track.c b/block/alloc-track.c >> new file mode 100644 >> index 0000000000..f2a4080579 >> --- /dev/null >> +++ b/block/alloc-track.c >> @@ -0,0 +1,319 @@ >> +/* >> + * Filter to allow backing images to be applied to any node. Assumes a blank >> + * image to begin with, only new writes are tracked as allocated, thus this >> + * must never be put on a node that already contains data. >> + * >> + * Copyright (c) 2020 Proxmox Server Solutions GmbH >> + * Copyright (c) 2020 Stefan Reiter >> + * >> + * This work is licensed under the terms of the GNU GPL, version 2 or later. >> + * See the COPYING file in the top-level directory. >> + */ >> + >> +#include "qemu/osdep.h" >> +#include "qapi/error.h" >> +#include "block/block_int.h" >> +#include "qapi/qmp/qdict.h" >> +#include "qapi/qmp/qstring.h" >> +#include "qemu/cutils.h" >> +#include "qemu/option.h" >> +#include "qemu/module.h" >> + >> +#define TRACK_OPT_AUTO_REMOVE "auto-remove" >> + >> +typedef struct { >> + BdrvDirtyBitmap *bitmap; >> + bool dropping; >> + bool auto_remove; >> +} BDRVAllocTrackState; >> + >> +static QemuOptsList runtime_opts = { >> + .name = "alloc-track", >> + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head), >> + .desc = { >> + { >> + .name = TRACK_OPT_AUTO_REMOVE, >> + .type = QEMU_OPT_BOOL, >> + .help = "automatically replace this node with 'file' when 'backing'" >> + "is detached", >> + }, >> + { /* end of list */ } >> + }, >> +}; >> + >> +static void track_refresh_limits(BlockDriverState *bs, Error **errp) >> +{ >> + BlockDriverInfo bdi; >> + >> + if (!bs->file) { >> + return; >> + } >> + >> + /* always use alignment from underlying write device so RMW cycle for >> + * bdrv_pwritev reads data from our backing via track_co_preadv (no partial >> + * cluster allocation in 'file') */ >> + bdrv_get_info(bs->file->bs, &bdi); >> + bs->bl.request_alignment = MAX(bs->file->bs->bl.request_alignment, >> + MAX(bdi.cluster_size, BDRV_SECTOR_SIZE)); >> +} >> + >> +static int track_open(BlockDriverState *bs, QDict *options, int flags, >> + Error **errp) >> +{ >> + BDRVAllocTrackState *s = bs->opaque; >> + QemuOpts *opts; >> + Error *local_err = NULL; >> + int ret = 0; >> + >> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); >> + qemu_opts_absorb_qdict(opts, options, &local_err); >> + if (local_err) { >> + error_propagate(errp, local_err); >> + ret = -EINVAL; >> + goto fail; >> + } >> + >> + s->auto_remove = qemu_opt_get_bool(opts, TRACK_OPT_AUTO_REMOVE, false); >> + >> + /* open the target (write) node, backing will be attached by block layer */ >> + bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds, >> + BDRV_CHILD_DATA | BDRV_CHILD_METADATA, false, >> + &local_err); >> + if (local_err) { >> + ret = -EINVAL; >> + error_propagate(errp, local_err); >> + goto fail; >> + } >> + >> + track_refresh_limits(bs, errp); >> + uint64_t gran = bs->bl.request_alignment; >> + s->bitmap = bdrv_create_dirty_bitmap(bs->file->bs, gran, NULL, &local_err); >> + if (local_err) { >> + ret = -EIO; >> + error_propagate(errp, local_err); >> + goto fail; >> + } >> + >> + s->dropping = false; >> + >> +fail: >> + if (ret < 0) { >> + bdrv_unref_child(bs, bs->file); >> + if (s->bitmap) { >> + bdrv_release_dirty_bitmap(s->bitmap); >> + } >> + } >> + qemu_opts_del(opts); >> + return ret; >> +} >> + >> +static void track_close(BlockDriverState *bs) >> +{ >> + BDRVAllocTrackState *s = bs->opaque; >> + if (s->bitmap) { >> + bdrv_release_dirty_bitmap(s->bitmap); >> + } >> +} >> + >> +static int64_t track_getlength(BlockDriverState *bs) >> +{ >> + return bdrv_getlength(bs->file->bs); >> +} >> + >> +static int coroutine_fn track_co_preadv(BlockDriverState *bs, >> + uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) >> +{ >> + BDRVAllocTrackState *s = bs->opaque; >> + QEMUIOVector local_qiov; >> + int ret; >> + >> + /* 'cur_offset' is relative to 'offset', 'local_offset' to image start */ >> + uint64_t cur_offset, local_offset; >> + int64_t local_bytes; >> + bool alloc; >> + >> + /* a read request can span multiple granularity-sized chunks, and can thus >> + * contain blocks with different allocation status - we could just iterate >> + * granularity-wise, but for better performance use bdrv_dirty_bitmap_next_X >> + * to find the next flip and consider everything up to that in one go */ >> + for (cur_offset = 0; cur_offset < bytes; cur_offset += local_bytes) { >> + local_offset = offset + cur_offset; >> + alloc = bdrv_dirty_bitmap_get(s->bitmap, local_offset); >> + if (alloc) { >> + local_bytes = bdrv_dirty_bitmap_next_zero(s->bitmap, local_offset, >> + bytes - cur_offset); >> + } else { >> + local_bytes = bdrv_dirty_bitmap_next_dirty(s->bitmap, local_offset, >> + bytes - cur_offset); >> + } >> + >> + /* _bitmap_next_X return is -1 if no end found within limit, otherwise >> + * offset of next flip (to start of image) */ >> + local_bytes = local_bytes < 0 ? >> + bytes - cur_offset : >> + local_bytes - local_offset; >> + >> + qemu_iovec_init_slice(&local_qiov, qiov, cur_offset, local_bytes); >> + >> + if (alloc) { >> + ret = bdrv_co_preadv(bs->file, local_offset, local_bytes, >> + &local_qiov, flags); >> + } else if (bs->backing) { >> + ret = bdrv_co_preadv(bs->backing, local_offset, local_bytes, >> + &local_qiov, flags); >> + } else { > > Here you handle the case where there's no backing file, but other > functions (eg. track_co_block_status) simply assume that `bs->backing` > is always valid. > > Is this intentional? > Not intentional, but the only place where bs->backing is assumed is in track_co_block_status. Fix for this is simple enough, since according to the comment in 'include/block/block_int.h' when you return 0, the block layer automatically checks if a backing image is attached and assumes zero otherwise. So basically just to avoid a potential NULL deref: ----------------- diff --git a/block/alloc-track.c b/block/alloc-track.c index f2a4080579..b1a41be05a 100644 --- a/block/alloc-track.c +++ b/block/alloc-track.c @@ -222,10 +222,10 @@ static int coroutine_fn track_co_block_status(BlockDriverState *bs, if (alloc) { *file = bs->file->bs; return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID; - } else { + } else if (bs->backing) { *file = bs->backing->bs; - return 0; } + return 0; } static void track_child_perm(BlockDriverState *bs, BdrvChild *c, ----------------- Yeah? > It makes me also wonder - since you just zero out the iov below - if > this driver could maybe also replace the current zeroinit filter? > Otherwise it's at least a good reminder that the zeroinit code could > start making use of bitmaps as well ;-) > But zeroinit operates on writes, not reads? Not sure how this driver would help... >> + ret = qemu_iovec_memset(&local_qiov, cur_offset, 0, local_bytes); >> + } >> + >> + if (ret != 0) { >> + break; >> + } >> + } >> + >> + return ret; >> +} >> + >> +static int coroutine_fn track_co_pwritev(BlockDriverState *bs, >> + uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) >> +{ >> + return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags); >> +} >> + >> +static int coroutine_fn track_co_pwrite_zeroes(BlockDriverState *bs, >> + int64_t offset, int count, BdrvRequestFlags flags) >> +{ >> + return bdrv_pwrite_zeroes(bs->file, offset, count, flags); >> +} >> + >> +static int coroutine_fn track_co_pdiscard(BlockDriverState *bs, >> + int64_t offset, int count) >> +{ >> + return bdrv_co_pdiscard(bs->file, offset, count); >> +} >> + >> +static coroutine_fn int track_co_flush(BlockDriverState *bs) >> +{ >> + return bdrv_co_flush(bs->file->bs); >> +} >> + >> +static int coroutine_fn track_co_block_status(BlockDriverState *bs, >> + bool want_zero, >> + int64_t offset, >> + int64_t bytes, >> + int64_t *pnum, >> + int64_t *map, >> + BlockDriverState **file) >> +{ >> + BDRVAllocTrackState *s = bs->opaque; >> + >> + bool alloc = bdrv_dirty_bitmap_get(s->bitmap, offset); >> + int64_t next_flipped; >> + if (alloc) { >> + next_flipped = bdrv_dirty_bitmap_next_zero(s->bitmap, offset, bytes); >> + } else { >> + next_flipped = bdrv_dirty_bitmap_next_dirty(s->bitmap, offset, bytes); >> + } >> + >> + /* in case not the entire region has the same state, we need to set pnum to >> + * indicate for how many bytes our result is valid */ >> + *pnum = next_flipped == -1 ? bytes : next_flipped - offset; >> + *map = offset; >> + >> + if (alloc) { >> + *file = bs->file->bs; >> + return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID; >> + } else { >> + *file = bs->backing->bs; >> + return 0; >> + } >> +} >> + >> +static void track_child_perm(BlockDriverState *bs, BdrvChild *c, >> + BdrvChildRole role, BlockReopenQueue *reopen_queue, >> + uint64_t perm, uint64_t shared, >> + uint64_t *nperm, uint64_t *nshared) >> +{ >> + BDRVAllocTrackState *s = bs->opaque; >> + >> + *nshared = BLK_PERM_ALL; >> + >> + /* in case we're currently dropping ourselves, claim to not use any >> + * permissions at all - which is fine, since from this point on we will >> + * never issue a read or write anymore */ >> + if (s->dropping) { >> + *nperm = 0; >> + return; >> + } >> + >> + if (role & BDRV_CHILD_DATA) { >> + *nperm = perm & DEFAULT_PERM_PASSTHROUGH; >> + } else { >> + /* 'backing' is also a child of our BDS, but we don't expect it to be >> + * writeable, so we only forward 'consistent read' */ >> + *nperm = perm & BLK_PERM_CONSISTENT_READ; >> + } >> +} >> + >> +static void track_drop(void *opaque) >> +{ >> + BlockDriverState *bs = (BlockDriverState*)opaque; >> + BDRVAllocTrackState *s = bs->opaque; >> + >> + bdrv_drained_begin(bs); >> + >> + s->dropping = true; >> + >> + bdrv_child_refresh_perms(bs, bs->file, &error_abort); >> + bdrv_replace_node(bs, bs->file->bs, &error_abort); >> + bdrv_set_backing_hd(bs, NULL, &error_abort); >> + >> + bdrv_drained_end(bs); >> +} >> + >> +static int track_change_backing_file(BlockDriverState *bs, >> + const char *backing_file, >> + const char *backing_fmt) >> +{ >> + if (backing_file == NULL && backing_fmt == NULL) { >> + /* backing file has been disconnected, there's no longer any use for >> + * this node, so let's remove ourselves from the block graph - we need >> + * to schedule this for later however, since when this function is >> + * called, the blockjob modifying us is probably not done yet and has a >> + * blocker on 'bs' */ >> + aio_bh_schedule_oneshot(qemu_get_aio_context(), track_drop, (void*)bs); >> + } >> + >> + return 0; >> +} >> + >> +static BlockDriver bdrv_alloc_track = { >> + .format_name = "alloc-track", >> + .instance_size = sizeof(BDRVAllocTrackState), >> + >> + .bdrv_file_open = track_open, >> + .bdrv_close = track_close, >> + .bdrv_getlength = track_getlength, >> + .bdrv_child_perm = track_child_perm, >> + .bdrv_refresh_limits = track_refresh_limits, >> + >> + .bdrv_co_pwrite_zeroes = track_co_pwrite_zeroes, >> + .bdrv_co_pwritev = track_co_pwritev, >> + .bdrv_co_preadv = track_co_preadv, >> + .bdrv_co_pdiscard = track_co_pdiscard, >> + >> + .bdrv_co_flush = track_co_flush, >> + .bdrv_co_flush_to_disk = track_co_flush, >> + >> + .is_filter = true, >> + .supports_backing = true, >> + >> + .bdrv_co_block_status = track_co_block_status, >> + .bdrv_change_backing_file = track_change_backing_file, >> +}; >> + >> +static void bdrv_alloc_track_init(void) >> +{ >> + bdrv_register(&bdrv_alloc_track); >> +} >> + >> +block_init(bdrv_alloc_track_init); >> -- >> 2.20.1