Subject: Re: [pve-devel] [RFC] PVE-Backup: create jobs in a drained section
From: Fiona Ebner
To: pve-devel@lists.proxmox.com
Date: Tue, 5 Sep 2023 15:18:55 +0200
Message-ID: <492aab38-d287-2ebd-79e5-54c50ed8504c@proxmox.com>
In-Reply-To: <20230208140746.165320-1-f.ebner@proxmox.com>

On 08.02.23 at 15:07, Fiona Ebner wrote:
> Inserting the copy-before-write node is already protected with a
> drained section, which is why there should be no actual issue right
> now. While that drained section does not extend to the bcs bitmap
> initialization, it should also not be an issue currently, because the
> job is not created from a coroutine (and even if it were, there would
> need to be a yield point in between).

My explanation of why it's currently not required is wrong: there can
be an IO thread that interacts with the bitmap, and that is what
happens (with my reproducer) in the crash reported here [0]. With this
RFC applied, I couldn't reproduce the crash anymore.

Some notes before the sketch of what happens (the code sketch right
after this list shows the pattern the last note refers to):

* Each time a block-copy request is created, it resets the dirty
  bitmap in the corresponding range.
* Thread 1 is still doing the backup setup; cbw_open() and
  backup_init_bcs_bitmap() happen in backup_job_create().
* The check whether a request can be created for a given range relies
  on the dirty bitmap.
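For reference, a heavily condensed sketch of that pattern from QEMU's
block/block-copy.c (block_copy_task_create(); locking, alignment
handling and task setup are omitted, so this is a paraphrase rather
than the verbatim code):

    static BlockCopyTask *block_copy_task_create(BlockCopyState *s,
                                                 int64_t offset,
                                                 int64_t bytes)
    {
        /* Only a dirty range yields a new task, i.e. the dirty bitmap
         * doubles as the "no request exists here yet" check. */
        if (!bdrv_dirty_bitmap_next_dirty_area(s->copy_bitmap, offset,
                                               offset + bytes, bytes,
                                               &offset, &bytes)) {
            return NULL;
        }

        /* Hence, a conflicting request in a dirty range is assumed
         * impossible -- this is the assertion from the crash below. */
        assert(!reqlist_find_conflict(&s->reqs, offset, bytes));

        /* Claim the range: reset it in the bitmap and record the
         * request (via reqlist_init_req()). */
        bdrv_reset_dirty_bitmap(s->copy_bitmap, offset, bytes);
        /* ... allocate the task, add it to s->reqs, return it ... */
    }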
A sketch of what happens:

      Thread 1   bdrv_dirty_bitmap_merge_internal() as part of cbw_open()
    A Thread 3   bdrv_reset_dirty_bitmap(offset=x, bytes=4MiB)
      Thread 1   bdrv_clear_dirty_bitmap() as part of backup_init_bcs_bitmap()
      Thread 1   bdrv_dirty_bitmap_merge_internal() as part of backup_init_bcs_bitmap()
    B Thread 3   bdrv_reset_dirty_bitmap(offset=0, bytes=4MiB)
      Thread 3   bdrv_reset_dirty_bitmap(offset=4MiB, bytes=4MiB)
      ....
    C Thread 3   bdrv_reset_dirty_bitmap(offset=x, bytes=4MiB)

Note that at time B, there can be a mismatch between the bitmap and
the request list, if merging didn't, by chance, set the bits at the
location of request A again. Then, if C happens before A has finished,
an assert will trigger, because there already is a request in the same
range:

> block_copy_task_create: Assertion `!reqlist_find_conflict(&s->reqs, offset, bytes)' failed

Of course, C doesn't need to cover exactly the same range as A, it
only needs to overlap; for me, it did cover the same range.

That said, I also tried reproducing the issue with QEMU 7.2, but
haven't managed to yet. I'll take another look and then re-send this
for QEMU 8.0 with a fixed commit message.

[0]: https://forum.proxmox.com/threads/133149/
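For illustration, a minimal sketch of the direction the RFC's subject
describes (the call site is simplified and partly hypothetical; the
real backup_job_create() takes many more parameters): job creation is
wrapped in one drained section, so an IO thread cannot create
block-copy requests (and reset the bitmap) between cbw_open() and the
bcs bitmap initialization:

    /* Simplified, partly hypothetical call site. Both cbw_open() and
     * backup_init_bcs_bitmap() run from backup_job_create() while bs
     * is drained, so no IO thread can interleave bitmap resets. */
    bdrv_drained_begin(bs);
    job = backup_job_create(job_id, bs, target_bs, /* ... */ &local_err);
    bdrv_drained_end(bs);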