Message-ID: <d5e11d01-f54e-4dd9-b1c0-a02077a0c65f@proxmox.com>
Date: Mon, 25 Nov 2024 16:06:44 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
To: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
 "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>
References: <mailman.249.1720016708.331.pve-devel@lists.proxmox.com>
 <b7946e0f-bb1b-41ad-a21f-7aac10456e92@proxmox.com>
 <98cdc246d14fdfc5dcfedf09dd4bc596acb0814f.camel@groupe-cyllene.com>
Content-Language: en-US
From: Fiona Ebner <f.ebner@proxmox.com>
In-Reply-To: <98cdc246d14fdfc5dcfedf09dd4bc596acb0814f.camel@groupe-cyllene.com>
Subject: Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster
 allocation by default

On 14.11.24 at 09:31, DERUMIER, Alexandre wrote:
> Hi Fiona,
> 
> I'm really sorry, I didn't see your response, it got lost in the flood
> of emails :(
> 
> 
>>> How does read performance compare for you (with 128 KiB cluster
>>> size)?
> 
>>> I don't see any noticeable difference in my testing with an ext4
>>> directory storage on an SSD, attaching the qcow2 images as SCSI disks
>>> to
>>> the VM, neither for reading nor writing. I only tested without your
>>> change and with your change using 4k (rand)read and (rand)write.
>>>
>>> I'm not sure we should enable this for everybody, there's always a
>>> risk
>>> to break stuff with added complexity. Maybe it's better to have a
>>> storage configuration option that people can opt-in to, e.g.
>>>
>>> qcow2-create-opts extended_l2=on,cluster_size=128k
>>>
>>> If we get enough positive feedback, we can still change the default
>>> in a
>>> future (major) release.
> 
> What disk size did you use for your benchmark?
> It's really important, because generally, the bigger the image, the
> slower reads in qcow2 will be, because the L2 metadata needs to be
> cached in memory (and QEMU has a 1MB cache by default).
> 

I don't recall the exact size. Today I retested with a 900 GiB image.
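
For reference, the images were created roughly like this (paths are just
placeholders, meant to illustrate the two option sets rather than the
exact invocation):

> qemu-img create -f qcow2 -o preallocation=metadata,extended_l2=on,cluster_size=128k /mnt/pve/ext4-dir/images/100/vm-100-disk-0.qcow2 900G
> qemu-img create -f qcow2 -o preallocation=metadata /mnt/pve/ext4-dir/images/100/vm-100-disk-1.qcow2 900G

The second one corresponds to the current defaults, i.e. 64 KiB clusters
without extended L2.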

> Without subclusters, for a 1TB image, it's around 128MB of L2 metadata
> in the qcow2 file.
> 
> For writes, it mostly matters for the first block allocation, so it
> depends on the filesystem behind it and whether fallocate is supported.
> (I have seen a really big difference on shared ocfs2/gfs2 filesystems)
> 

Okay, I see.
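
For reference, that figure is plausible: with the default 64 KiB clusters,
each (8-byte) L2 entry covers one cluster, so a 1 TiB image needs about

  1 TiB / 64 KiB * 8 B = 16M * 8 B = 128 MiB

of L2 tables, of which the default L2 cache only covers a fraction.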

Mine are backed by an ext4 directory storage. With
'extended_l2=on,cluster_size=128k' I actually get much slower results
for the initial allocations (note that this is with
preallocation=metadata; without preallocation, I still see "without"
being about 1.5 times faster, i.e. 375 MB written versus 255 MB, but the
usage on the underlying storage is much(!) worse, i.e. 5.71 GiB versus
373 MiB):

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct=1 --bs=4k --rw=randwrite --numjobs=1 --group_reporting --time_based --runtime=60

With options:

> WRITE: bw=5603KiB/s (5737kB/s), 5603KiB/s-5603KiB/s (5737kB/s-5737kB/s), io=328MiB (344MB), run=60001-60001msec

Without:

> WRITE: bw=27.1MiB/s (28.4MB/s), 27.1MiB/s-27.1MiB/s (28.4MB/s-28.4MB/s), io=1626MiB (1705MB), run=60001-60001msec

For a subsequent randrw:

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct=1 --bs=4k --rw=randrw --numjobs=1 --group_reporting --time_based --runtime=60

With options:

>    READ: bw=7071KiB/s (7241kB/s), 7071KiB/s-7071KiB/s (7241kB/s-7241kB/s), io=414MiB (434MB), run=60001-60001msec
>   WRITE: bw=7058KiB/s (7227kB/s), 7058KiB/s-7058KiB/s (7227kB/s-7227kB/s), io=414MiB (434MB), run=60001-60001msec

Without:

>    READ: bw=11.8MiB/s (12.4MB/s), 11.8MiB/s-11.8MiB/s (12.4MB/s-12.4MB/s), io=708MiB (742MB), run=60001-60001msec
>   WRITE: bw=11.8MiB/s (12.3MB/s), 11.8MiB/s-11.8MiB/s (12.3MB/s-12.3MB/s), io=707MiB (741MB), run=60001-60001msec

Took a snapshot and afterwards I get:

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct=1 --bs=4k --rw=randrw --numjobs=1 --group_reporting --time_based --runtime=60

With options:

>    READ: bw=1250KiB/s (1280kB/s), 1250KiB/s-1250KiB/s (1280kB/s-1280kB/s), io=73.3MiB (76.8MB), run=60003-60003msec
>   WRITE: bw=1251KiB/s (1281kB/s), 1251KiB/s-1251KiB/s (1281kB/s-1281kB/s), io=73.3MiB (76.9MB), run=60003-60003msec

Without:

>    READ: bw=1198KiB/s (1227kB/s), 1198KiB/s-1198KiB/s (1227kB/s-1227kB/s), io=70.2MiB (73.6MB), run=60002-60002msec
>   WRITE: bw=1201KiB/s (1230kB/s), 1201KiB/s-1201KiB/s (1230kB/s-1230kB/s), io=70.4MiB (73.8MB), run=60002-60002msec

But there is a big difference in the read performance, this time in
favor of "with options" (though less is allocated on the image, as a
consequence of the previous tests):

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct=1 --bs=4k --rw=randread --numjobs=1 --group_reporting --time_based --runtime=60

With options:

> READ: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=1871MiB (1962MB), run=60001-60001msec

Without:

> READ: bw=19.2MiB/s (20.1MB/s), 19.2MiB/s-19.2MiB/s (20.1MB/s-20.1MB/s), io=1151MiB (1207MB), run=60001-60001msec
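
To summarize the runs above ("with" = extended_l2=on,cluster_size=128k,
"without" = current defaults, both preallocation=metadata on ext4):

                                  with options    without
  randwrite (initial allocation)   5603 KiB/s     27.1 MiB/s
  randrw (second run)             ~7060 KiB/s     11.8 MiB/s
  randrw (after snapshot)         ~1250 KiB/s    ~1200 KiB/s
  randread (last run)              31.2 MiB/s     19.2 MiB/s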


> Also, for writes, this helps a lot with backing files (so for linked
> qcow2 clones, and maybe in the future for external snapshots), because
> currently, if you write 4k to an overlay, you need to read the full 64k
> cluster from the base image and rewrite it (so 8x write amplification).
> 

For clones, we should add the options in clone_image() too ;)
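
Something along these lines, i.e. passing the same creation options when
the overlay for the linked clone is created (file names here are only for
illustration):

> qemu-img create -f qcow2 -F qcow2 -b ../100/base-100-disk-0.qcow2 -o extended_l2=on,cluster_size=128k vm-101-disk-0.qcow2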

> 
> There is good information from the developer here:
> https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
> 
> 
> I really don't think it hurts, but maybe only make it the default with
> the 9.0 release, to be safe. (I'll try to have the external snapshot
> support ready for that date too)
> 

It seems to hurt at least in some cases (initial allocation speed can be
worse than without the option, in particular when using
preallocation=metadata). If we have it as an opt-in already, users can
give it a try, and then we can still think about making it the default
for 9.0, should there be enough positive feedback.
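
To illustrate, the opt-in could look something like this in storage.cfg
(re-using the option name from my earlier mail, which is of course not
final):

dir: ext4-dir
	path /mnt/pve/ext4-dir
	content images
	qcow2-create-opts extended_l2=on,cluster_size=128k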

We could also think about only using it for linked clones by default
initially.

Independently, you can make it the default for LvmQcow2Plugin of course,
since you see much better results for that use case :)

