* [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

From: Alexandre Derumier via pve-devel @ 2024-07-03 14:24 UTC
To: pve-devel; +Cc: Alexandre Derumier

From: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH pve-storage] qcow2 format: enable subcluster allocation by default
Date: Wed, 3 Jul 2024 16:24:47 +0200
Message-ID: <20240703142447.602210-1-alexandre.derumier@groupe-cyllene.com>

extended_l2 is an optimisation to reduce write amplification.
Currently,without it, when a vm write 4k, a full 64k cluster
need to be writen.

When enabled, the cluster is splitted in 32 subclusters.

We use a 128k cluster by default, to have 32 * 4k subclusters

https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
https://static.sched.com/hosted_files/kvmforum2020/d9/qcow2-subcluster-allocation.pdf

some stats for 4k randwrite benchmark

Cluster size    Without subclusters    With subclusters
16 KB           5859 IOPS              8063 IOPS
32 KB           5674 IOPS              11107 IOPS
64 KB           2527 IOPS              12731 IOPS
128 KB          1576 IOPS              11808 IOPS
256 KB          976 IOPS               9195 IOPS
512 KB          510 IOPS               7079 IOPS
1 MB            448 IOPS               3306 IOPS
2 MB            262 IOPS               2269 IOPS

Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
---
 src/PVE/Storage/Plugin.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
index 6444390..31b20fe 100644
--- a/src/PVE/Storage/Plugin.pm
+++ b/src/PVE/Storage/Plugin.pm
@@ -561,7 +561,7 @@ sub preallocation_cmd_option {
         die "preallocation mode '$prealloc' not supported by format '$fmt'\n"
             if !$QCOW2_PREALLOCATION->{$prealloc};
 
-        return "preallocation=$prealloc";
+        return "preallocation=$prealloc,extended_l2=on,cluster_size=128k";
     } elsif ($fmt eq 'raw') {
         $prealloc = $prealloc // 'off';
         $prealloc = 'off' if $prealloc eq 'metadata';
-- 
2.39.2

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
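
The cluster and subcluster sizes named in the commit message can be checked with a few lines of arithmetic. This is only an illustrative sketch in Python (the patch itself is Perl); the function names are made up for this note, and the only facts taken from the message are the 32 subclusters per cluster and the 64k and 128k cluster sizes.

```python
KIB = 1024

# With extended_l2=on, qcow2 divides every cluster into 32 subclusters.
SUBCLUSTERS_PER_CLUSTER = 32


def subcluster_size(cluster_size: int) -> int:
    """Size in bytes of one subcluster for a given cluster size."""
    return cluster_size // SUBCLUSTERS_PER_CLUSTER


def allocation_unit(cluster_size: int, extended_l2: bool) -> int:
    """Smallest unit that qcow2 allocates for a write to an unallocated area."""
    return subcluster_size(cluster_size) if extended_l2 else cluster_size


if __name__ == "__main__":
    # Hypothetical comparison: 64k clusters without subclusters versus
    # the 128k clusters with subclusters that the patch proposes.
    for cluster_kib, extended_l2 in [(64, False), (128, True)]:
        unit = allocation_unit(cluster_kib * KIB, extended_l2)
        print(f"cluster={cluster_kib}k extended_l2={extended_l2}: "
              f"allocation unit = {unit // KIB}k")
```

With 128k clusters the allocation unit matches a 4k guest write, which is the reason the patch pairs extended_l2=on with cluster_size=128k.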

* Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

From: Fiona Ebner @ 2024-09-11 11:44 UTC
To: Proxmox VE development discussion

On 03.07.24 at 16:24, Alexandre Derumier via pve-devel wrote:
> extended_l2 is an optimisation to reduce write amplification.
> Currently,without it, when a vm write 4k, a full 64k cluster

s/write/writes/

> need to be writen.

needs to be written.

>
> When enabled, the cluster is splitted in 32 subclusters.

s/splitted/split/

>
> We use a 128k cluster by default, to have 32 * 4k subclusters
>
> https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
> https://static.sched.com/hosted_files/kvmforum2020/d9/qcow2-subcluster-allocation.pdf
>
> some stats for 4k randwrite benchmark

Can you please share the exact command you used? What kind of underlying
disks do you have?

>
> Cluster size    Without subclusters    With subclusters
> 16 KB           5859 IOPS              8063 IOPS
> 32 KB           5674 IOPS              11107 IOPS
> 64 KB           2527 IOPS              12731 IOPS
> 128 KB          1576 IOPS              11808 IOPS
> 256 KB          976 IOPS               9195 IOPS
> 512 KB          510 IOPS               7079 IOPS
> 1 MB            448 IOPS               3306 IOPS
> 2 MB            262 IOPS               2269 IOPS
>

How does read performance compare for you (with 128 KiB cluster size)?

I don't see any noticeable difference in my testing with an ext4
directory storage on an SSD, attaching the qcow2 images as SCSI disks to
the VM, neither for reading nor writing. I only tested without your
change and with your change using 4k (rand)read and (rand)write.

I'm not sure we should enable this for everybody, there's always a risk
to break stuff with added complexity. Maybe it's better to have a
storage configuration option that people can opt-in to, e.g.

qcow2-create-opts extended_l2=on,cluster_size=128k

If we get enough positive feedback, we can still change the default in a
future (major) release.

> Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
> ---
>  src/PVE/Storage/Plugin.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6444390..31b20fe 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -561,7 +561,7 @@ sub preallocation_cmd_option {
>          die "preallocation mode '$prealloc' not supported by format '$fmt'\n"
>              if !$QCOW2_PREALLOCATION->{$prealloc};
>
> -        return "preallocation=$prealloc";
> +        return "preallocation=$prealloc,extended_l2=on,cluster_size=128k";

Also, it doesn't really fit here in the preallocation helper as the
helper is specific to that setting.

>      } elsif ($fmt eq 'raw') {
>          $prealloc = $prealloc // 'off';
>          $prealloc = 'off' if $prealloc eq 'metadata';
> --
> 2.39.2
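
One way to picture the opt-in suggestion is a helper that keeps the preallocation setting separate and appends extra creation options only when the storage configuration supplies them. The sketch below is Python rather than the Perl used in Plugin.pm, and everything in it is hypothetical: the function name, its parameters, and the idea that a qcow2-create-opts value would be passed through as a comma-separated string are assumptions made for illustration, not existing pve-storage code.

```python
from typing import Optional


def qcow2_create_options(prealloc: str, extra_opts: Optional[str] = None) -> str:
    """Build a comma-separated creation option string for a qcow2 image.

    prealloc:   preallocation mode, e.g. "off" or "metadata".
    extra_opts: hypothetical opt-in value from the storage configuration,
                e.g. "extended_l2=on,cluster_size=128k"; None keeps the
                current behaviour.
    """
    opts = [f"preallocation={prealloc}"]
    if extra_opts:
        # Ignore empty fragments such as a trailing comma.
        opts.extend(o.strip() for o in extra_opts.split(",") if o.strip())
    return ",".join(opts)


if __name__ == "__main__":
    print(qcow2_create_options("metadata"))
    print(qcow2_create_options("metadata", "extended_l2=on,cluster_size=128k"))
```

Keeping the extra options out of the preallocation helper also addresses the review remark that the helper is specific to the preallocation setting.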

* Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

From: DERUMIER, Alexandre via pve-devel @ 2024-11-14 8:31 UTC
To: pve-devel, f.ebner; +Cc: DERUMIER, Alexandre

From: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
To: "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>, "f.ebner@proxmox.com" <f.ebner@proxmox.com>
Subject: Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default
Date: Thu, 14 Nov 2024 08:31:06 +0000
Message-ID: <98cdc246d14fdfc5dcfedf09dd4bc596acb0814f.camel@groupe-cyllene.com>

Hi Fiona,

I'm really sorry, I didn't see your response, it got lost in the flood
of email :(

>>How does read performance compare for you (with 128 KiB cluster size)?

>>I don't see any noticeable difference in my testing with an ext4
>>directory storage on an SSD, attaching the qcow2 images as SCSI disks
>>to the VM, neither for reading nor writing. I only tested without your
>>change and with your change using 4k (rand)read and (rand)write.
>>
>>I'm not sure we should enable this for everybody, there's always a
>>risk to break stuff with added complexity. Maybe it's better to have a
>>storage configuration option that people can opt-in to, e.g.
>>
>>qcow2-create-opts extended_l2=on,cluster_size=128k
>>
>>If we get enough positive feedback, we can still change the default
>>in a future (major) release.

What disk size do you use for your benchmark?
It's really important, because generally, the bigger the image, the
slower reads in qcow2 will be, because the L2 metadata needs to be
cached in memory (and QEMU has a 1MB cache by default).

Without subclusters, for a 1TB image, there is around 128MB of L2
metadata in the qcow2 file.

For writes, it mostly matters for the first block allocation, so it
depends on the underlying filesystem and on whether fallocate is
supported.
(I have seen a really big difference on shared ocfs2/gfs2 filesystems)

Also, for writes, this helps a lot with a backing file (so for linked
qcow2 clones, and maybe in the future for external snapshots).
Because currently, if you write 4k on an overlay, you need to read the
full 64k cluster from the base image and rewrite it. (so 8x
write amplification)

There is good information from the developer:
https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/

I really don't think it hurts, but maybe make it the default for the
9.0 release to be safe. (I'll try to have the external snapshots ready
for that date too)

Regards,

Alexandre

> Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
> ---
>  src/PVE/Storage/Plugin.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/PVE/Storage/Plugin.pm b/src/PVE/Storage/Plugin.pm
> index 6444390..31b20fe 100644
> --- a/src/PVE/Storage/Plugin.pm
> +++ b/src/PVE/Storage/Plugin.pm
> @@ -561,7 +561,7 @@ sub preallocation_cmd_option {
>          die "preallocation mode '$prealloc' not supported by format '$fmt'\n"
>              if !$QCOW2_PREALLOCATION->{$prealloc};
>
> -        return "preallocation=$prealloc";
> +        return "preallocation=$prealloc,extended_l2=on,cluster_size=128k";

>>Also, it doesn't really fit here in the preallocation helper as the
>>helper is specific to that setting.

>      } elsif ($fmt eq 'raw') {
>          $prealloc = $prealloc // 'off';
>          $prealloc = 'off' if $prealloc eq 'metadata';
> --
> 2.39.2
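
The "around 128MB for a 1TB image" figure can be reproduced from the qcow2 on-disk layout, where a standard L2 entry takes 8 bytes per cluster and an extended L2 entry takes 16 bytes. The following Python sketch is only a back-of-the-envelope check; the helper names are invented for this note, and the 1 MiB cache size is simply the figure quoted in the mail, not something verified here.

```python
KIB, MIB, GIB, TIB = 1024, 1024**2, 1024**3, 1024**4


def l2_entry_size(extended_l2: bool) -> int:
    """Bytes per L2 entry: 8 for standard entries, 16 for extended ones."""
    return 16 if extended_l2 else 8


def l2_metadata_bytes(virtual_size: int, cluster_size: int,
                      extended_l2: bool = False) -> int:
    """Total L2 table size needed to map a fully allocated image."""
    clusters = -(-virtual_size // cluster_size)  # ceiling division
    return clusters * l2_entry_size(extended_l2)


def cache_coverage_bytes(cache_size: int, cluster_size: int,
                         extended_l2: bool = False) -> int:
    """Amount of guest data whose L2 entries fit in a cache of cache_size."""
    return (cache_size // l2_entry_size(extended_l2)) * cluster_size


if __name__ == "__main__":
    for cluster_size, ext in [(64 * KIB, False), (128 * KIB, True)]:
        meta = l2_metadata_bytes(1 * TIB, cluster_size, ext)
        covered = cache_coverage_bytes(1 * MIB, cluster_size, ext)
        print(f"cluster={cluster_size // KIB}k extended_l2={ext}: "
              f"L2 metadata = {meta // MIB} MiB, "
              f"1 MiB cache covers {covered // GIB} GiB")
```

Under these assumptions, doubling the cluster size to 128k offsets the doubled entry size, so the total L2 metadata for the same image stays the same with extended_l2=on.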

* Re: [pve-devel] [PATCH pve-storage] qcow2 format: enable subcluster allocation by default

From: Fiona Ebner @ 2024-11-25 15:06 UTC
To: DERUMIER, Alexandre, pve-devel

On 14.11.24 at 09:31, DERUMIER, Alexandre wrote:
> Hi Fiona,
>
> I'm really sorry, I didn't see your response, it got lost in the flood
> of email :(
>
>>> How does read performance compare for you (with 128 KiB cluster size)?
>
>>> I don't see any noticeable difference in my testing with an ext4
>>> directory storage on an SSD, attaching the qcow2 images as SCSI disks
>>> to the VM, neither for reading nor writing. I only tested without your
>>> change and with your change using 4k (rand)read and (rand)write.
>>>
>>> I'm not sure we should enable this for everybody, there's always a
>>> risk to break stuff with added complexity. Maybe it's better to have a
>>> storage configuration option that people can opt-in to, e.g.
>>>
>>> qcow2-create-opts extended_l2=on,cluster_size=128k
>>>
>>> If we get enough positive feedback, we can still change the default
>>> in a future (major) release.
>
> What disk size do you use for your benchmark?
> It's really important, because generally, the bigger the image, the
> slower reads in qcow2 will be, because the L2 metadata needs to be
> cached in memory (and QEMU has a 1MB cache by default).
>

I don't recall the exact size. Today I retested with a 900 GiB image.

> Without subclusters, for a 1TB image, there is around 128MB of L2
> metadata in the qcow2 file.
>
> For writes, it mostly matters for the first block allocation, so it
> depends on the underlying filesystem and on whether fallocate is
> supported.
> (I have seen a really big difference on shared ocfs2/gfs2 filesystems)
>

Okay, I see. Mine are backed by an ext4 storage. With
'extended_l2=on,cluster_size=128k' I actually get much slower results
for the initial allocations (note that this is with
preallocation=metadata - without preallocation, I still see "without"
being about 1.5 times faster, i.e. 375MB written versus 255MB, however
the usage on the underlying storage is much(!) worse, i.e. 5.71 GiB
versus 373 MiB):

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct 1 --bs=4k --rw=randwrite --numjobs=1 --group_reporting --time_based --runtime 60

With options:

> WRITE: bw=5603KiB/s (5737kB/s), 5603KiB/s-5603KiB/s (5737kB/s-5737kB/s), io=328MiB (344MB), run=60001-60001msec

Without:

> WRITE: bw=27.1MiB/s (28.4MB/s), 27.1MiB/s-27.1MiB/s (28.4MB/s-28.4MB/s), io=1626MiB (1705MB), run=60001-60001msec

For a subsequent randrw:

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct 1 --bs=4k --rw=randrw --numjobs=1 --group_reporting --time_based --runtime 60

With options:

> READ: bw=7071KiB/s (7241kB/s), 7071KiB/s-7071KiB/s (7241kB/s-7241kB/s), io=414MiB (434MB), run=60001-60001msec
> WRITE: bw=7058KiB/s (7227kB/s), 7058KiB/s-7058KiB/s (7227kB/s-7227kB/s), io=414MiB (434MB), run=60001-60001msec

Without:

> READ: bw=11.8MiB/s (12.4MB/s), 11.8MiB/s-11.8MiB/s (12.4MB/s-12.4MB/s), io=708MiB (742MB), run=60001-60001msec
> WRITE: bw=11.8MiB/s (12.3MB/s), 11.8MiB/s-11.8MiB/s (12.3MB/s-12.3MB/s), io=707MiB (741MB), run=60001-60001msec

Took a snapshot and afterwards I get:

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct 1 --bs=4k --rw=randrw --numjobs=1 --group_reporting --time_based --runtime 60

With options:

> READ: bw=1250KiB/s (1280kB/s), 1250KiB/s-1250KiB/s (1280kB/s-1280kB/s), io=73.3MiB (76.8MB), run=60003-60003msec
> WRITE: bw=1251KiB/s (1281kB/s), 1251KiB/s-1251KiB/s (1281kB/s-1281kB/s), io=73.3MiB (76.9MB), run=60003-60003msec

Without:

> READ: bw=1198KiB/s (1227kB/s), 1198KiB/s-1198KiB/s (1227kB/s-1227kB/s), io=70.2MiB (73.6MB), run=60002-60002msec
> WRITE: bw=1201KiB/s (1230kB/s), 1201KiB/s-1201KiB/s (1230kB/s-1230kB/s), io=70.4MiB (73.8MB), run=60002-60002msec

But there is a big difference in the read performance, this time in
favor of "with options" (but there is less allocated on the image, as a
consequence of the previous tests):

> fio --name=rw --filename=/dev/sdb --ioengine=libaio --direct 1 --bs=4k --rw=randread --numjobs=1 --group_reporting --time_based --runtime 60

With options:

> READ: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=1871MiB (1962MB), run=60001-60001msec

Without:

> READ: bw=19.2MiB/s (20.1MB/s), 19.2MiB/s-19.2MiB/s (20.1MB/s-20.1MB/s), io=1151MiB (1207MB), run=60001-60001msec

> Also, for writes, this helps a lot with a backing file (so for linked
> qcow2 clones, and maybe in the future for external snapshots).
> Because currently, if you write 4k on an overlay, you need to read the
> full 64k cluster from the base image and rewrite it. (so 8x
> write amplification)
>

For clone, we should add the options in clone_image() too ;)

>
> There is good information from the developer:
> https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
>
>
> I really don't think it hurts, but maybe make it the default for the
> 9.0 release to be safe. (I'll try to have the external snapshots ready
> for that date too)
>

It seems to hurt at least in some cases (initial allocation speed can be
worse than without the option, in particular when using
preallocation=metadata). If we have it as opt-in already, users can give
it a try and then we can still think about making it the default for
9.0, should there be enough positive feedback. We could also think about
only using it for linked clones by default initially.

Independently, you can make it the default for LvmQcow2Plugin of course,
since you see much better results for that use case :)
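
To compare the fio summary lines above without converting units by hand, a small parser is enough. This Python sketch is an illustration only: the helper name and regular expression are assumptions, it handles just the KiB/s, MiB/s and GiB/s forms that appear in this mail, and the two sample lines are abbreviated copies of the initial randwrite results.

```python
import re

# Conversion factors to KiB/s for the binary units seen in the results above.
_UNITS = {"KiB/s": 1.0, "MiB/s": 1024.0, "GiB/s": 1024.0 * 1024.0}
_BW = re.compile(r"bw=([0-9.]+)([KMG]iB/s)")


def bandwidth_kib_s(line: str) -> float:
    """Extract the 'bw=' value of a fio summary line, normalised to KiB/s."""
    match = _BW.search(line)
    if match is None:
        raise ValueError(f"no bandwidth found in: {line!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


if __name__ == "__main__":
    # Initial 4k randwrite run with preallocation=metadata, as reported above.
    with_options = "WRITE: bw=5603KiB/s (5737kB/s), io=328MiB (344MB)"
    without = "WRITE: bw=27.1MiB/s (28.4MB/s), io=1626MiB (1705MB)"

    ratio = bandwidth_kib_s(without) / bandwidth_kib_s(with_options)
    print(f"without options is {ratio:.1f}x faster on the initial randwrite")
```

The same helper can be applied to the randrw and randread lines to see where the two configurations converge after the snapshot and where the options come out ahead.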