From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 20 Jul 2020 20:30:39 +0200
From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: Aaron Lauterer <a.lauterer@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <20200720203039.2f1f0b74@rosa.proxmox.com>
In-Reply-To: <20200717121232.29020-1-a.lauterer@proxmox.com>
References: <20200717121232.29020-1-a.lauterer@proxmox.com>
Subject: Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels

Thanks for picking this up! Looking forward to no longer having to
search the web/our forum for the good answers to questions that come up
quite often.

A few mostly stylistic comments (as in: more a matter of my taste)
inline:

On Fri, 17 Jul 2020 14:12:32 +0200
Aaron Lauterer <a.lauterer@proxmox.com> wrote:

> This new section explains the performance and failure properties of
> mirror and RAIDZ VDEVs, as well as the "unexpected" higher space usage
> by ZVOLs on a RAIDZ.
>
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
>
> This is a first draft to explain the performance characteristics of
> the different RAID levels / VDEV types, as well as their failure
> behavior.
>
> Additionally, it explains why a VM disk (ZVOL) can end up using quite
> a bit more space than expected when placed on a pool made of RAIDZ
> VDEVs.
>
> The motivation behind this is that these things came up quite a bit in
> the recent past. Thus it would be nice to have some documentation that
> we can link to, and having it in the docs might help users make an
> informed decision from the start.
>
> I hope I did not mess up any technical details and that it is
> understandable enough.
>
>  local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 96 insertions(+)
>
> diff --git a/local-zfs.adoc b/local-zfs.adoc
> index fd03e89..48f6540 100644
> --- a/local-zfs.adoc
> +++ b/local-zfs.adoc
> @@ -151,6 +151,102 @@ rpool/swap 4.25G 7.69T 64K -
>  ----
>
>
> +[[sysadmin_zfs_raid_considerations]]
> +ZFS RAID Level Considerations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There are a few factors to take into consideration when choosing the layout of
> +a ZFS pool.
> +
> +
> +[[sysadmin_zfs_raid_performance]]
> +Performance
> +^^^^^^^^^^^
> +
> +Different types of VDEVs have different performance behaviors. The two

we have a few mentions of vdev (written without caps in
system-booting.adoc) - for consistency, either write it in lowercase
here as well or change the system-booting part.

as for the content - a short explanation of what a vdev is might be
helpful, as well as mentioning that all top-level vdevs in a pool are
striped (as in RAID0) together

> +parameters of interest are the IOPS (Input/Output Operations per Second) and the
> +bandwidth with which data can be written or read.
> +
> +A 'mirror' VDEV will approximately behave like a single disk in regards to both
> +parameters when writing data. When reading data it will behave like the number
> +of disks in the mirror.

in the section above (and in the installer and disk management GUI) we
talk about RAIDX - maybe refer to this, at least in parentheses:
A 'mirror' VDEV (RAID1) ...

> +
> +A common situation is to have 4 disks. When setting it up as 2 mirror VDEVs the

here the same with RAID10

> +pool will have the write characteristics as two single disks in regard to IOPS
> +and bandwidth. For read operations it will resemble 4 single disks.
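
a concrete command might also help to make the layout explicit here,
something like (pool and disk names are just placeholders, untested):

# zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

i.e. two mirror vdevs striped together (RAID10) - writes are spread
over the two mirrors, reads can be served by all four disks.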
> +
> +A 'RAIDZ' of any redundancy level will approximately behave like a single disk
> +in regard to IOPS, with a lot of bandwidth. How much bandwidth depends on the
> +size of the RAIDZ VDEV and the redundancy level.
> +
> +For running VMs, IOPS is the more important metric in most situations.
> +
> +
> +[[sysadmin_zfs_raid_size_space_usage_redundancy]]
> +Size, Space usage and Redundancy
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +While a pool made of 'mirror' VDEVs will have the best performance
> +characteristics, the usable space will be 50% of the disks available. Less if a
> +mirror VDEV consists of more than 2 disks. To stay functional it needs to have

maybe add "(e.g. a 3-way mirror)" after "2 disks"

s/To stay functional it needs to have at least one disk per mirror VDEV
available/At least one healthy disk per mirror is needed for the pool
to work/ ?

> +at least one disk per mirror VDEV available. The pool will fail once all disks
> +in a mirror VDEV fail.

maybe drop the last sentence

> +
> +When using a pool made of 'RAIDZ' VDEVs the usable space to disk ratio will be
> +better in most situations than using mirror VDEVs. This is especially true when

Why not actively describe the usable space:
The usable space of a 'RAIDZ' type VDEV of N disks is roughly N-X, with
X being the RAIDZ-level. The RAIDZ-level also indicates how many
arbitrary disks can fail without losing data.
(and drop the redundancy sentence below)

> +using a large number of disks. A special case is a 4 disk pool with RAIDZ2. In
> +this situation it is usually better to use 2 mirror VDEVs for the better
> +performance as the usable space will be the same. In a RAIDZ VDEV, any drive
> +can fail and it will stay operational. The number of sustainable drive failures
> +is defined by the redundancy level: a RAIDZ1 can survive the loss of 1 disk,
> +consequently a RAIDZ2 the loss of 2, and a RAIDZ3 the loss of 3 disks.
> +
> +Another important factor when using any RAIDZ level is how ZVOL datasets, which
> +are used for VM disks, behave. For each data block the pool needs parity data
> +which are at least the size of the minimum block size defined by the `ashift`

which _is_ at least the size of?

> +value of the pool. With an ashift of 12 the block size of the pool is 4k. The
> +default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
> +written will cause two additional 4k parity blocks to be written,
> +8k + 4k + 4k = 16k. This is of course a simplified approach and the real
> +situation will be slightly different with metadata, compression and such not
> +being accounted for in this example.
> +
> +This behavior can be observed when checking the following properties of the
> +ZVOL:
> +
> + * `volsize`
> + * `refreservation` (if the pool is not thin provisioned)
> + * `used` (if the pool is thin provisioned and without snapshots present)
> +
> +----
> +# zfs get volsize,refreservation,used /<pool>/vm-<vmid>-disk-X
> +----

the '/' in the beginning should be dropped
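
showing some example output here could also help readers know what to
expect - the numbers below are made up for a 32G ZVOL on a RAIDZ2 pool
with the default 8k volblocksize (roughly 2x space usage, as per the
parity explanation above):

# zfs get volsize,refreservation,used rpool/data/vm-100-disk-0
NAME                      PROPERTY        VALUE  SOURCE
rpool/data/vm-100-disk-0  volsize         32G    local
rpool/data/vm-100-disk-0  refreservation  66.0G  local
rpool/data/vm-100-disk-0  used            66.0G  -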
> +
> +`volsize` is the size of the disk as it is presented to the VM, while
> +`refreservation` shows the reserved space on the pool, which includes the
> +expected space needed for the parity data. If the pool is thin provisioned, the
> +`refreservation` will be set to 0. Another way to observe the behavior is to
> +compare the used disk space within the VM and the `used` property. Be aware
> +that snapshots will skew the value.
> +
> +To counter this effect there are a few options.

s/this effect/the increased use of space/

> +
> +* Increase the `volblocksize` to improve the data to parity ratio
> +* Use 'mirror' VDEVs instead of 'RAIDZ'
> +* Use `ashift=9` (block size of 512 bytes)
> +
> +The `volblocksize` property can only be set when creating a ZVOL. The default
> +value can be changed in the storage configuration. When doing this, the guest
> +needs to be tuned accordingly and, depending on the use case, the problem of
> +write amplification is just moved from the ZFS layer up to the guest.
> +
> +A pool made of 'mirror' VDEVs has a different usable space and failure behavior
> +than a 'RAIDZ' pool.

This is already explained above? Maybe add a short recommendation
instead - 'RAID10 has favorable behavior for VM workloads - use RAID10,
unless your environment has specific needs and characteristics where
the RAIDZ performance characteristics are acceptable'?

> +
> +Using `ashift=9` when creating the pool can lead to bad
> +performance, depending on the disks underneath, and cannot be changed later on.
> +
> +
>  Bootloader
>  ~~~~~~~~~~
>
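One more thought: for the paragraph about changing the default
`volblocksize` in the storage configuration, a short example might
round it off, along the lines of (storage name and values are just
placeholders):

zfspool: local-zfs
        pool rpool/data
        blocksize 16k
        content rootdir,images
        sparse

newly created ZVOLs on that storage would then get volblocksize=16k,
while existing ones keep the value they were created with.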