* [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
@ 2020-07-17 12:12 Aaron Lauterer
2020-07-17 13:23 ` Andreas Steinel
2020-07-20 18:30 ` Stoiko Ivanov
0 siblings, 2 replies; 5+ messages in thread
From: Aaron Lauterer @ 2020-07-17 12:12 UTC (permalink / raw)
To: pve-devel
This new section explains the performance and failure properties of
mirror and RAIDZ VDEVs as well as the "unexpected" higher space usage by
ZVOLs on a RAIDZ.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
---
This is a first draft to explain the performance characteristics of the
different RAID levels / VDEV types, as well as their failure behavior.
Additionally it explains the situation why a VM disk (ZVOL) can end up
using quite a bit more space than expected when placed on a pool made of
RAIDZ VDEVs.
The motivation behind this is that, in the recent past, these things
came up quite a bit. Thus, it would be nice to have some documentation
that we can link to, and having it in the docs might help users make
an informed decision from the start.
I hope I did not mess up any technical details and that it is
understandable enough.
local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)
diff --git a/local-zfs.adoc b/local-zfs.adoc
index fd03e89..48f6540 100644
--- a/local-zfs.adoc
+++ b/local-zfs.adoc
@@ -151,6 +151,102 @@ rpool/swap 4.25G 7.69T 64K -
----
+[[sysadmin_zfs_raid_considerations]]
+ZFS RAID Level Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are a few factors to take into consideration when choosing the layout of
+a ZFS pool.
+
+
+[[sysadmin_zfs_raid_performance]]
+Performance
+^^^^^^^^^^^
+
+Different types of VDEVs have different performance behaviors. The two
+parameters of interest are the IOPS (Input/Output Operations per Second) and the
+bandwidth with which data can be written or read.
+
+A 'mirror' VDEV will approximately behave like a single disk with regard to both
+parameters when writing data. When reading data, it will behave like the number
+of disks in the mirror.
+
+A common situation is to have 4 disks. When set up as 2 mirror VDEVs, the
+pool will have the write characteristics of two single disks with regard to IOPS
+and bandwidth. For read operations, it will resemble 4 single disks.
+
+A 'RAIDZ' of any redundancy level will approximately behave like a single disk
+with regard to IOPS, but with a lot of bandwidth. How much bandwidth depends on
+the size of the RAIDZ VDEV and the redundancy level.
+
+For running VMs, IOPS is the more important metric in most situations.
+
+
+[[sysadmin_zfs_raid_size_space_usage_redundancy]]
+Size, Space usage and Redundancy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+While a pool made of 'mirror' VDEVs will have the best performance
+characteristics, the usable space will be 50% of the raw capacity; less if a
+mirror VDEV consists of more than 2 disks (e.g. a 3-way mirror). At least one
+healthy disk per mirror VDEV is needed for the pool to stay functional. The
+pool will fail once all disks in a mirror VDEV fail.
+
+When using a pool made of 'RAIDZ' VDEVs, the usable space to disk ratio will be
+better in most situations than when using mirror VDEVs. This is especially true
+when using a large number of disks. A special case is a 4 disk pool with RAIDZ2.
+In this situation, it is usually better to use 2 mirror VDEVs for the better
+performance, as the usable space will be the same. In a RAIDZ VDEV, any drive
+can fail and the pool will stay operational. The number of sustainable drive
+failures is defined by the redundancy level: a RAIDZ1 can survive the loss of
+1 disk, a RAIDZ2 the loss of 2, and a RAIDZ3 the loss of 3 disks.
+
+Another important factor when using any RAIDZ level is how ZVOL datasets, which
+are used for VM disks, behave. For each data block, the pool needs parity data
+which is at least the size of the minimum block size defined by the `ashift`
+value of the pool. With an ashift of 12, the block size of the pool is 4k. The
+default block size for a ZVOL is 8k. Therefore, in a RAIDZ2, each 8k block
+written will cause two additional 4k parity blocks to be written:
+8k + 4k + 4k = 16k. This is of course a simplified approach, and the real
+situation will deviate slightly, as metadata, compression and such are not
+accounted for in this example.
+
+This behavior can be observed when checking the following properties of the
+ZVOL:
+
+ * `volsize`
+ * `refreservation` (if the pool is not thin provisioned)
+ * `used` (if the pool is thin provisioned and without snapshots present)
+
+----
+# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
+----
+
+`volsize` is the size of the disk as it is presented to the VM, while
+`refreservation` shows the reserved space on the pool which includes the
+expected space needed for the parity data. If the pool is thin provisioned, the
+`refreservation` will be set to 0. Another way to observe the behavior is to
+compare the used disk space within the VM and the `used` property. Be aware
+that snapshots will skew the value.
+
+To counter the increased use of space, there are a few options:
+
+* Increase the `volblocksize` to improve the data to parity ratio
+* Use 'mirror' VDEVs instead of 'RAIDZ'
+* Use `ashift=9` (block size of 512 bytes)
+
+The `volblocksize` property can only be set when creating a ZVOL. The default
+value can be changed in the storage configuration. When doing this, the guest
+needs to be tuned accordingly, and depending on the use case, the problem of
+write amplification is just moved from the ZFS layer up to the guest.
+
+A pool made of 'mirror' VDEVs has a different usable space and failure behavior
+than a 'RAIDZ' pool.
+
+Using `ashift=9` when creating the pool can lead to bad
+performance, depending on the disks underneath, and cannot be changed later on.
+
+
Bootloader
~~~~~~~~~~
--
2.20.1
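The simplified parity arithmetic in the patch (8k + 4k + 4k = 16k for an 8k block on a RAIDZ2 pool with ashift=12) can be sketched as a quick calculation. The function below is purely illustrative and follows the patch's simplified model only; real RAIDZ allocation also involves padding sectors, metadata and compression, which are deliberately ignored here:

```python
# Simplified model of RAIDZ space usage for a single ZVOL block, per the
# patch text: each written block needs `parity` extra sectors of
# `1 << ashift` bytes. Real allocation differs (padding, metadata, etc.).
def simplified_zvol_block_usage(volblocksize: int, ashift: int, parity: int) -> int:
    sector = 1 << ashift  # minimum block size of the pool
    return volblocksize + parity * sector

# RAIDZ2 pool with ashift=12 (4k sectors), default 8k volblocksize:
usage = simplified_zvol_block_usage(8 * 1024, 12, 2)
print(usage)                # 16384 bytes, i.e. 8k + 4k + 4k = 16k
print(usage / (8 * 1024))   # 2.0: the block occupies twice its nominal size
```

This also shows why increasing `volblocksize` improves the data-to-parity ratio: a 64k block on the same pool would use 64k + 8k = 72k, an overhead of only 12.5% instead of 100%.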
* Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
2020-07-17 12:12 [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels Aaron Lauterer
@ 2020-07-17 13:23 ` Andreas Steinel
2020-07-21 7:54 ` Aaron Lauterer
2020-07-20 18:30 ` Stoiko Ivanov
1 sibling, 1 reply; 5+ messages in thread
From: Andreas Steinel @ 2020-07-17 13:23 UTC (permalink / raw)
To: Proxmox VE development discussion
Very good.
Maybe we can also include some references to books, e.g. the ZFS books from
Allan Jude and Michael W. Lucas for further reading?
--
With kind regards / Mit freundlichen Grüßen
Andreas Steinel
M.Sc. Visual Computing
M.Sc. Informatik
* Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
2020-07-17 13:23 ` Andreas Steinel
@ 2020-07-21 7:54 ` Aaron Lauterer
0 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2020-07-21 7:54 UTC (permalink / raw)
To: Proxmox VE development discussion, Andreas Steinel
On 7/17/20 3:23 PM, Andreas Steinel wrote:
> Very good.
>
> Maybe we can also include some references to books, e.g. the ZFS books from
> Allan Jude and Michael W. Lucas for further reading?
>
That is a good idea IMHO. But I think this section is not the right place. I would put it somewhere in the introduction at the beginning of the ZFS chapter.
* Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
2020-07-17 12:12 [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels Aaron Lauterer
2020-07-17 13:23 ` Andreas Steinel
@ 2020-07-20 18:30 ` Stoiko Ivanov
2020-07-21 7:55 ` Aaron Lauterer
1 sibling, 1 reply; 5+ messages in thread
From: Stoiko Ivanov @ 2020-07-20 18:30 UTC (permalink / raw)
To: Aaron Lauterer; +Cc: Proxmox VE development discussion
Thanks for picking this up! Looking forward to not searching the web/our
forum for the good answers to questions that come up quite often.
a few mostly stylistic (as in more a matter of my taste) comments inline:
On Fri, 17 Jul 2020 14:12:32 +0200
Aaron Lauterer <a.lauterer@proxmox.com> wrote:
> This new section explains the performance and failure properties of
> mirror and RAIDZ VDEVs as well as the "unexpected" higher space usage by
> ZVOLs on a RAIDZ.
>
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
>
> This is a first draft to explain the performance characteristics of the
> different RAID levels / VDEV types, as well as their failure behavior.
>
> Additionally it explains the situation why a VM disk (ZVOL) can end up
> using quite a bit more space than expected when placed on a pool made of
> RAIDZ VDEVs.
>
> The motivation behind this is, that in the recent past, these things
> came up quite a bit. Thus it would be nice to have some documentation
> that we can link to and having it in the docs might help users to make
> an informed decision from the start.
>
> I hope I did not mess up any technical details and that it is
> understandable enough.
>
> local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 96 insertions(+)
>
> diff --git a/local-zfs.adoc b/local-zfs.adoc
> index fd03e89..48f6540 100644
> --- a/local-zfs.adoc
> +++ b/local-zfs.adoc
> @@ -151,6 +151,102 @@ rpool/swap 4.25G 7.69T 64K -
> ----
>
>
> +[[sysadmin_zfs_raid_considerations]]
> +ZFS RAID Level Considerations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There are a few factors to take into consideration when choosing the layout of
> +a ZFS pool.
> +
> +
> +[[sysadmin_zfs_raid_performance]]
> +Performance
> +^^^^^^^^^^^
> +
> +Different types of VDEVs have different performance behaviors. The two
we have a few mentions of vdev (written without caps in the
system-booting.adoc) - for consistency either write it small here as well
or change the system-booting part.
as for the content - a short explanation what a vdev is might be helpful,
and mentioning that all top level vdevs in a pool are striped (as in
RAID0) together
> +parameters of interest are the IOPS (Input/Output Operations per Second) and the
> +bandwidth with which data can be written or read.
> +
> +A 'mirror' VDEV will approximately behave like a single disk in regards to both
> +parameters when writing data. When reading data if will behave like the number
> +of disks in the mirror.
in the section above (and in the installer and disk management GUI we talk about
RAIDX - maybe refer to this at least in paranthesis:
A 'mirror' VDEV (RAID1) ....
> +
> +A common situation is to have 4 disks. When setting it up as 2 mirror VDEVs the
here the same with RAID10
> +pool will have the write characteristics as two single disks in regard of IOPS
> +and bandwidth. For read operations it will resemble 4 single disks.
> +
> +A 'RAIDZ' of any redundancy level will approximately behave like a single disk
> +in regard of IOPS with a lot of bandwidth. How much bandwidth depends on the
> +size of the RAIDZ VDEV and the redundancy level.
> +
> +For running VMs, IOPS is the more important metric in most situations.
> +
> +
> +[[sysadmin_zfs_raid_size_space_usage_redundancy]]
> +Size, Space usage and Redundancy
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +While a pool made of 'mirror' VDEVs will have the best performance
> +characteristics, the usable space will be 50% of the disks available. Less if a
> +mirror VDEV consists of more than 2 disks. To stay functional it needs to have
maybe add (e.g. a 3-way mirror) after 2 disks
s/To stay functional it needs to have at least one disk per mirror VDEV
available/At least one healthy disk per mirror is needed for the pool to
work/ ?
> +at least one disk per mirror VDEV available. The pool will fail once all disks
> +in a mirror VDEV fail.
maybe drop the last sentence
> +
> +When using a pool made of 'RAIDZ' VDEVs the usable space to disk ratio will be
> +better in most situations than using mirror VDEVs. This is especially true when
Why not actively describe the usable space: The usable space of a 'RAIDZ'
type VDEV of N disks is roughly N-X, with X being the RAIDZ-level.
The RAIDZ-level also indicates how many arbitrary disks can fail without
losing data. (and drop the redundancy sentence below
> +using a large number of disks. A special case is a 4 disk pool with RAIDZ2. In
> +this situation it is usually better to use 2 mirror VDEVs for the better
> +performance as the usable space will be the same. In a RAIDZ VDEV, any drive
> +can fail and it will stay operational. The number of sustainable drive failures
> +is defined by the redundancy level, a RAIDZ1 can survive the loss of 1 disk,
> +consequently, a RAIDZ2 the loss of 2 and a RAIDZ3 the loss of 3 disks.
> +
> +Another important factor when using any RAIDZ level is how ZVOL datasets, which
> +are used for VM disks, behave. For each data block the pool needs parity data
> +which are at least the size of the minimum block size defined by the `ashift`
which _is_ at least the size of?
> +value of the pool. With an ashift of 12 the block size of the pool is 4k. The
> +default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
> +written will cause two additional 4k parity blocks to be written,
> +8k + 4k + 4k = 16k. This is of course a simplified approach and the real
> +situation will be slightly different with metadata, compression and such not
> +being accounted for in this example.
> +
> +This behavior can be observed when checking the following properties of the
> +ZVOL:
> +
> + * `volsize`
> + * `refreservation` (if the pool is not thin provisioned)
> + * `used` (if the pool is thin provisioned and without snapshots present)
> +
> +----
> +# zfs get volsize,refreservation,used /<pool>/vm-<vmid>-disk-X
> +----
the '/' in the beginning should be dropped
> +
> +`volsize` is the size of the disk as it is presented to the VM, while
> +`refreservation` shows the reserved space on the pool which includes the
> +expected space needed for the parity data. If the pool is thin provisioned, the
> +`refreservation` will be set to 0. Another way to observe the behavior is to
> +compare the used disk space within the VM and the `used` property. Be aware
> +that snapshots will skew the value.
> +
> +To counter this effect there are a few options.
s/this effect/the increased use of space/
> +
> +* Increase the `volblocksize` to improve the data to parity ratio
> +* Use 'mirror' VDEVs instead of 'RAIDZ'
> +* Use `ashift=9` (block size of 512 bytes)
> +
> +The `volblocksize` property can only be set when creating a ZVOL. The default
> +value can be changed in the storage configuration. When doing this, the guest
> +needs to be tuned accordingly and depending on the use case, the problem of
> +write amplification if just moved from the ZFS layer up to the guest.
> +
> +A pool made of 'mirror' VDEVs has a different usable space and failure behavior
> +than a 'RAIDZ' pool.
This is already explained above?
Maybe add a short recommendation - 'RAID10 has favorable behavior for VM
workloads - use RAID10, unless your environment has specific needs and
characteristics where RAIDZ performance characteristics are acceptable' ?
> +
> +Using `ashift=9` when creating the pool can lead to bad
> +performance, depending on the disks underneath, and cannot be changed later on.
> +
> +
> Bootloader
> ~~~~~~~~~~
>
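Stoiko's suggested rule of thumb for usable space, a RAIDZ VDEV of N disks yields roughly N-X disks' worth of space with X being the RAIDZ level, can be sketched as follows. This is illustrative only; actual usable space also depends on ashift, volblocksize and padding:

```python
# Rough usable capacity, counted in "number of disks", following the
# rule of thumb from the review: a RAIDZ-X vdev of n disks yields about
# n - X disks of space; a pool of mirror vdevs yields one disk per vdev.
def raidz_usable_disks(n_disks: int, raidz_level: int) -> int:
    return n_disks - raidz_level

def mirror_pool_usable_disks(n_vdevs: int) -> int:
    return n_vdevs  # one disk's worth of space per mirror vdev

# The "special case" from the patch: 4 disks as one RAIDZ2 vdev versus
# 2 mirror vdevs (RAID10) - the usable space is the same, so the mirror
# layout usually wins on IOPS.
print(raidz_usable_disks(4, 2))      # 2
print(mirror_pool_usable_disks(2))   # 2
```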
* Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
2020-07-20 18:30 ` Stoiko Ivanov
@ 2020-07-21 7:55 ` Aaron Lauterer
0 siblings, 0 replies; 5+ messages in thread
From: Aaron Lauterer @ 2020-07-21 7:55 UTC (permalink / raw)
To: Stoiko Ivanov; +Cc: Proxmox VE development discussion
Thanks, will incorporate these.
On 7/20/20 8:30 PM, Stoiko Ivanov wrote:
> Thanks for picking this up! Looking forward to not searching the web/our
> forum for the good answers to questions that come up quite often.
>
> a few mostly stylistic (as in more a matter of my taste) comments inline:
>