From: Stoiko Ivanov <s.ivanov@proxmox.com>
To: Aaron Lauterer <a.lauterer@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
Date: Mon, 20 Jul 2020 20:30:39 +0200
Message-ID: <20200720203039.2f1f0b74@rosa.proxmox.com>
In-Reply-To: <20200717121232.29020-1-a.lauterer@proxmox.com>
Thanks for picking this up! Looking forward to not searching the web/our
forum for the good answers to questions that come up quite often.
A few mostly stylistic comments (more a matter of my taste) inline:
On Fri, 17 Jul 2020 14:12:32 +0200
Aaron Lauterer <a.lauterer@proxmox.com> wrote:
> This new section explains the performance and failure properties of
> mirror and RAIDZ VDEVs as well as the "unexpected" higher space usage by
> ZVOLs on a RAIDZ.
>
> Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
> ---
>
> This is a first draft to explain the performance characteristics of the
> different RAID levels / VDEV types, as well as their failure behavior.
>
> Additionally it explains the situation why a VM disk (ZVOL) can end up
> using quite a bit more space than expected when placed on a pool made of
> RAIDZ VDEVs.
>
> The motivation behind this is, that in the recent past, these things
> came up quite a bit. Thus it would be nice to have some documentation
> that we can link to and having it in the docs might help users to make
> an informed decision from the start.
>
> I hope I did not mess up any technical details and that it is
> understandable enough.
>
> local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 96 insertions(+)
>
> diff --git a/local-zfs.adoc b/local-zfs.adoc
> index fd03e89..48f6540 100644
> --- a/local-zfs.adoc
> +++ b/local-zfs.adoc
> @@ -151,6 +151,102 @@ rpool/swap 4.25G 7.69T 64K -
> ----
>
>
> +[[sysadmin_zfs_raid_considerations]]
> +ZFS RAID Level Considerations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There are a few factors to take into consideration when choosing the layout of
> +a ZFS pool.
> +
> +
> +[[sysadmin_zfs_raid_performance]]
> +Performance
> +^^^^^^^^^^^
> +
> +Different types of VDEVs have different performance behaviors. The two
We have a few mentions of vdev (written in lowercase in
system-booting.adoc) - for consistency, either write it in lowercase here
as well or change the system-booting part.
As for the content - a short explanation of what a vdev is might be
helpful, as would mentioning that all top-level vdevs in a pool are
striped together (as in RAID0).
> +parameters of interest are the IOPS (Input/Output Operations per Second) and the
> +bandwidth with which data can be written or read.
> +
> +A 'mirror' VDEV will approximately behave like a single disk in regards to both
> +parameters when writing data. When reading data it will behave like the number
> +of disks in the mirror.
In the section above (and in the installer and disk management GUI) we talk
about RAIDX - maybe refer to this, at least in parentheses:
A 'mirror' VDEV (RAID1) ....
> +
> +A common situation is to have 4 disks. When setting it up as 2 mirror VDEVs the
Same here, with RAID10.
> +pool will have the write characteristics of two single disks in regard to IOPS
> +and bandwidth. For read operations it will resemble 4 single disks.
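For readers less familiar with ZFS, a concrete command might help here. A
sketch of such a 4-disk RAID10-style setup (pool name and device paths are
placeholders, not from the patch):

```shell
# Hypothetical example: 4 disks as two striped mirror vdevs (RAID10-like).
# Pool name "tank" and the /dev/sdX paths are placeholders.
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
zpool status tank
```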
> +
> +A 'RAIDZ' of any redundancy level will approximately behave like a single disk
> +in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the
> +size of the RAIDZ VDEV and the redundancy level.
> +
> +For running VMs, IOPS is the more important metric in most situations.
> +
> +
> +[[sysadmin_zfs_raid_size_space_usage_redundancy]]
> +Size, Space usage and Redundancy
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +While a pool made of 'mirror' VDEVs will have the best performance
> +characteristics, the usable space will be 50% of the disks available. Less if a
> +mirror VDEV consists of more than 2 disks. To stay functional it needs to have
maybe add (e.g. a 3-way mirror) after 2 disks
s/To stay functional it needs to have at least one disk per mirror VDEV
available/At least one healthy disk per mirror is needed for the pool to
work/ ?
> +at least one disk per mirror VDEV available. The pool will fail once all disks
> +in a mirror VDEV fail.
maybe drop the last sentence
> +
> +When using a pool made of 'RAIDZ' VDEVs the usable space to disk ratio will be
> +better in most situations than using mirror VDEVs. This is especially true when
Why not actively describe the usable space: The usable space of a 'RAIDZ'
type VDEV of N disks is roughly N-X, with X being the RAIDZ-level.
The RAIDZ-level also indicates how many arbitrary disks can fail without
losing data. (And drop the redundancy sentence below.)
> +using a large number of disks. A special case is a 4 disk pool with RAIDZ2. In
> +this situation it is usually better to use 2 mirror VDEVs for the better
> +performance as the usable space will be the same. In a RAIDZ VDEV, any drive
> +can fail and it will stay operational. The number of sustainable drive failures
> +is defined by the redundancy level, a RAIDZ1 can survive the loss of 1 disk,
> +consequently, a RAIDZ2 the loss of 2 and a RAIDZ3 the loss of 3 disks.
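The N-X rule of thumb I suggest above can be sketched with a quick
back-of-the-envelope calculation (disk count and size chosen arbitrarily):

```shell
# Rough usable capacity of a RAIDZ vdev: (N - P) disks' worth of raw space,
# where N is the disk count and P the RAIDZ parity level (1, 2 or 3).
disks=6; parity=2; disk_size_tb=4   # hypothetical 6x 4TB RAIDZ2
usable=$(( (disks - parity) * disk_size_tb ))
raw=$(( disks * disk_size_tb ))
echo "${usable} TB usable of ${raw} TB raw"
```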
> +
> +Another important factor when using any RAIDZ level is how ZVOL datasets, which
> +are used for VM disks, behave. For each data block the pool needs parity data
> +which are at least the size of the minimum block size defined by the `ashift`
which _is_ at least the size of?
> +value of the pool. With an ashift of 12 the block size of the pool is 4k. The
> +default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
> +written will cause two additional 4k parity blocks to be written,
> +8k + 4k + 4k = 16k. This is of course a simplified approach and the real
> +situation will be slightly different with metadata, compression and such not
> +being accounted for in this example.
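Perhaps worth spelling out the arithmetic, since it surprises people; the
same simplified model the patch describes, as a one-liner:

```shell
# Simplified model: one 8k ZVOL block on a RAIDZ2 with ashift=12 (4k minimum
# block size), ignoring metadata, padding and compression.
data=8192; min_block=4096; parity_blocks=2
allocated=$(( data + parity_blocks * min_block ))
echo "$allocated bytes allocated for $data bytes of data"   # 16384 for 8192
```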
> +
> +This behavior can be observed when checking the following properties of the
> +ZVOL:
> +
> + * `volsize`
> + * `refreservation` (if the pool is not thin provisioned)
> + * `used` (if the pool is thin provisioned and without snapshots present)
> +
> +----
> +# zfs get volsize,refreservation,used /<pool>/vm-<vmid>-disk-X
> +----
the '/' in the beginning should be dropped
> +
> +`volsize` is the size of the disk as it is presented to the VM, while
> +`refreservation` shows the reserved space on the pool which includes the
> +expected space needed for the parity data. If the pool is thin provisioned, the
> +`refreservation` will be set to 0. Another way to observe the behavior is to
> +compare the used disk space within the VM and the `used` property. Be aware
> +that snapshots will skew the value.
> +
> +To counter this effect there are a few options.
s/this effect/the increased use of space/
> +
> +* Increase the `volblocksize` to improve the data to parity ratio
> +* Use 'mirror' VDEVs instead of 'RAIDZ'
> +* Use `ashift=9` (block size of 512 bytes)
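Maybe also show how the first option changes the data-to-parity ratio; a
sketch using the same simplified RAIDZ2/ashift=12 model as above (real
allocation also depends on padding, metadata and compression):

```shell
# Larger volblocksize amortizes the (roughly) fixed parity cost per block.
for vbs in 8192 16384 65536; do
  alloc=$(( vbs + 2 * 4096 ))                # data + 2 parity blocks (RAIDZ2)
  overhead=$(( (alloc - vbs) * 100 / vbs ))  # parity overhead in percent
  echo "volblocksize=${vbs}: ${overhead}% overhead"
done
```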
> +
> +The `volblocksize` property can only be set when creating a ZVOL. The default
> +value can be changed in the storage configuration. When doing this, the guest
> +needs to be tuned accordingly and depending on the use case, the problem of
> +write amplification is just moved from the ZFS layer up to the guest.
> +
> +A pool made of 'mirror' VDEVs has a different usable space and failure behavior
> +than a 'RAIDZ' pool.
This is already explained above?
Maybe add a short recommendation: 'RAID10 has favorable behavior for VM
workloads - use RAID10, unless your environment has specific needs where
RAIDZ's performance characteristics are acceptable'?
> +
> +Using `ashift=9` when creating the pool can lead to bad
> +performance, depending on the disks underneath, and cannot be changed later on.
> +
> +
> Bootloader
> ~~~~~~~~~~
>