From: Andreas Steinel
Date: Fri, 17 Jul 2020 15:23:39 +0200
To: Proxmox VE development discussion
Subject: Re: [pve-devel] [PATCH docs DRAFT] Add section with more infos about ZFS RAID levels
In-Reply-To: <20200717121232.29020-1-a.lauterer@proxmox.com>

Very good. Maybe we can also include some references to books, e.g. the ZFS
books from Allan Jude and Michael W. Lucas for further reading?

On Fri, Jul 17, 2020 at 2:13 PM Aaron Lauterer wrote:

> This new section explains the performance and failure properties of
> mirror and RAIDZ VDEVs as well as the "unexpected" higher space usage by
> ZVOLs on a RAIDZ.
>
> Signed-off-by: Aaron Lauterer
> ---
>
> This is a first draft to explain the performance characteristics of the
> different RAID levels / VDEV types, as well as their failure behavior.
>
> Additionally it explains the situation why a VM disk (ZVOL) can end up
> using quite a bit more space than expected when placed on a pool made of
> RAIDZ VDEVs.
>
> The motivation behind this is, that in the recent past, these things
> came up quite a bit. Thus it would be nice to have some documentation
> that we can link to and having it in the docs might help users to make
> an informed decision from the start.
>
> I hope I did not mess up any technical details and that it is
> understandable enough.
>
> local-zfs.adoc | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 96 insertions(+)
>
> diff --git a/local-zfs.adoc b/local-zfs.adoc
> index fd03e89..48f6540 100644
> --- a/local-zfs.adoc
> +++ b/local-zfs.adoc
> @@ -151,6 +151,102 @@ rpool/swap 4.25G 7.69T 64K -
> ----
>
>
> +[[sysadmin_zfs_raid_considerations]]
> +ZFS RAID Level Considerations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +There are a few factors to take into consideration when choosing the layout of
> +a ZFS pool.
> +
> +
> +[[sysadmin_zfs_raid_performance]]
> +Performance
> +^^^^^^^^^^^
> +
> +Different types of VDEVs have different performance behaviors. The two
> +parameters of interest are the IOPS (Input/Output Operations per Second) and the
> +bandwidth with which data can be written or read.
> +
> +A 'mirror' VDEV will approximately behave like a single disk in regard to both
> +parameters when writing data. When reading data, it will behave like the number
> +of disks in the mirror.
> +
> +A common situation is to have 4 disks. When setting it up as 2 mirror VDEVs, the
> +pool will have the write characteristics of two single disks in regard to IOPS
> +and bandwidth. For read operations it will resemble 4 single disks.
> +
> +A 'RAIDZ' of any redundancy level will approximately behave like a single disk
> +in regard to IOPS, but with a lot of bandwidth. How much bandwidth depends on the
> +size of the RAIDZ VDEV and the redundancy level.
> +
> +For running VMs, IOPS is the more important metric in most situations.
> +
> +
> +[[sysadmin_zfs_raid_size_space_usage_redundancy]]
> +Size, Space usage and Redundancy
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +While a pool made of 'mirror' VDEVs will have the best performance
> +characteristics, the usable space will be 50% of the disks available. Less if a
> +mirror VDEV consists of more than 2 disks. To stay functional it needs to have
> +at least one disk per mirror VDEV available. The pool will fail once all disks
> +in a mirror VDEV fail.
> +
> +When using a pool made of 'RAIDZ' VDEVs the usable space to disk ratio will be
> +better in most situations than using mirror VDEVs. This is especially true when
> +using a large number of disks. A special case is a 4 disk pool with RAIDZ2. In
> +this situation it is usually better to use 2 mirror VDEVs for the better
> +performance as the usable space will be the same. In a RAIDZ VDEV, any drive
> +can fail and it will stay operational. The number of sustainable drive failures
> +is defined by the redundancy level: a RAIDZ1 can survive the loss of 1 disk,
> +a RAIDZ2 the loss of 2, and a RAIDZ3 the loss of 3 disks.
> +
> +Another important factor when using any RAIDZ level is how ZVOL datasets, which
> +are used for VM disks, behave. For each data block the pool needs parity data
> +which is at least the size of the minimum block size defined by the `ashift`
> +value of the pool. With an ashift of 12 the block size of the pool is 4k. The
> +default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
> +written will cause two additional 4k parity blocks to be written,
> +8k + 4k + 4k = 16k. This is of course a simplified approach and the real
> +situation will be slightly different with metadata, compression and such not
> +being accounted for in this example.
> +
> +This behavior can be observed when checking the following properties of the
> +ZVOL:
> +
> + * `volsize`
> + * `refreservation` (if the pool is not thin provisioned)
> + * `used` (if the pool is thin provisioned and without snapshots present)
> +
> +----
> +# zfs get volsize,refreservation,used //vm--disk-X
> +----
> +
> +`volsize` is the size of the disk as it is presented to the VM, while
> +`refreservation` shows the reserved space on the pool which includes the
> +expected space needed for the parity data. If the pool is thin provisioned, the
> +`refreservation` will be set to 0. Another way to observe the behavior is to
> +compare the used disk space within the VM and the `used` property. Be aware
> +that snapshots will skew the value.
> +
> +To counter this effect there are a few options.
> +
> +* Increase the `volblocksize` to improve the data to parity ratio
> +* Use 'mirror' VDEVs instead of 'RAIDZ'
> +* Use `ashift=9` (block size of 512 bytes)
> +
> +The `volblocksize` property can only be set when creating a ZVOL. The default
> +value can be changed in the storage configuration. When doing this, the guest
> +needs to be tuned accordingly and, depending on the use case, the problem of
> +write amplification is just moved from the ZFS layer up to the guest.
> +
> +A pool made of 'mirror' VDEVs has a different usable space and failure behavior
> +than a 'RAIDZ' pool.
> +
> +Using `ashift=9` when creating the pool can lead to bad
> +performance, depending on the disks underneath, and cannot be changed later on.
> +
> +
> Bootloader
> ~~~~~~~~~~
>
> --
> 2.20.1
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>

-- 
With kind regards / Mit freundlichen Grüßen

Andreas Steinel
M.Sc. Visual Computing
M.Sc. Informatik
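
As a rough illustration of the RAIDZ space arithmetic in the quoted section
(8k + 4k + 4k = 16k), here is a minimal Python sketch. It follows the same
simplified model, ignoring padding, metadata and compression. It adds one
assumption that the patch text does not spell out: parity is counted once per
stripe row, so the number of disks in the VDEV matters for larger blocks. The
function name `raidz_written_bytes` and the 6-disk RAIDZ2 layout are
hypothetical and only serve the example.

```python
import math


def raidz_written_bytes(volblocksize, ashift, parity, disks):
    """Bytes written for one ZVOL block under a simplified RAIDZ model.

    Assumption: parity sectors are counted once per stripe row.
    Padding, metadata and compression are ignored on purpose.
    """
    sector = 1 << ashift  # ashift=12 -> 4096-byte sectors, ashift=9 -> 512
    data_disks = disks - parity
    if data_disks < 1:
        raise ValueError("a RAIDZ VDEV needs more disks than its parity level")
    data_sectors = math.ceil(volblocksize / sector)
    # Every row of up to `data_disks` data sectors carries `parity` parity sectors.
    rows = math.ceil(data_sectors / data_disks)
    return (data_sectors + parity * rows) * sector


# Hypothetical 6-disk RAIDZ2, comparing the options listed in the patch.
for label, volblocksize, ashift in [
    ("8k volblocksize, ashift=12", 8 * 1024, 12),
    ("16k volblocksize, ashift=12", 16 * 1024, 12),
    ("8k volblocksize, ashift=9", 8 * 1024, 9),
]:
    written = raidz_written_bytes(volblocksize, ashift, parity=2, disks=6)
    print(f"{label}: {written} bytes written, {written / volblocksize:.2f}x")
```

Under these assumptions the first case reproduces the 16k from the patch text
(a factor of 2.00), while a larger `volblocksize` or `ashift=9` brings the
factor down to 1.50 on this layout. Real pools will differ somewhat, as the
patch itself notes.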