From: Oguz Bektas
To: pve-devel@lists.proxmox.com, pbs-devel@lists.proxmox.com
Date: Wed, 8 Jul 2020 14:41:48 +0200
Message-Id: <20200708124148.359379-1-o.bektas@proxmox.com>
Subject: [pve-devel] [PATCH proxmox-backup] add local-zfs.rst

content is > 90% the same as local-zfs.adoc in pve-docs.

adapted the format for .rst
fixed some typos and wrote some parts slightly differently (wording).

Signed-off-by: Oguz Bektas
---
 docs/local-zfs.rst | 374 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 374 insertions(+)
 create mode 100644 docs/local-zfs.rst

diff --git a/docs/local-zfs.rst b/docs/local-zfs.rst
new file mode 100644
index 00000000..fd56474a
--- /dev/null
+++ b/docs/local-zfs.rst
@@ -0,0 +1,374 @@

ZFS on Linux
=============

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it is possible to achieve maximum enterprise features with
low-budget hardware, but also high performance systems by leveraging
SSD caching or even SSD-only setups. ZFS can replace cost-intensive
hardware RAID cards with moderate CPU and memory load, combined with easy
management.
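Because the modules and tools ship as regular packages, a quick sanity check is enough to confirm that ZFS is available. This is a minimal sketch, assuming ZFS 0.8 or newer (where the ``version`` subcommand exists); the reported versions will differ per system:

.. code-block:: console

  # modinfo -F version zfs
  # zfs version

If the kernel module version and the userland version differ after a package upgrade, a reboot typically brings them back in sync.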
General ZFS advantages:

* Easy configuration and management with GUI and CLI.
* Reliable
* Protection against data corruption
* Data compression on file system level
* Snapshots
* Copy-on-write clone
* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
* Can use SSD for cache
* Self healing
* Continuous integrity checking
* Designed for high storage capacities
* Asynchronous replication over network
* Open Source
* Encryption

Hardware
---------

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

.. IMPORTANT:: Do not use ZFS on top of a hardware RAID controller which has its
   own cache management. ZFS needs to communicate directly with the disks. An
   HBA adapter is the way to go, or something like an LSI controller flashed
   in ``IT`` mode.


ZFS Administration
------------------

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

.. code-block:: console

  # man zpool
  # man zfs

Create a new zpool
~~~~~~~~~~~~~~~~~~

To create a new pool, at least one disk is needed. The `ashift` should
have the same sector size (2 to the power of `ashift`) as the underlying
disk, or larger.

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device>

Create a new pool with RAID-0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 1 disk

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 2 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 3 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Create a new pool with cache (L2ARC)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to use a dedicated cache drive partition to increase
the performance (use SSD).

For `<device>`, it is possible to use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to use a dedicated drive partition as log device to increase
the performance (use SSD).

For `<device>`, it is possible to use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> log <log_device>
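Putting the above together, a hypothetical example (the pool name ``tank``, and the ``/dev/sd*`` device and partition names, are placeholders) that creates a mirrored pool with a separate log and cache partition and then verifies the layout could look like this:

.. code-block:: console

  # zpool create -f -o ashift=12 tank mirror /dev/sda /dev/sdb log /dev/sdc1 cache /dev/sdc2
  # zpool status tank

``zpool status`` should then list the mirror under the pool, with separate ``logs`` and ``cache`` sections for the dedicated devices.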
Add cache and log to an existing pool
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk`.

.. important:: Always use GPT partition tables.

The maximum size of a log device should be about half the size of
physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

.. code-block:: console

  # zpool add -f <pool> log <device-part1> cache <device-part2>


Changing a failed device
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

  # zpool replace -f <pool> <old device> <new device>


Changing a failed bootable device
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Depending on how Proxmox Backup was installed, it is either using `grub` or `systemd-boot`
as bootloader.

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed which depend on the bootloader in use.

.. code-block:: console

  # sgdisk <healthy bootable device> -R <new device>
  # sgdisk -G <new device>
  # zpool replace -f <pool> <old zfs partition> <new zfs partition>

.. NOTE:: Use the `zpool status -v` command to monitor how far the resilvering process of the new disk has progressed.

With `systemd-boot`:

.. code-block:: console

  # pve-efiboot-tool format <new disk's ESP>
  # pve-efiboot-tool init <new disk's ESP>

.. NOTE:: `ESP` stands for EFI System Partition, which is set up as partition #2 on
   bootable disks when using the Proxmox VE installer since version 5.4. For details,
   see "Setting up a new partition for use as synced ESP" in the Proxmox VE documentation.

With `grub`:

Usually `grub.cfg` is located in `/boot/grub/grub.cfg`.

.. code-block:: console

  # grub-install <new disk>
  # grub-mkconfig -o /path/to/grub.cfg


Activate E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send emails on ZFS events like
pool errors. Newer ZFS packages ship the daemon in a separate package,
and you can install it using `apt-get`:

.. code-block:: console

  # apt-get install zfs-zed

To activate the daemon, it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
favourite editor and uncomment the `ZED_EMAIL_ADDR` setting:

.. code-block:: console

  ZED_EMAIL_ADDR="root"

Please note that Proxmox Backup forwards mails to `root` to the email address
configured for the root user.

.. IMPORTANT:: The only setting that is required is `ZED_EMAIL_ADDR`. All
   other settings are optional.

Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

It is good to use at most 50 percent (which is the default) of the
system memory for the ZFS ARC, to prevent performance degradation of the
host. Use your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

.. code-block:: console

  options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8GB.

.. IMPORTANT:: If your root file system is ZFS, you must update your initramfs every time this value changes:

.. code-block:: console

  # update-initramfs -u
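To check whether the limit is actually in effect (for example after a reboot), the configured cap and the current ARC size can be read back; a quick sketch:

.. code-block:: console

  # cat /sys/module/zfs/parameters/zfs_arc_max
  # grep ^size /proc/spl/kstat/zfs/arcstats

The first command should print the configured value (8589934592 in the example above), while the ``size`` line from ``arcstats`` reports the current ARC size in bytes.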
SWAP on ZFS
~~~~~~~~~~~

Swap-space created on a zvol may generate some troubles, like blocking the
server or generating a high IO load, often seen when starting a backup
to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the `swappiness` value.
A good value for servers is 10:

.. code-block:: console

  # sysctl -w vm.swappiness=10

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

.. code-block:: console

  vm.swappiness = 10

.. table:: Linux kernel `swappiness` parameter values
   :widths: auto

   ====================  ================================================================
   Value                 Strategy
   ====================  ================================================================
   vm.swappiness = 0     The kernel will swap only to avoid an 'out of memory' condition
   vm.swappiness = 1     Minimum amount of swapping without disabling it entirely.
   vm.swappiness = 10    Sometimes recommended to improve performance when sufficient memory exists in a system.
   vm.swappiness = 60    The default value.
   vm.swappiness = 100   The kernel will swap aggressively.
   ====================  ================================================================

ZFS Compression
~~~~~~~~~~~~~~~

To activate compression:

.. code-block:: console

  # zfs set compression=lz4 <pool>

We recommend using the `lz4` algorithm, since it adds very little CPU overhead.
Other algorithms such as `lzjb` and `gzip-N` (where `N` is an integer from `1` to `9`
representing the compression level, where 1 is fastest and 9 compresses best) are also available.
Depending on the algorithm and how compressible the data is, having compression enabled can even increase
I/O performance.

You can disable compression at any time with:

.. code-block:: console

  # zfs set compression=off <dataset>

Only new blocks will be affected by this change.

ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

.. IMPORTANT:: The redundancy of the `special` device should match the one of the
   pool, since the `special` device is a point of failure for the whole pool.

.. WARNING:: Adding a `special` device to a pool cannot be undone!

Create a pool with `special` device and RAID-1:

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a `special` device to an existing pool with RAID-1:

.. code-block:: console

  # zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range between `512B` and `128K`. After setting the property, new file
blocks smaller than `size` will be allocated on the `special` device.

.. IMPORTANT:: If the value for `special_small_blocks` is greater than or equal to
   the `recordsize` (default `128K`) of the dataset, *all* data will be written to
   the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all containers
in the pool will opt in for small file blocks).

Opt in for all files smaller than 4K blocks pool-wide:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=0 <pool>/<filesystem>
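To confirm the setting and see how much data ends up on the `special` vdev, the property can be read back and the per-vdev allocation inspected (the pool name ``tank`` below is a placeholder):

.. code-block:: console

  # zfs get special_small_blocks tank
  # zpool list -v tank

``zpool list -v`` lists the `special` mirror as its own vdev, so its ``ALLOC`` column shows how much metadata and small-block data has been placed on it.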
Troubleshooting
~~~~~~~~~~~~~~~

Corrupted cachefile

Sometimes the ZFS cachefile can get corrupted, and `zfs-import-cache.service`
then does not import pools that are not present in the cachefile. As a result,
some volumes may not be mounted during boot and have to be mounted manually
later.

For each pool, run:

.. code-block:: console

  # zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

and afterwards update the `initramfs` by running:

.. code-block:: console

  # update-initramfs -u -k all

and finally reboot your node.

Another workaround to this problem is enabling the `zfs-import-scan.service`,
which searches and imports pools via device scanning (usually slower).
-- 
2.20.1