From: Oguz Bektas
To: pve-devel@lists.proxmox.com, pbs-devel@lists.proxmox.com
Date: Wed, 8 Jul 2020 14:41:48 +0200
Message-Id: <20200708124148.359379-1-o.bektas@proxmox.com>
Subject: [pve-devel] [PATCH proxmox-backup] add local-zfs.rst

content is > 90% the same as local-zfs.adoc in pve-docs.

adapted the format for .rst
fixed some typos and wrote some parts slightly differently (wording).

Signed-off-by: Oguz Bektas
---
 docs/local-zfs.rst | 374 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 374 insertions(+)
 create mode 100644 docs/local-zfs.rst

diff --git a/docs/local-zfs.rst b/docs/local-zfs.rst
new file mode 100644
index 00000000..fd56474a
--- /dev/null
+++ b/docs/local-zfs.rst
@@ -0,0 +1,374 @@

ZFS on Linux
=============

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it is possible to achieve maximum enterprise features with
low-budget hardware, but also high performance systems by leveraging
SSD caching or even SSD-only setups. ZFS can replace cost-intensive
hardware RAID cards with moderate CPU and memory load, combined with easy
management.
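Because the modules and tools ship as regular packages, a quick sanity check is enough to confirm that ZFS is available. This is a minimal sketch, assuming ZFS 0.8 or newer (where the ``version`` subcommand exists); the reported versions will differ per system:

.. code-block:: console

  # modinfo -F version zfs
  # zfs version

If the kernel module version and the userland version differ after a package upgrade, a reboot typically brings them back in sync.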
General ZFS advantages:

* Easy configuration and management with GUI and CLI.
* Reliable
* Protection against data corruption
* Data compression on file system level
* Snapshots
* Copy-on-write clone
* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
* Can use SSD for cache
* Self healing
* Continuous integrity checking
* Designed for high storage capacities
* Asynchronous replication over network
* Open Source
* Encryption

Hardware
---------

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

.. IMPORTANT:: Do not use ZFS on top of a hardware RAID controller which has its
   own cache management. ZFS needs to communicate directly with the disks. An
   HBA adapter is the way to go, or something like an LSI controller flashed
   in ``IT`` mode.


ZFS Administration
------------------

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

.. code-block:: console

  # man zpool
  # man zfs

Create a new zpool
~~~~~~~~~~~~~~~~~~

To create a new pool, at least one disk is needed. The `ashift` should
have the same sector size (2 to the power of `ashift`) as the underlying
disk, or larger.

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device>

Create a new pool with RAID-0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 1 disk

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 2 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 3 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Create a new pool with cache (L2ARC)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to use a dedicated cache drive partition to increase
the performance (use SSD).

For `<device>`, it is possible to use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to use a dedicated drive partition as log device to increase
the performance (use SSD).

For `<device>`, it is possible to use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> log <log_device>
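Putting the above together, a hypothetical example (the pool name ``tank``, and the ``/dev/sd*`` device and partition names, are placeholders) that creates a mirrored pool with a separate log and cache partition and then verifies the layout could look like this:

.. code-block:: console

  # zpool create -f -o ashift=12 tank mirror /dev/sda /dev/sdb log /dev/sdc1 cache /dev/sdc2
  # zpool status tank

``zpool status`` should then list the mirror under the pool, with separate ``logs`` and ``cache`` sections for the dedicated devices.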
Add cache and log to an existing pool
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk`.

.. important:: Always use GPT partition tables.

The maximum size of a log device should be about half the size of
physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

.. code-block:: console

  # zpool add -f <pool> log <device-part1> cache <device-part2>


Changing a failed device
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: console

  # zpool replace -f <pool> <old device> <new device>


Changing a failed bootable device
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Depending on how Proxmox Backup was installed, it is either using `grub` or `systemd-boot`
as bootloader.

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed which depend on the bootloader in use.

.. code-block:: console

  # sgdisk <healthy bootable device> -R <new device>
  # sgdisk -G <new device>
  # zpool replace -f <pool> <old zfs partition> <new zfs partition>

.. NOTE:: Use the `zpool status -v` command to monitor how far the resilvering process of the new disk has progressed.

With `systemd-boot`:

.. code-block:: console

  # pve-efiboot-tool format <new disk's ESP>
  # pve-efiboot-tool init <new disk's ESP>

.. NOTE:: `ESP` stands for EFI System Partition, which is set up as partition #2 on
   bootable disks when using the Proxmox VE installer since version 5.4. For details,
   see "Setting up a new partition for use as synced ESP" in the Proxmox VE documentation.

With `grub`:

Usually `grub.cfg` is located in `/boot/grub/grub.cfg`.

.. code-block:: console

  # grub-install <new disk>
  # grub-mkconfig -o /path/to/grub.cfg


Activate E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send emails on ZFS events like
pool errors. Newer ZFS packages ship the daemon in a separate package,
and you can install it using `apt-get`:

.. code-block:: console

  # apt-get install zfs-zed

To activate the daemon, it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
favourite editor and uncomment the `ZED_EMAIL_ADDR` setting:

.. code-block:: console

  ZED_EMAIL_ADDR="root"

Please note that Proxmox Backup forwards mails to `root` to the email address
configured for the root user.

.. IMPORTANT:: The only setting that is required is `ZED_EMAIL_ADDR`. All
   other settings are optional.

Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

It is good to use at most 50 percent (which is the default) of the
system memory for the ZFS ARC, to prevent performance degradation of the
host. Use your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

.. code-block:: console

  options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8GB.

.. IMPORTANT:: If your root file system is ZFS, you must update your initramfs every time this value changes:

.. code-block:: console

  # update-initramfs -u
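To check whether the limit is actually in effect (for example after a reboot), the configured cap and the current ARC size can be read back; a quick sketch:

.. code-block:: console

  # cat /sys/module/zfs/parameters/zfs_arc_max
  # grep ^size /proc/spl/kstat/zfs/arcstats

The first command should print the configured value (8589934592 in the example above), while the ``size`` line from ``arcstats`` reports the current ARC size in bytes.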
SWAP on ZFS
~~~~~~~~~~~

Swap-space created on a zvol may generate some troubles, like blocking the
server or generating a high IO load, often seen when starting a backup
to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the `swappiness` value.
A good value for servers is 10:

.. code-block:: console

  # sysctl -w vm.swappiness=10

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

.. code-block:: console

  vm.swappiness = 10

.. table:: Linux kernel `swappiness` parameter values
   :widths: auto

   ====================  ================================================================
   Value                 Strategy
   ====================  ================================================================
   vm.swappiness = 0     The kernel will swap only to avoid an 'out of memory' condition
   vm.swappiness = 1     Minimum amount of swapping without disabling it entirely.
   vm.swappiness = 10    Sometimes recommended to improve performance when sufficient memory exists in a system.
   vm.swappiness = 60    The default value.
   vm.swappiness = 100   The kernel will swap aggressively.
   ====================  ================================================================

ZFS Compression
~~~~~~~~~~~~~~~

To activate compression:

.. code-block:: console

  # zfs set compression=lz4 <pool>

We recommend using the `lz4` algorithm, since it adds very little CPU overhead.
Other algorithms such as `lzjb` and `gzip-N` (where `N` is an integer from `1` to `9`
representing the compression level, where 1 is fastest and 9 compresses best) are also available.
Depending on the algorithm and how compressible the data is, having compression enabled can even increase
I/O performance.

You can disable compression at any time with:

.. code-block:: console

  # zfs set compression=off <dataset>

Only new blocks will be affected by this change.

ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

.. IMPORTANT:: The redundancy of the `special` device should match the one of the
   pool, since the `special` device is a point of failure for the whole pool.

.. WARNING:: Adding a `special` device to a pool cannot be undone!

Create a pool with `special` device and RAID-1:

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a `special` device to an existing pool with RAID-1:

.. code-block:: console

  # zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range between `512B` and `128K`. After setting the property, new file
blocks smaller than `size` will be allocated on the `special` device.

.. IMPORTANT:: If the value for `special_small_blocks` is greater than or equal to
   the `recordsize` (default `128K`) of the dataset, *all* data will be written to
   the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all containers
in the pool will opt in for small file blocks).

Opt in for all files smaller than 4K blocks pool-wide:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=0 <pool>/<filesystem>
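To confirm the setting and see how much data ends up on the `special` vdev, the property can be read back and the per-vdev allocation inspected (the pool name ``tank`` below is a placeholder):

.. code-block:: console

  # zfs get special_small_blocks tank
  # zpool list -v tank

``zpool list -v`` lists the `special` mirror as its own vdev, so its ``ALLOC`` column shows how much metadata and small-block data has been placed on it.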
Troubleshooting
~~~~~~~~~~~~~~~

Corrupted cachefile

Sometimes the ZFS cachefile can get corrupted, and `zfs-import-cache.service`
then does not import pools that are not present in the cachefile. As a result,
some volumes may not be mounted during boot and have to be mounted manually
later.

For each pool, run:

.. code-block:: console

  # zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

and afterwards update the `initramfs` by running:

.. code-block:: console

  # update-initramfs -u -k all

and finally reboot your node.

Another workaround to this problem is enabling the `zfs-import-scan.service`,
which searches and imports pools via device scanning (usually slower).
-- 
2.20.1