From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id C5F8960332
 for ; Wed, 14 Oct 2020 14:17:22 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id A40349CCA
 for ; Wed, 14 Oct 2020 14:16:52 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 460849B16
 for ; Wed, 14 Oct 2020 14:16:49 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 0BD9845D4B
 for ; Wed, 14 Oct 2020 14:16:49 +0200 (CEST)
From: Stefan Reiter
To: pbs-devel@lists.proxmox.com
Date: Wed, 14 Oct 2020 14:16:39 +0200
Message-Id: <20201014121639.25276-12-s.reiter@proxmox.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20201014121639.25276-1-s.reiter@proxmox.com>
References: <20201014121639.25276-1-s.reiter@proxmox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results: 0
 AWL -0.038 Adjusted score from AWL reputation of From: address
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust
 SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [backup.rs, tablesgenerator.com]
Subject: [pbs-devel] [PATCH proxmox-backup 11/11] rustdoc: overhaul backup
 rustdoc and add locking table
X-BeenThere: pbs-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Backup Server development discussion
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Wed, 14 Oct 2020 12:17:22 -0000

Rewrite most of the documentation to be more readable and correct
(according to the current implementations).

Add a table visualizing all different locks used to synchronize
concurrent operations.

Signed-off-by: Stefan Reiter
---

FYI: I used https://www.tablesgenerator.com/markdown_tables for the table

 src/backup.rs | 201 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 121 insertions(+), 80 deletions(-)

diff --git a/src/backup.rs b/src/backup.rs
index 1b2180bc..95e45d10 100644
--- a/src/backup.rs
+++ b/src/backup.rs
@@ -1,107 +1,148 @@
-//! This module implements the proxmox backup data storage
+//! This module implements the data storage and access layer.
 //!
-//! Proxmox backup splits large files into chunks, and stores them
-//! deduplicated using a content addressable storage format.
+//! # Data formats
 //!
-//! A chunk is simply defined as binary blob, which is stored inside a
-//! `ChunkStore`, addressed by the SHA256 digest of the binary blob.
+//! PBS splits large files into chunks, and stores them deduplicated using
+//! a content addressable storage format.
 //!
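+//! As a rough illustration of what "content addressable" means here, a
+//! chunk's storage location is derived from the digest of its data. The
+//! `sha2`/`hex` crates and the directory layout below are used purely as
+//! an example, not the actual `ChunkStore` implementation:
+//!
+//! ```ignore
+//! use sha2::{Digest, Sha256};
+//! use std::path::{Path, PathBuf};
+//!
+//! /// Illustrative only: map chunk data to a path inside the store.
+//! fn example_chunk_path(store_base: &Path, chunk_data: &[u8]) -> PathBuf {
+//!     let digest = hex::encode(Sha256::digest(chunk_data));
+//!     // group chunks by a short digest prefix to keep directories small
+//!     store_base.join(&digest[..4]).join(digest)
+//! }
+//! ```
+//!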
-//! Index files are used to reconstruct the original file. They
-//! basically contain a list of SHA256 checksums. The `DynamicIndex*`
-//! format is able to deal with dynamic chunk sizes, whereas the
-//! `FixedIndex*` format is an optimization to store a list of equal
-//! sized chunks.
+//! Backup snapshots are stored as folders containing a manifest file and
+//! potentially one or more index or blob files.
 //!
-//! # ChunkStore Locking
+//! The manifest contains hashes of all other files and can be signed by
+//! the client.
 //!
-//! We need to be able to restart the proxmox-backup service daemons,
-//! so that we can update the software without rebooting the host. But
-//! such restarts must not abort running backup jobs, so we need to
-//! keep the old service running until those jobs are finished. This
-//! implies that we need some kind of locking for the
-//! ChunkStore. Please note that it is perfectly valid to have
-//! multiple parallel ChunkStore writers, even when they write the
-//! same chunk (because the chunk would have the same name and the
-//! same data). The only real problem is garbage collection, because
-//! we need to avoid deleting chunks which are still referenced.
+//! Blob files contain data directly. They are used for config files and
+//! the like.
 //!
-//! * Read Index Files:
+//! Index files are used to reconstruct an original file. They contain a
+//! list of SHA256 checksums. The `DynamicIndex*` format is able to deal
+//! with dynamic chunk sizes (CT and host backups), whereas the
+//! `FixedIndex*` format is an optimization to store a list of equal sized
+//! chunks (VMs, whole block devices).
 //!
-//! Acquire shared lock for .idx files.
-//!
-//!
-//! * Delete Index Files:
-//!
-//! Acquire exclusive lock for .idx files. This makes sure that we do
-//! not delete index files while they are still in use.
-//!
-//!
-//! * Create Index Files:
-//!
-//! Acquire shared lock for ChunkStore (process wide).
-//!
-//! Note: When creating .idx files, we create temporary a (.tmp) file,
-//! then do an atomic rename ...
-//!
-//!
-//! * Garbage Collect:
-//!
-//! Acquire exclusive lock for ChunkStore (process wide). If we have
-//! already a shared lock for the ChunkStore, try to upgrade that
-//! lock.
-//!
-//!
-//! * Server Restart
-//!
-//! Try to abort the running garbage collection to release exclusive
-//! ChunkStore locks ASAP. Start the new service with the existing listening
-//! socket.
+//! A chunk is defined as a binary blob, which is stored inside a
+//! [ChunkStore](struct.ChunkStore.html) instead of the backup directory
+//! directly, and can be addressed by its SHA256 digest.
 //!
 //!
 //! # Garbage Collection (GC)
 //!
-//! Deleting backups is as easy as deleting the corresponding .idx
-//! files. Unfortunately, this does not free up any storage, because
-//! those files just contain references to chunks.
+//! Deleting backups is as easy as deleting the corresponding .idx files.
+//! However, this does not free up any storage, because those files just
+//! contain references to chunks.
 //!
 //! To free up some storage, we run a garbage collection process at
-//! regular intervals. The collector uses a mark and sweep
-//! approach. In the first phase, it scans all .idx files to mark used
-//! chunks. The second phase then removes all unmarked chunks from the
-//! store.
+//! regular intervals. The collector uses a mark and sweep approach. In
+//! the first phase, it scans all .idx files to mark used chunks. The
+//! second phase then removes all unmarked chunks from the store.
 //!
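+//! Conceptually, the collector does something along these lines (the
+//! helper names below are invented for this sketch and are not the real
+//! `DataStore`/`ChunkStore` API):
+//!
+//! ```ignore
+//! // Conceptual outline only; helper names are hypothetical.
+//! fn example_garbage_collect(datastore: &DataStore) -> Result<(), Error> {
+//!     // phase 1 (mark): touch every chunk referenced by any index file
+//!     for index in list_index_files(datastore)? {
+//!         for digest in read_chunk_digests(&index)? {
+//!             touch_chunk(datastore, &digest)?; // bumps the chunk's atime
+//!         }
+//!     }
+//!     // phase 2 (sweep): drop chunks whose atime is older than the cutoff
+//!     // (how the cutoff is chosen is described in the `atime` section below)
+//!     let cutoff = sweep_cutoff(datastore)?;
+//!     for chunk in list_chunks(datastore)? {
+//!         if chunk_atime(&chunk)? < cutoff {
+//!             remove_chunk(datastore, &chunk)?;
+//!         }
+//!     }
+//!     Ok(())
+//! }
+//! ```
+//!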
-//! The above locking mechanism makes sure that we are the only
-//! process running GC. But we still want to be able to create backups
-//! during GC, so there may be multiple backup threads/tasks
-//! running. Either started before GC started, or started while GC is
-//! running.
+//! The locking mechanisms mentioned below make sure that we are the only
+//! process running GC. We still want to be able to create backups during
+//! GC, so there may be multiple backup threads/tasks running, either
+//! started before GC, or while GC is running.
 //!
 //! ## `atime` based GC
 //!
 //! The idea here is to mark chunks by updating the `atime` (access
-//! timestamp) on the chunk file. This is quite simple and does not
-//! need additional RAM.
+//! timestamp) on the chunk file. This is quite simple and does not need
+//! additional RAM.
 //!
 //! One minor problem is that recent Linux versions use the `relatime`
-//! mount flag by default for performance reasons (yes, we want
-//! that). When enabled, `atime` data is written to the disk only if
-//! the file has been modified since the `atime` data was last updated
-//! (`mtime`), or if the file was last accessed more than a certain
-//! amount of time ago (by default 24h). So we may only delete chunks
-//! with `atime` older than 24 hours.
-//!
-//! Another problem arises from running backups. The mark phase does
-//! not find any chunks from those backups, because there is no .idx
-//! file for them (created after the backup). Chunks created or
-//! touched by those backups may have an `atime` as old as the start
-//! time of those backups. Please note that the backup start time may
-//! predate the GC start time. So we may only delete chunks older than
-//! the start time of those running backup jobs.
+//! mount flag by default for performance reasons (and we want that). When
+//! enabled, `atime` data is written to the disk only if the file has been
+//! modified since the `atime` data was last updated (`mtime`), or if the
+//! file was last accessed more than a certain amount of time ago (by
+//! default 24h). So we may only delete chunks with `atime` older than 24
+//! hours.
 //!
+//! Another problem arises from running backups. The mark phase does not
+//! find any chunks from those backups, because there is no .idx file for
+//! them (created after the backup). Chunks created or touched by those
+//! backups may have an `atime` as old as the start time of those backups.
+//! Please note that the backup start time may predate the GC start time.
+//! So we may only delete chunks older than the start time of those
+//! running backup jobs, which might be more than 24h back (this is the
+//! reason why ProcessLocker exclusive locks only have to be exclusive
+//! between processes, since within one we can determine the age of the
+//! oldest shared lock).
 //!
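+//! Putting those two rules together, a sketch of the sweep-phase check
+//! could look as follows (plain `std` only; the names and exact grace
+//! period are illustrative, not the actual GC code):
+//!
+//! ```ignore
+//! use std::time::{Duration, SystemTime};
+//!
+//! /// May this chunk be removed? `min_start` is the older of the GC start
+//! /// time and the start time of the oldest still-running backup.
+//! fn example_chunk_is_garbage(chunk: &std::path::Path, min_start: SystemTime)
+//!     -> std::io::Result<bool>
+//! {
+//!     let atime = std::fs::metadata(chunk)?.accessed()?;
+//!     // stay outside the 24h window that relatime may not have flushed yet
+//!     let cutoff = min_start - Duration::from_secs(24 * 3600);
+//!     Ok(atime < cutoff)
+//! }
+//! ```
+//!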
 //! ## Store `marks` in RAM using a HASH
 //!
-//! Not sure if this is better. TODO
+//! Might be better. Under investigation.
+//!
+//!
+//! # Locking
+//!
+//! Since PBS allows multiple potentially interfering operations at the
+//! same time (e.g. garbage collection, prune, multiple backup creations
+//! (only in separate groups), forget, ...), these need to lock against
+//! each other in certain scenarios. There is no overarching global lock,
+//! though; instead, the finest-grained lock possible is always used,
+//! because running these operations concurrently is treated as a feature
+//! in its own right.
+//!
+//! ## Inter-process Locking
+//!
+//! We need to be able to restart the proxmox-backup service daemons, so
+//! that we can update the software without rebooting the host. But such
+//! restarts must not abort running backup jobs, so we need to keep the
+//! old service running until those jobs are finished. This implies that
+//! we need some kind of locking for modifying chunks and indices in the
+//! ChunkStore.
+//!
+//! Please note that it is perfectly valid to have multiple
+//! parallel ChunkStore writers, even when they write the same chunk
+//! (because the chunk would have the same name and the same data, and
+//! writes are completed atomically via a rename). The only problem is
+//! garbage collection, because we need to avoid deleting chunks which are
+//! still referenced.
+//!
+//! To do this we use the
+//! [ProcessLocker](../tools/struct.ProcessLocker.html).
+//!
+//! ### ChunkStore-wide
+//!
+//! * Create Index Files:
+//!
+//! Acquire shared lock for ChunkStore.
+//!
+//! Note: When creating .idx files, we create a temporary .tmp file,
+//! then do an atomic rename (see the sketch before the locking table below).
+//!
+//! * Garbage Collect:
+//!
+//! Acquire exclusive lock for ChunkStore. If we already have a shared
+//! lock for the ChunkStore, try to upgrade that lock.
+//!
+//! Exclusive locks only work _between processes_. It is valid to have an
+//! exclusive and one or more shared locks held within one process. Writing
+//! chunks within one process is synchronized using the gc_mutex.
+//!
+//! On server restart, we stop any running GC in the old process to avoid
+//! having the exclusive lock held for too long.
+//!
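+//! The create-and-rename pattern mentioned above looks roughly like this
+//! (illustrative helper using plain `std`, not the actual index writer):
+//!
+//! ```ignore
+//! use std::path::Path;
+//!
+//! /// Write `data` to `path` such that readers never see a partial file.
+//! fn example_write_index_atomically(path: &Path, data: &[u8]) -> std::io::Result<()> {
+//!     let tmp = path.with_extension("tmp");
+//!     std::fs::write(&tmp, data)?;   // write the complete file first
+//!     std::fs::rename(&tmp, path)?;  // atomic rename makes it visible
+//!     Ok(())
+//! }
+//! ```
+//!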
+//! ## Locking table
+//!
+//! The table below shows all operations that play a role in locking, and
+//! which mechanisms are used to make their concurrent usage safe.
+//!
+//! | starting > v during | read index file | create index file | GC mark | GC sweep | read manifest | write manifest | forget | prune | create backup | verify | reader api |
+//! |-|-|-|-|-|-|-|-|-|-|-|-|
+//! | **read index file** | / | / | / | / | / | / | mmap stays valid, oldest_shared_lock prevents GC | see forget column | / | / | / |
+//! | **create index file** | / | / | / | / | / | / | / | / | /, happens at the end, after all chunks are touched | /, only happens without a manifest | / |
+//! | **GC mark** | / | Datastore process-lock shared | gc_mutex, exclusive ProcessLocker | gc_mutex | / | /, GC only cares about index files, not manifests | tells GC about removed chunks | see forget column | /, index files don’t exist yet | / | / |
+//! | **GC sweep** | / | Datastore process-lock shared | gc_mutex, exclusive ProcessLocker | gc_mutex | / | / | /, chunks already marked | see forget column | chunks get touched; chunk_store.mutex; oldest PL lock | / | / |
+//! | **read manifest** | / | / | / | / | update_manifest lock | update_manifest lock | update_manifest lock, remove dir under lock | see forget column | / | / | / |
+//! | **write manifest** | / | / | / | / | update_manifest lock | update_manifest lock | update_manifest lock | see forget column | /, “write manifest” happens at the end | /, can call “write manifest”, see that column | / |
+//! | **forget** | / | / | removed_during_gc mutex is held during unlink | marking done, doesn’t matter if forgotten now | / | update_manifest lock | /, unlink is atomic | causes forget to fail, but that’s OK | running backup has snapshot flock | /, potentially detects missing folder | shared snap flock |
+//! | **prune** | / | / | see forget row | see forget row | / | see forget row | causes warn in prune, but no error | see forget column | running and last non-running can’t be pruned | see forget row | shared snap flock |
+//! | **create backup** | / | only time this happens, thus has snapshot flock | / | chunks get touched; chunk_store.mutex; oldest PL lock | / | see “write manifest” row | snapshot flock, can’t be forgotten | running and last non-running can’t be pruned | snapshot group flock, only one running per group | /, won’t be verified since manifest missing | / |
+//! | **verify** | / | / | / | / | / | see “write manifest” row | /, potentially detects missing folder | see forget column | / | /, but useless (“write manifest” protects itself) | / |
+//! | **reader api** | / | / | / | /, open snap can’t be forgotten, so ref must exist | / | / | prevented by shared snap flock | prevented by shared snap flock | / | / | /, lock is shared |
+//!
+//! * / = no interaction
+//! * shared/exclusive from POV of 'starting' process
 
 use anyhow::{bail, Error};
-- 
2.20.1