From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id C5F8960332
 for ; Wed, 14 Oct 2020 14:17:22 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id A40349CCA
 for ; Wed, 14 Oct 2020 14:16:52 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 460849B16
 for ; Wed, 14 Oct 2020 14:16:49 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 0BD9845D4B
 for ; Wed, 14 Oct 2020 14:16:49 +0200 (CEST)
From: Stefan Reiter
To: pbs-devel@lists.proxmox.com
Date: Wed, 14 Oct 2020 14:16:39 +0200
Message-Id: <20201014121639.25276-12-s.reiter@proxmox.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20201014121639.25276-1-s.reiter@proxmox.com>
References: <20201014121639.25276-1-s.reiter@proxmox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results: 0
 AWL -0.038 Adjusted score from AWL reputation of From: address
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust
 SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [backup.rs, tablesgenerator.com]
Subject: [pbs-devel] [PATCH proxmox-backup 11/11] rustdoc: overhaul backup
 rustdoc and add locking table
X-BeenThere: pbs-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Backup Server development discussion
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Wed, 14 Oct 2020 12:17:22 -0000

Rewrite most of the documentation to be more readable and correct
(according to the current implementations).

Add a table visualizing all different locks used to synchronize
concurrent operations.

Signed-off-by: Stefan Reiter
---

FYI: I used https://www.tablesgenerator.com/markdown_tables for the table

 src/backup.rs | 201 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 121 insertions(+), 80 deletions(-)

diff --git a/src/backup.rs b/src/backup.rs
index 1b2180bc..95e45d10 100644
--- a/src/backup.rs
+++ b/src/backup.rs
@@ -1,107 +1,148 @@
-//! This module implements the proxmox backup data storage
+//! This module implements the data storage and access layer.
 //!
-//! Proxmox backup splits large files into chunks, and stores them
-//! deduplicated using a content addressable storage format.
+//! # Data formats
 //!
-//! A chunk is simply defined as binary blob, which is stored inside a
-//! `ChunkStore`, addressed by the SHA256 digest of the binary blob.
+//! PBS splits large files into chunks, and stores them deduplicated using
+//! a content addressable storage format.
 //!
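+//! As a rough illustration of what "content addressable" means here, a
+//! chunk's storage location is derived from the digest of its data. The
+//! `sha2`/`hex` crates and the directory layout below are used purely as
+//! an example, not the actual `ChunkStore` implementation:
+//!
+//! ```ignore
+//! use sha2::{Digest, Sha256};
+//! use std::path::{Path, PathBuf};
+//!
+//! /// Illustrative only: map chunk data to a path inside the store.
+//! fn example_chunk_path(store_base: &Path, chunk_data: &[u8]) -> PathBuf {
+//!     let digest = hex::encode(Sha256::digest(chunk_data));
+//!     // group chunks by a short digest prefix to keep directories small
+//!     store_base.join(&digest[..4]).join(digest)
+//! }
+//! ```
+//!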
-//! Index files are used to reconstruct the original file. They
-//! basically contain a list of SHA256 checksums. The `DynamicIndex*`
-//! format is able to deal with dynamic chunk sizes, whereas the
-//! `FixedIndex*` format is an optimization to store a list of equal
-//! sized chunks.
+//! Backup snapshots are stored as folders containing a manifest file and
+//! potentially one or more index or blob files.
 //!
-//! # ChunkStore Locking
+//! The manifest contains hashes of all other files and can be signed by
+//! the client.
 //!
-//! We need to be able to restart the proxmox-backup service daemons,
-//! so that we can update the software without rebooting the host. But
-//! such restarts must not abort running backup jobs, so we need to
-//! keep the old service running until those jobs are finished. This
-//! implies that we need some kind of locking for the
-//! ChunkStore. Please note that it is perfectly valid to have
-//! multiple parallel ChunkStore writers, even when they write the
-//! same chunk (because the chunk would have the same name and the
-//! same data). The only real problem is garbage collection, because
-//! we need to avoid deleting chunks which are still referenced.
+//! Blob files contain data directly. They are used for config files and
+//! the like.
 //!
-//! * Read Index Files:
+//! Index files are used to reconstruct an original file. They contain a
+//! list of SHA256 checksums. The `DynamicIndex*` format is able to deal
+//! with dynamic chunk sizes (CT and host backups), whereas the
+//! `FixedIndex*` format is an optimization to store a list of equal sized
+//! chunks (VMs, whole block devices).
 //!
-//! Acquire shared lock for .idx files.
-//!
-//!
-//! * Delete Index Files:
-//!
-//! Acquire exclusive lock for .idx files. This makes sure that we do
-//! not delete index files while they are still in use.
-//!
-//!
-//! * Create Index Files:
-//!
-//! Acquire shared lock for ChunkStore (process wide).
-//!
-//! Note: When creating .idx files, we create temporary a (.tmp) file,
-//! then do an atomic rename ...
-//!
-//!
-//! * Garbage Collect:
-//!
-//! Acquire exclusive lock for ChunkStore (process wide). If we have
-//! already a shared lock for the ChunkStore, try to upgrade that
-//! lock.
-//!
-//!
-//! * Server Restart
-//!
-//! Try to abort the running garbage collection to release exclusive
-//! ChunkStore locks ASAP. Start the new service with the existing listening
-//! socket.
+//! A chunk is defined as a binary blob, which is stored inside a
+//! [ChunkStore](struct.ChunkStore.html) instead of the backup directory
+//! directly, and can be addressed by its SHA256 digest.
 //!
 //!
 //! # Garbage Collection (GC)
 //!
-//! Deleting backups is as easy as deleting the corresponding .idx
-//! files. Unfortunately, this does not free up any storage, because
-//! those files just contain references to chunks.
+//! Deleting backups is as easy as deleting the corresponding .idx files.
+//! However, this does not free up any storage, because those files just
+//! contain references to chunks.
 //!
 //! To free up some storage, we run a garbage collection process at
-//! regular intervals. The collector uses a mark and sweep
-//! approach. In the first phase, it scans all .idx files to mark used
-//! chunks. The second phase then removes all unmarked chunks from the
-//! store.
+//! regular intervals. The collector uses a mark and sweep approach. In
+//! the first phase, it scans all .idx files to mark used chunks. The
+//! second phase then removes all unmarked chunks from the store.
 //!
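+//! Conceptually, the collector does something along these lines (the
+//! helper names below are invented for this sketch and are not the real
+//! `DataStore`/`ChunkStore` API):
+//!
+//! ```ignore
+//! // Conceptual outline only; helper names are hypothetical.
+//! fn example_garbage_collect(datastore: &DataStore) -> Result<(), Error> {
+//!     // phase 1 (mark): touch every chunk referenced by any index file
+//!     for index in list_index_files(datastore)? {
+//!         for digest in read_chunk_digests(&index)? {
+//!             touch_chunk(datastore, &digest)?; // bumps the chunk's atime
+//!         }
+//!     }
+//!     // phase 2 (sweep): drop chunks whose atime is older than the cutoff
+//!     // (how the cutoff is chosen is described in the `atime` section below)
+//!     let cutoff = sweep_cutoff(datastore)?;
+//!     for chunk in list_chunks(datastore)? {
+//!         if chunk_atime(&chunk)? < cutoff {
+//!             remove_chunk(datastore, &chunk)?;
+//!         }
+//!     }
+//!     Ok(())
+//! }
+//! ```
+//!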
-//! The above locking mechanism makes sure that we are the only
-//! process running GC. But we still want to be able to create backups
-//! during GC, so there may be multiple backup threads/tasks
-//! running. Either started before GC started, or started while GC is
-//! running.
+//! The locking mechanisms mentioned below make sure that we are the only
+//! process running GC. We still want to be able to create backups during
+//! GC, so there may be multiple backup threads/tasks running, either
+//! started before GC, or while GC is running.
 //!
 //! ## `atime` based GC
 //!
 //! The idea here is to mark chunks by updating the `atime` (access
-//! timestamp) on the chunk file. This is quite simple and does not
-//! need additional RAM.
+//! timestamp) on the chunk file. This is quite simple and does not need
+//! additional RAM.
 //!
 //! One minor problem is that recent Linux versions use the `relatime`
-//! mount flag by default for performance reasons (yes, we want
-//! that). When enabled, `atime` data is written to the disk only if
-//! the file has been modified since the `atime` data was last updated
-//! (`mtime`), or if the file was last accessed more than a certain
-//! amount of time ago (by default 24h). So we may only delete chunks
-//! with `atime` older than 24 hours.
-//!
-//! Another problem arises from running backups. The mark phase does
-//! not find any chunks from those backups, because there is no .idx
-//! file for them (created after the backup). Chunks created or
-//! touched by those backups may have an `atime` as old as the start
-//! time of those backups. Please note that the backup start time may
-//! predate the GC start time. So we may only delete chunks older than
-//! the start time of those running backup jobs.
+//! mount flag by default for performance reasons (and we want that). When
+//! enabled, `atime` data is written to the disk only if the file has been
+//! modified since the `atime` data was last updated (`mtime`), or if the
+//! file was last accessed more than a certain amount of time ago (by
+//! default 24h). So we may only delete chunks with `atime` older than 24
+//! hours.
 //!
+//! Another problem arises from running backups. The mark phase does not
+//! find any chunks from those backups, because there is no .idx file for
+//! them (created after the backup). Chunks created or touched by those
+//! backups may have an `atime` as old as the start time of those backups.
+//! Please note that the backup start time may predate the GC start time.
+//! So we may only delete chunks older than the start time of those
+//! running backup jobs, which might be more than 24h back (this is the
+//! reason why ProcessLocker exclusive locks only have to be exclusive
+//! between processes, since within one we can determine the age of the
+//! oldest shared lock).
 //!
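+//! Putting those two rules together, a sketch of the sweep-phase check
+//! could look as follows (plain `std` only; the names and exact grace
+//! period are illustrative, not the actual GC code):
+//!
+//! ```ignore
+//! use std::time::{Duration, SystemTime};
+//!
+//! /// May this chunk be removed? `min_start` is the older of the GC start
+//! /// time and the start time of the oldest still-running backup.
+//! fn example_chunk_is_garbage(chunk: &std::path::Path, min_start: SystemTime)
+//!     -> std::io::Result<bool>
+//! {
+//!     let atime = std::fs::metadata(chunk)?.accessed()?;
+//!     // stay outside the 24h window that relatime may not have flushed yet
+//!     let cutoff = min_start - Duration::from_secs(24 * 3600);
+//!     Ok(atime < cutoff)
+//! }
+//! ```
+//!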
 //! ## Store `marks` in RAM using a HASH
 //!
-//! Not sure if this is better. TODO
+//! Might be better. Under investigation.
+//!
+//!
+//! # Locking
+//!
+//! Since PBS allows multiple potentially interfering operations at the
+//! same time (e.g. garbage collection, prune, multiple backup creations
+//! (only in separate groups), forget, ...), these need to lock against
+//! each other in certain scenarios. There is no overarching global lock,
+//! though; instead, the finest-grained lock possible is always used,
+//! because running these operations concurrently is treated as a feature
+//! in its own right.
+//!
+//! ## Inter-process Locking
+//!
+//! We need to be able to restart the proxmox-backup service daemons, so
+//! that we can update the software without rebooting the host. But such
+//! restarts must not abort running backup jobs, so we need to keep the
+//! old service running until those jobs are finished. This implies that
+//! we need some kind of locking for modifying chunks and indices in the
+//! ChunkStore.
+//!
+//! Please note that it is perfectly valid to have multiple
+//! parallel ChunkStore writers, even when they write the same chunk
+//! (because the chunk would have the same name and the same data, and
+//! writes are completed atomically via a rename). The only problem is
+//! garbage collection, because we need to avoid deleting chunks which are
+//! still referenced.
+//!
+//! To do this we use the
+//! [ProcessLocker](../tools/struct.ProcessLocker.html).
+//!
+//! ### ChunkStore-wide
+//!
+//! * Create Index Files:
+//!
+//! Acquire shared lock for ChunkStore.
+//!
+//! Note: When creating .idx files, we create a temporary .tmp file,
+//! then do an atomic rename (see the sketch before the locking table below).
+//!
+//! * Garbage Collect:
+//!
+//! Acquire exclusive lock for ChunkStore. If we already have a shared
+//! lock for the ChunkStore, try to upgrade that lock.
+//!
+//! Exclusive locks only work _between processes_. It is valid to have an
+//! exclusive and one or more shared locks held within one process. Writing
+//! chunks within one process is synchronized using the gc_mutex.
+//!
+//! On server restart, we stop any running GC in the old process to avoid
+//! having the exclusive lock held for too long.
+//!
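+//! The create-and-rename pattern mentioned above looks roughly like this
+//! (illustrative helper using plain `std`, not the actual index writer):
+//!
+//! ```ignore
+//! use std::path::Path;
+//!
+//! /// Write `data` to `path` such that readers never see a partial file.
+//! fn example_write_index_atomically(path: &Path, data: &[u8]) -> std::io::Result<()> {
+//!     let tmp = path.with_extension("tmp");
+//!     std::fs::write(&tmp, data)?;   // write the complete file first
+//!     std::fs::rename(&tmp, path)?;  // atomic rename makes it visible
+//!     Ok(())
+//! }
+//! ```
+//!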
+//! ## Locking table
+//!
+//! The table below shows all operations that play a role in locking, and
+//! which mechanisms are used to make their concurrent usage safe.
+//!
+//! | starting > v during | read index file | create index file | GC mark | GC sweep | read manifest | write manifest | forget | prune | create backup | verify | reader api |
+//! |-|-|-|-|-|-|-|-|-|-|-|-|
+//! | **read index file** | / | / | / | / | / | / | mmap stays valid, oldest_shared_lock prevents GC | see forget column | / | / | / |
+//! | **create index file** | / | / | / | / | / | / | / | / | /, happens at the end, after all chunks are touched | /, only happens without a manifest | / |
+//! | **GC mark** | / | Datastore process-lock shared | gc_mutex, exclusive ProcessLocker | gc_mutex | / | /, GC only cares about index files, not manifests | tells GC about removed chunks | see forget column | /, index files don’t exist yet | / | / |
+//! | **GC sweep** | / | Datastore process-lock shared | gc_mutex, exclusive ProcessLocker | gc_mutex | / | / | /, chunks already marked | see forget column | chunks get touched; chunk_store.mutex; oldest PL lock | / | / |
+//! | **read manifest** | / | / | / | / | update_manifest lock | update_manifest lock | update_manifest lock, remove dir under lock | see forget column | / | / | / |
+//! | **write manifest** | / | / | / | / | update_manifest lock | update_manifest lock | update_manifest lock | see forget column | /, “write manifest” happens at the end | /, can call “write manifest”, see that column | / |
+//! | **forget** | / | / | removed_during_gc mutex is held during unlink | marking done, doesn’t matter if forgotten now | / | update_manifest lock | /, unlink is atomic | causes forget to fail, but that’s OK | running backup has snapshot flock | /, potentially detects missing folder | shared snap flock |
+//! | **prune** | / | / | see forget row | see forget row | / | see forget row | causes warn in prune, but no error | see forget column | running and last non-running can’t be pruned | see forget row | shared snap flock |
+//! | **create backup** | / | only time this happens, thus has snapshot flock | / | chunks get touched; chunk_store.mutex; oldest PL lock | / | see “write manifest” row | snapshot flock, can’t be forgotten | running and last non-running can’t be pruned | snapshot group flock, only one running per group | /, won’t be verified since manifest missing | / |
+//! | **verify** | / | / | / | / | / | see “write manifest” row | /, potentially detects missing folder | see forget column | / | /, but useless (“write manifest” protects itself) | / |
+//! | **reader api** | / | / | / | /, open snap can’t be forgotten, so ref must exist | / | / | prevented by shared snap flock | prevented by shared snap flock | / | / | /, lock is shared |
+//!
+//! * / = no interaction
+//! * shared/exclusive from POV of 'starting' process
 
 use anyhow::{bail, Error};
-- 
2.20.1