From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 22 Aug 2023 13:56:21 +0200
From: Wolfgang Bumiller
To: Max Carrara
Cc: Proxmox VE development discussion, Lukas Wagner
Message-ID: <2cg2yzartrq2r6mmednb6eqobulfmdtf733wm4mnsscwpyo5ty@zoeoswy5o2fp>
References: <20230821134444.620021-1-l.wagner@proxmox.com>
 <20230821134444.620021-5-l.wagner@proxmox.com>
Subject: Re: [pve-devel] [RFC proxmox 4/7] cache: add new crate 'proxmox-cache'
List-Id: Proxmox VE development discussion

On Tue, Aug 22, 2023 at 12:08:38PM +0200, Max Carrara wrote:
> On 8/21/23 15:44, Lukas Wagner wrote:
> > For now, it contains a file-backed cache with expiration logic.
> > The cache should be safe to be accessed from multiple processes at
> > once.
>
> This seems pretty neat! The cache implementation seems straightforward
> enough. I'll see if I can test it more thoroughly later.
>
> However, in my opinion we should have a crate like
> "proxmox-collections" (or something of the sort) with modules for each
> data structure / collection, similar to the standard library; I'm
> curious what others think about that. IMO it would be a great
> opportunity to introduce that crate in this series, since you're
> already introducing one for the cache anyway.
>
> So, proxmox-collections would look something like this:
>
> proxmox-collections
> └── src
>     ├── cache
>     │   ├── mod.rs
>     │   └── shared_cache.rs
>     └── lib.rs
>
> Let me know what you think!
Smaller crates help with compile time, I don't want to pull in a big
collections crate if I need one single type of collection from it ;-)

> > The cache stores values in a directory, based on the key.
> > E.g. key "foo" results in a file 'foo.json' in the given base
> > directory. If a new value is set, the file is atomically replaced.
> > The JSON file also contains some metadata, namely 'added_at' and
> > 'expire_in' - they are used for cache expiration.
> >
> > Note: This cache is not suited to applications that
> >   - Might want to cache huge amounts of data, and/or access the cache
> >     very frequently (due to the overhead of JSON de/serialization)
> >   - Require arbitrary keys - right now, keys are limited by
> >     SAFE_ID_REGEX
> >
> > The cache was developed for use in pvestatd, in order to cache
> > e.g. storage plugin status. There, these limitations do not really
> > play any role.
> >
> > Signed-off-by: Lukas Wagner
> > ---
> >  Cargo.toml                            |   1 +
> >  proxmox-cache/Cargo.toml              |  20 ++
> >  proxmox-cache/examples/performance.rs |  82 ++++++++
> >  proxmox-cache/src/lib.rs              |  40 ++++
> >  proxmox-cache/src/shared_cache.rs     | 263 ++++++++++++++++++++++++++
> >  5 files changed, 406 insertions(+)
> >  create mode 100644 proxmox-cache/Cargo.toml
> >  create mode 100644 proxmox-cache/examples/performance.rs
> >  create mode 100644 proxmox-cache/src/lib.rs
> >  create mode 100644 proxmox-cache/src/shared_cache.rs
> >
> > diff --git a/Cargo.toml b/Cargo.toml
> > index e334ac1..940e1d0 100644
> > --- a/Cargo.toml
> > +++ b/Cargo.toml
> > @@ -5,6 +5,7 @@ members = [
> >      "proxmox-async",
> >      "proxmox-auth-api",
> >      "proxmox-borrow",
> > +    "proxmox-cache",
> >      "proxmox-client",
> >      "proxmox-compression",
> >      "proxmox-http",
> >
> > diff --git a/proxmox-cache/src/lib.rs b/proxmox-cache/src/lib.rs
> > new file mode 100644
> > index 0000000..d496dc7
> > --- /dev/null
> > +++ b/proxmox-cache/src/lib.rs
> > @@ -0,0 +1,40 @@
> > +use anyhow::Error;
> > +use serde_json::Value;
> > +
> > +pub mod shared_cache;
> > +
> > +pub use shared_cache::SharedCache;
> > +
> > +trait TimeProvider {

Also mentioned off list: this trait will be dropped.
For tests, a `#[cfg(test)]` alternative will be used for fetching the
time. We just never need a different provider outside of tests, and
storing it as `Box<dyn TimeProvider>` immediately drops all the auto
traits from the cache struct, since trait objects *only* ever get the
auto traits you *explicitly* define.

> > +    /// Returns the current time as a UNIX epoch (second resolution)
> > +    fn now(&self) -> i64;
> > +}
> > +
> > +struct DefaultTimeProvider;
> > +
> > +impl TimeProvider for DefaultTimeProvider {
> > +    fn now(&self) -> i64 {
> > +        proxmox_time::epoch_i64()
> > +    }
> > +}
> > +
> > +pub trait Cache {
> > +    /// Set or insert a cache entry.
> > +    ///
> > +    /// If `expires_in` is set, this entry will expire in the desired number of
> > +    /// seconds.
> > +    fn set<S: AsRef<str>>(
> > +        &self,
> > +        key: S,
> > +        value: Value,
> > +        expires_in: Option<i64>,
> > +    ) -> Result<(), Error>;
> > +
> > +    /// Delete a cache entry.
> > +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error>;
> > +
> > +    /// Get a value from the cache.
> > +    ///
> > +    /// Expired entries will *not* be returned.
> > +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error>;
> > +}
>
> I don't necessarily think that a trait would be necessary in this
> case, as there's not really any other structure (that can be used as
> a caching mechanism) that you're abstracting over. (more below)

Already somewhat mentioned this off list. This should not be a trait
for now. If we ever need one we can still add one.
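(To illustrate the auto-trait point from above -- a minimal, made-up
example, not from the patch: the containing struct only stays Send/Sync
if the trait object spells those bounds out explicitly.)

    trait TimeProvider {
        fn now(&self) -> i64;
    }

    struct CacheA {
        // No explicit auto-trait bounds: `dyn TimeProvider` is neither
        // Send nor Sync, so CacheA isn't either.
        time_provider: Box<dyn TimeProvider>,
    }

    struct CacheB {
        // Explicit bounds bring Send/Sync back for the containing struct.
        time_provider: Box<dyn TimeProvider + Send + Sync>,
    }

    fn assert_send_sync<T: Send + Sync>() {}

    fn main() {
        assert_send_sync::<CacheB>(); // compiles
        // assert_send_sync::<CacheA>(); // would not compile
    }

With the `#[cfg(test)]` approach the box (and with it this whole
question) simply goes away.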
> > diff --git a/proxmox-cache/src/shared_cache.rs b/proxmox-cache/src/shared_cache.rs
> > new file mode 100644
> > index 0000000..be6212c
> > --- /dev/null
> > +++ b/proxmox-cache/src/shared_cache.rs
> > @@ -0,0 +1,263 @@
> > +use std::path::{Path, PathBuf};
> > +
> > +use anyhow::{bail, Error};
> > +use serde::{Deserialize, Serialize};
> > +use serde_json::Value;
> > +
> > +use proxmox_schema::api_types::SAFE_ID_FORMAT;
> > +use proxmox_sys::fs::CreateOptions;
> > +
> > +use crate::{Cache, DefaultTimeProvider, TimeProvider};
> > +
> > +/// A simple, file-backed cache that can be used from multiple processes concurrently.
> > +///
> > +/// Cache entries are stored as individual files inside a base directory. For instance,
> > +/// a cache entry with the key 'disk_stats' will result in a file 'disk_stats.json' inside
> > +/// the base directory. As the extension implies, the cached data will be stored as a JSON
> > +/// string.
> > +///
> > +/// For optimal performance, `SharedCache` should have its base directory in a `tmpfs`.
> > +///
> > +/// ## Key Space
> > +/// Due to the fact that cache keys are being directly used as filenames, they have to match the
> > +/// following regular expression: `[A-Za-z0-9_][A-Za-z0-9._\-]*`
> > +///
> > +/// ## Concurrency
> > +/// All cache operations are based on atomic file operations, thus accessing/updating the cache from
> > +/// multiple processes at the same time is safe.
> > +///
> > +/// ## Performance
> > +/// On a tmpfs:
> > +/// ```sh
> > +/// $ cargo run --release --example=performance
> > +/// inserting 100000 keys took 896.609758ms (8.966µs per key)
> > +/// getting 100000 keys took 584.874842ms (5.848µs per key)
> > +/// deleting 100000 keys took 247.742702ms (2.477µs per key)
> > +///
> > +/// Inserting/getting large objects might of course result in lower performance due to the cost
> > +/// of serialization.
> > +/// ```
> > +///
> > +pub struct SharedCache {
> > +    base_path: PathBuf,
> > +    time_provider: Box<dyn TimeProvider>,
> > +    create_options: CreateOptions,
> > +}
>
> Instead, this should be generic:
>
> pub struct SharedCache<K> { ... }
>
> .. and maybe rename it to SharedFileCache to make it explicit that this
> operates on a file. (but that's more dependent on one's taste tbh)
>
> .. and the impl block below ...

The main issue here is that it'll all boil down to one general "json
thingy", because it'll be used from out of perl, where, if we have
multiple types here to instantiate, we'll need to create separate
`perlmod #[export]`s for each and every type we need in our perl code.

OTOH, we can still have the crate side here generic and have the
perl-rs side just instantiate a single `SharedCache`...

A generic `K` might make it annoying to work with paths, though, so IMO
that can just stay a string ;-)

> > +
> > +impl Cache for SharedCache {
> > +    fn set<S: AsRef<str>>(
> > +        &self,
> > +        key: S,
> > +        value: Value,

In any case, I'd definitely argue against `Value` here and just go with
a generic `V: Serialize` at least in the methods. Because why serialize
twice? (Once from some struct to Value, then from Value to json.)
Otherwise we could also just store `&[u8]` and have this be nothing
more than a thing to attach timeouts to bytes in files ;-)
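Roughly what I mean -- an untested sketch, not the patch's code: key
validation and the atomic `replace_file()` from `proxmox_sys` are left
out, and the names here are made up for illustration only:

    use std::path::PathBuf;

    use anyhow::Error;
    use serde::Serialize;

    // Generic over the payload, so the caller's type is serialized exactly
    // once (struct -> JSON bytes) instead of struct -> Value -> JSON.
    #[derive(Serialize)]
    struct CachedItem<V> {
        value: V,
        added_at: i64,
        expires_in: Option<i64>,
    }

    // Stripped-down stand-in for the patch's SharedCache.
    pub struct SharedCache {
        base_path: PathBuf,
    }

    impl SharedCache {
        pub fn set<V: Serialize>(
            &self,
            key: &str,
            value: V,
            expires_in: Option<i64>,
        ) -> Result<(), Error> {
            let item = CachedItem {
                value,
                added_at: proxmox_time::epoch_i64(),
                expires_in,
            };
            // Single serialization step, straight to the bytes we write out.
            let serialized = serde_json::to_vec_pretty(&item)?;
            let path = self.base_path.join(format!("{key}.json"));
            // Plain write as a placeholder for the atomic replace_file().
            std::fs::write(path, serialized)?;
            Ok(())
        }
    }

The matching `get` would then take a `V: DeserializeOwned` (or borrow
via `RawValue`, see further down).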
> > +        expires_in: Option<i64>,
> > +    ) -> Result<(), Error> {
> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +        let added_at = self.time_provider.now();
> > +
> > +        let item = CachedItem {
> > +            value,
> > +            added_at,
> > +            expires_in,
> > +        };
> > +
> > +        let serialized = serde_json::to_vec_pretty(&item)?;
> > +
> > +        // Atomically replace file
> > +        proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?;
> > +        Ok(())
> > +    }
> > +
> > +    fn delete<S: AsRef<str>>(&self, key: S) -> Result<(), Error> {
> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +        std::fs::remove_file(path)?;
> > +        Ok(())
> > +    }
> > +
> > +    fn get<S: AsRef<str>>(&self, key: S) -> Result<Option<Value>, Error> {

Same here of course.

> > +        let path = self.get_path_for_key(key.as_ref())?;
> > +
> > +        let value = if let Some(content) = proxmox_sys::fs::file_get_optional_contents(path)? {
> > +            let value: CachedItem = serde_json::from_slice(&content)?;

`CachedItem` would need to be generic, and here you could use
`CachedItem<&'_ serde_json::RawValue>`, which *should* just refer to a
subslice of `content` here, to defer the payload parsing to later.
This way you avoid at least *some* overhead of the intermediate `Value`
while still being able to access the value's time.

Alternatively we could think of a more efficient format where the added
and expires fields are in a fixed-sized binary header... but it
shouldn't matter much.

> > +
> > +            let now = self.time_provider.now();
> > +
> > +            if let Some(expires_in) = value.expires_in {
> > +                // Check if value is not expired yet. Also do not allow
> > +                // values from the future, in case we have clock jumps
> > +                if value.added_at + expires_in > now && value.added_at <= now {
> > +                    Some(value.value)
> > +                } else {
> > +                    None
> > +                }
> > +            } else {
> > +                Some(value.value)
> > +            }
> > +        } else {
> > +            None
> > +        };
> > +
> > +        Ok(value)
> > +    }
> > +}
>
> ... can be replaced as follows, in order to make it similar to
> std::collections::{HashMap, BTreeMap}:
>
> impl<K: AsRef<str>> SharedCache<K> {
>     // Returns old value on successful insert, if given
>     fn insert(&self, k: K, v: Value) -> Result<Option<Value>, Error> {
>         // ...
>     }
>
>     fn get(&self, k: K) -> Result<Option<Value>, Error> {
>         // ...
>     }
>
>     fn remove(&self, k: K) -> Result<Option<Value>, Error> {
>         // ...
>     }
> }
>
> If necessary / sensible, other methods (inspired by {HashMap, BTreeMap})
> can be added as well, such as remove_entry, retain, clear, etc.

Using the naming found in std sure makes sense.
But please hold off on the `Entry` API for now, that one will bite ya ;-)
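For illustration, the `RawValue` variant mentioned above could look
roughly like this -- untested sketch, `get_unexpired` is a made-up
stand-in for the relevant part of `get()`:

    use anyhow::Error;
    use serde::Deserialize;
    // Needs serde_json's `raw_value` feature.
    use serde_json::value::RawValue;

    // Deserializing as CachedItem<&RawValue> only parses the metadata;
    // the payload stays an unparsed JSON slice borrowed from `content`.
    #[derive(Deserialize)]
    struct CachedItem<V> {
        value: V,
        added_at: i64,
        expires_in: Option<i64>,
    }

    // Returns the raw payload if the entry has not expired and is not
    // "from the future" (clock jumps), mirroring the check in the patch.
    fn get_unexpired<'a>(content: &'a [u8], now: i64) -> Result<Option<&'a RawValue>, Error> {
        let item: CachedItem<&RawValue> = serde_json::from_slice(content)?;
        let valid = match item.expires_in {
            Some(expires_in) => item.added_at + expires_in > now && item.added_at <= now,
            None => true,
        };
        Ok(valid.then_some(item.value))
    }

The caller can then parse the payload lazily via
`serde_json::from_str::<T>(raw.get())`, and only when it actually needs
it.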