From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id EB8A38B12 for ; Tue, 22 Aug 2023 15:53:06 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id C3FB9DC69 for ; Tue, 22 Aug 2023 15:52:36 +0200 (CEST) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS for ; Tue, 22 Aug 2023 15:52:34 +0200 (CEST) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 529094334D for ; Tue, 22 Aug 2023 15:52:34 +0200 (CEST) Message-ID: <7e71d46e-554d-6329-61a7-fea37ec0a666@proxmox.com> Date: Tue, 22 Aug 2023 15:52:32 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0 Content-Language: en-US To: Wolfgang Bumiller Cc: Proxmox VE development discussion , Lukas Wagner References: <20230821134444.620021-1-l.wagner@proxmox.com> <20230821134444.620021-5-l.wagner@proxmox.com> <2cg2yzartrq2r6mmednb6eqobulfmdtf733wm4mnsscwpyo5ty@zoeoswy5o2fp> From: Max Carrara In-Reply-To: <2cg2yzartrq2r6mmednb6eqobulfmdtf733wm4mnsscwpyo5ty@zoeoswy5o2fp> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-SPAM-LEVEL: Spam detection results: 0 AWL 1.637 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment NICE_REPLY_A -3.374 Looks like a legit reply (A) SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [lib.rs, mod.rs, performance.rs, rust-lang.org] Subject: Re: [pve-devel] [RFC proxmox 4/7] cache: add new crate 'proxmox-cache' X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2023 13:53:07 -0000 On 8/22/23 13:56, Wolfgang Bumiller wrote: > On Tue, Aug 22, 2023 at 12:08:38PM +0200, Max Carrara wrote: >> On 8/21/23 15:44, Lukas Wagner wrote: >>> For now, it contains a file-backed cache with expiration logic. >>> The cache should be safe to be accessed from multiple processes at >>> once. >>> >> >> This seems pretty neat! The cache implementation seems straightforward >> enough. I'll see if I can test it more thoroughly later. >> >> However, in my opinion we should have a crate like >> "proxmox-collections" (or something of the sort) with modules for each >> data structure / collection similar to the standard library; I'm >> curious what others think about that. imo it would be a great >> opportunity to introduce that crate in this series, since you're >> already introducing one for the cache anyway. >> >> So, proxmox-collections would look something like this: >> >> proxmox-collections >> └── src >> ├── cache >> │   ├── mod.rs >> │   └── shared_cache.rs >> └── lib.rs >> >> Let me know what you think! > > Smaller crates help with compile time, I don't want to pull in a big > collections crate if I need one single type of collection from it ;-) > They do indeed, but in this case I'd prefer a single crate that guards each collection behind a feature, which has essentially the same effect. Or as an alternative, let proxmox-collections be a meta-crate that re-exports smaller crates in its own namespace; let those smaller crates then be optional dependencies which are only pulled in and compiled when their respective features are specified[0]. I'm not sure how the latter works in conjunction with deb packaging, however. That being said, I'm otherwise not opposed against proxmox-cache becoming its own crate, but I do think that we should group our code logically in some way. >> >>> The cache stores values in a directory, based on the key. >>> E.g. key "foo" results in a file 'foo.json' in the given base >>> directory. If a new value is set, the file is atomically replaced. >>> The JSON file also contains some metadata, namely 'added_at' and >>> 'expire_in' - they are used for cache expiration. >>> >>> Note: This cache is not suited to applications that >>> - Might want to cache huge amounts of data, and/or access the cache >>> very frequently (due to the overhead of JSON de/serialization) >>> - Require arbitrary keys - right now, keys are limited by >>> SAFE_ID_REGEX >>> >>> The cache was developed for the use in pvestatd, in order to cache >>> e.g. storage plugin status. There, these limitations do not really >>> play any role. >>> >>> Signed-off-by: Lukas Wagner >>> --- >>> Cargo.toml | 1 + >>> proxmox-cache/Cargo.toml | 20 ++ >>> proxmox-cache/examples/performance.rs | 82 ++++++++ >>> proxmox-cache/src/lib.rs | 40 ++++ >>> proxmox-cache/src/shared_cache.rs | 263 ++++++++++++++++++++++++++ >>> 5 files changed, 406 insertions(+) >>> create mode 100644 proxmox-cache/Cargo.toml >>> create mode 100644 proxmox-cache/examples/performance.rs >>> create mode 100644 proxmox-cache/src/lib.rs >>> create mode 100644 proxmox-cache/src/shared_cache.rs >>> >>> diff --git a/Cargo.toml b/Cargo.toml >>> index e334ac1..940e1d0 100644 >>> --- a/Cargo.toml >>> +++ b/Cargo.toml >>> @@ -5,6 +5,7 @@ members = [ >>> "proxmox-async", >>> "proxmox-auth-api", >>> "proxmox-borrow", >>> + "proxmox-cache", >>> "proxmox-client", >>> "proxmox-compression", >>> "proxmox-http", >>> diff --git a/proxmox-cache/src/lib.rs b/proxmox-cache/src/lib.rs >>> new file mode 100644 >>> index 0000000..d496dc7 >>> --- /dev/null >>> +++ b/proxmox-cache/src/lib.rs >>> @@ -0,0 +1,40 @@ >>> +use anyhow::Error; >>> +use serde_json::Value; >>> + >>> +pub mod shared_cache; >>> + >>> +pub use shared_cache::SharedCache; >>> + >>> +trait TimeProvider { > > Also mentioned off list: this trait will be dropped. > For tests, a `#[cfg(test)]` alternative will be used for fetching the > time. > > We just never need a different provider outside of tests, and storing it > as `Box` immediately drops all the auto traits from > the cache struct, since trait objects *only* ever get the auto traits > you *explicitly* define. > >>> + /// Returns the current time as a UNIX epoch (second resolution) >>> + fn now(&self) -> i64; >>> +} >>> + >>> +struct DefaultTimeProvider; >>> + >>> +impl TimeProvider for DefaultTimeProvider { >>> + fn now(&self) -> i64 { >>> + proxmox_time::epoch_i64() >>> + } >>> +} >>> + >>> +pub trait Cache { >>> + /// Set or insert a cache entry. >>> + /// >>> + /// If `expires_in` is set, this entry will expire in the desired number of >>> + /// seconds. >>> + fn set>( >>> + &self, >>> + key: S, >>> + value: Value, >>> + expires_in: Option, >>> + ) -> Result<(), Error>; >>> + >>> + /// Delete a cache entry. >>> + fn delete>(&self, key: S) -> Result<(), Error>; >>> + >>> + /// Get a value from the cache. >>> + /// >>> + /// Expired entries will *not* be returned. >>> + fn get>(&self, key: S) -> Result, Error>; >>> +} >> >> I don't necessarily think that a trait would be necessary in this >> case, as there's not really any other structure (that can be used as >> caching mechanism) that you're abstracting over. (more below) > > Already somewhat mentioned this off list. This should not be a trait for > now. If we ever need one we can still add one. > >> >>> diff --git a/proxmox-cache/src/shared_cache.rs b/proxmox-cache/src/shared_cache.rs >>> new file mode 100644 >>> index 0000000..be6212c >>> --- /dev/null >>> +++ b/proxmox-cache/src/shared_cache.rs >>> @@ -0,0 +1,263 @@ >>> +use std::path::{Path, PathBuf}; >>> + >>> +use anyhow::{bail, Error}; >>> +use serde::{Deserialize, Serialize}; >>> +use serde_json::Value; >>> + >>> +use proxmox_schema::api_types::SAFE_ID_FORMAT; >>> +use proxmox_sys::fs::CreateOptions; >>> + >>> +use crate::{Cache, DefaultTimeProvider, TimeProvider}; >>> + >>> +/// A simple, file-backed cache that can be used from multiple processes concurrently. >>> +/// >>> +/// Cache entries are stored as individual files inside a base directory. For instance, >>> +/// a cache entry with the key 'disk_stats' will result in a file 'disk_stats.json' inside >>> +/// the base directory. As the extension implies, the cached data will be stored as a JSON >>> +/// string. >>> +/// >>> +/// For optimal performance, `SharedCache` should have its base directory in a `tmpfs`. >>> +/// >>> +/// ## Key Space >>> +/// Due to the fact that cache keys are being directly used as filenames, they have to match the >>> +/// following regular expression: `[A-Za-z0-9_][A-Za-z0-9._\-]*` >>> +/// >>> +/// ## Concurrency >>> +/// All cache operations are based on atomic file operations, thus accessing/updating the cache from >>> +/// multiple processes at the same time is safe. >>> +/// >>> +/// ## Performance >>> +/// On a tmpfs: >>> +/// ```sh >>> +/// $ cargo run --release --example=performance >>> +/// inserting 100000 keys took 896.609758ms (8.966µs per key) >>> +/// getting 100000 keys took 584.874842ms (5.848µs per key) >>> +/// deleting 100000 keys took 247.742702ms (2.477µs per key) >>> +/// >>> +/// Inserting/getting large objects might of course result in lower performance due to the cost >>> +/// of serialization. >>> +/// ``` >>> +/// >>> +pub struct SharedCache { >>> + base_path: PathBuf, >>> + time_provider: Box, >>> + create_options: CreateOptions, >>> +} >> >> Instead, this should be generic: >> >> pub struct SharedCache { ... } >> >> .. and maybe rename it to SharedFileCache to make it explicit that this >> operates on a file. (but that's more dependent on one's taste tbh) >> >> .. and the impl block below ... > > The main issue here is that it'll all boil down to one general "json > thingy" because it'll be used from out of perl where if we have multiple > types here to instantiate we'll need to create separate `perlmod > #[export]`s for each and every type we need in our perl code. > > OTOH, we can still have the crate side here generic and have the perl-rs > side just instantiate a single `SharedCache`... > Or perhaps have the perl-rs side wrap `SharedCache` and expose only the methods that are needed; I don't see the issue here ;-) > A generic `K` might make it annoying to work with paths, though, so IMO > that can just stay a string ;-) > >> >>> + >>> +impl Cache for SharedCache { >>> + fn set>( >>> + &self, >>> + key: S, >>> + value: Value, > > In any case, I'd definitely argue against `Value` here and just go with > a generic `V: Serialize` at least in the methods. > > Because why serialize twice? (Once from some struct to Value, then from > Value to json.) > ACK, that one I hadn't considered. Good point! > Otherwise we could also just store `&[u8]` and have this be nothing more > than a thing to attach timeouts to bytes in files ;-) > >>> + expires_in: Option, >>> + ) -> Result<(), Error> { >>> + let path = self.get_path_for_key(key.as_ref())?; >>> + let added_at = self.time_provider.now(); >>> + >>> + let item = CachedItem { >>> + value, >>> + added_at, >>> + expires_in, >>> + }; >>> + >>> + let serialized = serde_json::to_vec_pretty(&item)?; >>> + >>> + // Atomically replace file >>> + proxmox_sys::fs::replace_file(path, &serialized, self.create_options.clone(), true)?; >>> + Ok(()) >>> + } >>> + >>> + fn delete>(&self, key: S) -> Result<(), Error> { >>> + let path = self.get_path_for_key(key.as_ref())?; >>> + std::fs::remove_file(path)?; >>> + Ok(()) >>> + } >>> + >>> + fn get>(&self, key: S) -> Result, Error> { > > Same here of course. > >>> + let path = self.get_path_for_key(key.as_ref())?; >>> + >>> + let value = if let Some(content) = proxmox_sys::fs::file_get_optional_contents(path)? { >>> + let value: CachedItem = serde_json::from_slice(&content)?; > > `CachedItem` would need to be generic, and here you could use > `CachedItem<&'_ serde_json::RawValue>` which *should* just refer to a > subslice in `content` here to defer the payload parsing to later. > > This way you avoid at least *some* overhead of the intermediate `Value` > while still being able to access the value's time. > > Alternatively we could think of a more efficient format where the added > and expires fields are in a fixed-sized binary header... but it > shouldn't matter much. > >>> + >>> + let now = self.time_provider.now(); >>> + >>> + if let Some(expires_in) = value.expires_in { >>> + // Check if value is not expired yet. Also do not allow >>> + // values from the future, in case we have clock jumps >>> + if value.added_at + expires_in > now && value.added_at <= now { >>> + Some(value.value) >>> + } else { >>> + None >>> + } >>> + } else { >>> + Some(value.value) >>> + } >>> + } else { >>> + None >>> + }; >>> + >>> + Ok(value) >>> + } >>> +} >> >> ... can be replaced as follows, in order to make it similar to >> std::collections::{HashMap, BTreeMap}: >> >> impl> for SharedCache { >> // Returns old value on successful insert, if given >> fn insert(&self, k: K, v: Value) -> Result, Error> { >> // ... >> } >> >> fn get(&self, k: K) -> Result, Error> { >> // ... >> } >> >> fn remove(&self, k: K) -> Result, Error> { >> // ... >> } >> } >> >> If necessary / sensible, other methods (inspired by {HashMap, BTreeMap} can >> be added as well, such as remove_entry, retain, clear, etc. > > Using the naming found in std sure makes sense. > > But please hold off on the `Entry` API for now, that one will bite ya ;-) [0]: https://doc.rust-lang.org/cargo/reference/features.html#optional-dependencies