Re: [pbs-devel] [PATCH proxmox-backup 1/3] pbs-config: cache verified API token secrets

From: Samuel Rufinatscha <s.rufinatscha@proxmox.com>
To: Shannon Sterz <s.sterz@proxmox.com>
Cc: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup 1/3] pbs-config: cache verified API token secrets
Date: Tue, 9 Dec 2025 14:29:17 +0100
Message-ID: <9f3700c7-39e5-4b50-80c5-e28385de16dc@proxmox.com>
In-Reply-To: <DEQC8V17QVC8.MDWJDTV3DB71@proxmox.com>

On 12/5/25 3:03 PM, Shannon Sterz wrote:
> On Fri Dec 5, 2025 at 2:25 PM CET, Samuel Rufinatscha wrote:
>> Currently, every token-based API request reads the token.shadow file and
>> runs the expensive password hash verification for the given token
>> secret. This shows up as a hotspot in /status profiling (see
>> bug #7017 [1]).
>>
>> This patch introduces an in-memory cache of successfully verified token
>> secrets. Subsequent requests for the same token+secret combination only
>> perform a comparison using openssl::memcmp::eq and avoid re-running the
>> password hash. The cache is updated when a token secret is set and
>> cleared when a token is deleted. Note, this does NOT include manual
>> config changes, which will be covered in a subsequent patch.
>>
>> This patch partly fixes bug #7017 [1].
>>
>> [1] https://bugzilla.proxmox.com/show_bug.cgi?id=7017
>>
>> Signed-off-by: Samuel Rufinatscha <s.rufinatscha@proxmox.com>
>> ---
>>   pbs-config/src/token_shadow.rs | 58 +++++++++++++++++++++++++++++++++-
>>   1 file changed, 57 insertions(+), 1 deletion(-)
>>
>> diff --git a/pbs-config/src/token_shadow.rs b/pbs-config/src/token_shadow.rs
>> index 640fabbf..47aa2fc2 100644
>> --- a/pbs-config/src/token_shadow.rs
>> +++ b/pbs-config/src/token_shadow.rs
>> @@ -1,6 +1,8 @@
>>   use std::collections::HashMap;
>> +use std::sync::RwLock;
>>
>>   use anyhow::{bail, format_err, Error};
>> +use once_cell::sync::OnceCell;
>>   use serde::{Deserialize, Serialize};
>>   use serde_json::{from_value, Value};
>>
>> @@ -13,6 +15,13 @@ use crate::{open_backup_lockfile, BackupLockGuard};
>>   const LOCK_FILE: &str = pbs_buildcfg::configdir!("/token.shadow.lock");
>>   const CONF_FILE: &str = pbs_buildcfg::configdir!("/token.shadow");
>>
>> +/// Global in-memory cache for successfully verified API token secrets.
>> +/// The cache stores plain text secrets for token Authids that have already been
>> +/// verified against the hashed values in `token.shadow`. This allows for cheap
>> +/// subsequent authentications for the same token+secret combination, avoiding
>> +/// recomputing the password hash on every request.
>> +static TOKEN_SECRET_CACHE: OnceCell<RwLock<ApiTokenSecretCache>> = OnceCell::new();
> 
> any reason you are using a once cell with a custom get_or_init function
> instead of a simple `LazyCell` [1] here? seems to me that this would be
> the more appropriate type here? similar question for the
> proxmox-access-control portion of this series.
> 
> [1]: https://doc.rust-lang.org/std/cell/struct.LazyCell.html
>

Good point, we can initialize it directly! Will change to
std::sync::LazyLock (the thread-safe counterpart of LazyCell, which a
static needs). Thanks!
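
For illustration, a rough sketch of what the direct initialization could
look like, assuming std::sync::LazyLock (stable since Rust 1.80). This is
not the actual follow-up patch: `TokenId` is a placeholder for the real
`Authid` type, and `cache_contains` is a hypothetical helper that only
exists to make the sketch checkable.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};

// Placeholder for the real `Authid` type, to keep the sketch self-contained.
type TokenId = String;

struct ApiTokenSecretCache {
    /// Keys are token ids, values are the corresponding plain text secrets.
    secrets: HashMap<TokenId, String>,
}

// Initialized on first access, so no separate accessor function with
// get_or_init is needed. LazyLock is used rather than LazyCell because a
// static must be Sync, and LazyCell is not.
static TOKEN_SECRET_CACHE: LazyLock<RwLock<ApiTokenSecretCache>> = LazyLock::new(|| {
    RwLock::new(ApiTokenSecretCache {
        secrets: HashMap::new(),
    })
});

fn cache_insert_secret(tokenid: TokenId, secret: String) {
    let mut cache = TOKEN_SECRET_CACHE.write().unwrap();
    cache.secrets.insert(tokenid, secret);
}

fn cache_remove_secret(tokenid: &str) {
    let mut cache = TOKEN_SECRET_CACHE.write().unwrap();
    cache.secrets.remove(tokenid);
}

// Hypothetical helper, not part of the patch.
fn cache_contains(tokenid: &str) -> bool {
    TOKEN_SECRET_CACHE.read().unwrap().secrets.contains_key(tokenid)
}
```

The call sites would then use TOKEN_SECRET_CACHE directly instead of
going through token_secret_cache().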

>> +
>>   #[derive(Serialize, Deserialize)]
>>   #[serde(rename_all = "kebab-case")]
>>   /// ApiToken id / secret pair
>> @@ -54,9 +63,25 @@ pub fn verify_secret(tokenid: &Authid, secret: &str) -> Result<(), Error> {
>>           bail!("not an API token ID");
>>       }
>>
>> +    // Fast path
>> +    if let Some(cached) = token_secret_cache().read().unwrap().secrets.get(tokenid) {
>> +        // Compare cached secret with provided one using constant time comparison
>> +        if openssl::memcmp::eq(cached.as_bytes(), secret.as_bytes()) {
>> +            // Already verified before
>> +            return Ok(());
>> +        }
>> +        // Fall through to slow path if secret doesn't match cached one
>> +    }
>> +
>> +    // Slow path: read file + verify hash
>>       let data = read_file()?;
>>       match data.get(tokenid) {
>> -        Some(hashed_secret) => proxmox_sys::crypt::verify_crypt_pw(secret, hashed_secret),
>> +        Some(hashed_secret) => {
>> +            proxmox_sys::crypt::verify_crypt_pw(secret, hashed_secret)?;
>> +            // Cache the plain secret for future requests
>> +            cache_insert_secret(tokenid.clone(), secret.to_owned());
>> +            Ok(())
>> +        }
>>           None => bail!("invalid API token"),
>>       }
>>   }
>> @@ -82,6 +107,8 @@ fn set_secret(tokenid: &Authid, secret: &str) -> Result<(), Error> {
>>       data.insert(tokenid.clone(), hashed_secret);
>>       write_file(data)?;
>>
>> +    cache_insert_secret(tokenid.clone(), secret.to_owned());
>> +
>>       Ok(())
>>   }
>>
>> @@ -97,5 +124,34 @@ pub fn delete_secret(tokenid: &Authid) -> Result<(), Error> {
>>       data.remove(tokenid);
>>       write_file(data)?;
>>
>> +    cache_remove_secret(tokenid);
>> +
>>       Ok(())
>>   }
>> +
>> +struct ApiTokenSecretCache {
>> +    /// Keys are token Authids, values are the corresponding plain text secrets.
>> +    /// Entries are added after a successful on-disk verification in
>> +    /// `verify_secret` or when a new token secret is generated by
>> +    /// `generate_and_set_secret`. Used to avoid repeated
>> +    /// password-hash computation on subsequent authentications.
>> +    secrets: HashMap<Authid, String>,
>> +}
>> +
>> +fn token_secret_cache() -> &'static RwLock<ApiTokenSecretCache> {
>> +    TOKEN_SECRET_CACHE.get_or_init(|| {
>> +        RwLock::new(ApiTokenSecretCache {
>> +            secrets: HashMap::new(),
>> +        })
>> +    })
>> +}
>> +
>> +fn cache_insert_secret(tokenid: Authid, secret: String) {
>> +    let mut cache = token_secret_cache().write().unwrap();
> 
> unwrap here could panic if another thread is holding a guard, any reason
> to not return a result here and bubble up the error instead?
>

unwrap() only panics here if the lock is poisoned, i.e. if another
thread panicked while holding the write lock. A thread that merely holds
a guard just makes write() block. Once the lock is poisoned, the cache
might be in an inconsistent state, and every future read() / write()
also returns a PoisonError. So if we bubble up the error here, every
subsequent request gets the poison error as well.

I think we can either:
- treat this as a hard bug and let the process panic on PoisonError, so
  keep write().unwrap()
- catch the error, recover the guard via .into_inner() and clear the
  cache. This still forces every future read() / write() call to handle
  the poison case correctly

I think it makes sense to fail hard here. If the lock is poisoned, the
state is likely broken and it seems better to let the process restart.
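
To make the trade-off concrete, here is a small standalone sketch of the
poisoning behavior and of the recovery variant. It only uses std; the
`Cache` alias and the `poison` and `insert_with_recovery` helpers are
made up for this example and are not part of the patch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Placeholder for the real cache type.
type Cache = HashMap<String, String>;

/// Poisons the lock by panicking in a thread that holds the write guard.
fn poison(lock: &Arc<RwLock<Cache>>) {
    let lock = Arc::clone(lock);
    // join() returns Err because the thread panicked; that is expected here.
    let _ = thread::spawn(move || {
        let mut guard = lock.write().unwrap();
        guard.insert("half".into(), "written".into());
        panic!("simulated bug while holding the write lock");
    })
    .join();
}

/// Recovery variant: take the guard out of the PoisonError, drop all
/// possibly inconsistent entries, then insert as usual.
fn insert_with_recovery(lock: &RwLock<Cache>, id: &str, secret: &str) {
    let mut guard = match lock.write() {
        Ok(guard) => guard,
        Err(poisoned) => {
            let mut guard = poisoned.into_inner();
            guard.clear();
            guard
        }
    };
    guard.insert(id.to_owned(), secret.to_owned());
}
```

Note that into_inner() does not un-poison the lock: after
insert_with_recovery() returns, read() and write() still yield a
PoisonError, so every other call site would need the same handling.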

>> +    cache.secrets.insert(tokenid, secret);
>> +}
>> +
>> +fn cache_remove_secret(tokenid: &Authid) {
>> +    let mut cache = token_secret_cache().write().unwrap();
> 
> same here and in the following patches (i won't comment on each
> occurrence there separately.)
> 
>> +    cache.secrets.remove(tokenid);
>> +}
> 
