Hi,

We upgraded to 3.3 yesterday; there is not much gain to notice from the new version or the change in keying. It’s still (obviously) pretty busy.

However, I also tried to remove some datastores, which failed with timeouts. PBS even stopped authenticating (and probably stopped working entirely) for about 10 seconds, which was an unpleasant surprise.

So, looking into that further, I noticed the following log messages:
Dec 18 16:14:32 pbs005 proxmox-backup-proxy[39143]: GET /api2/json/admin/datastore/XXXXXX/status: 400 Bad Request: [client [::ffff]:42104] Unable to acquire lock "/etc/proxmox-backup/.datastore.lck" - Interrupted system call (os error 4)
Dec 18 16:14:32 pbs005 proxmox-backup-proxy[39143]: GET /api2/json/admin/datastore/XXXXXX/status: 400 Bad Request: [client [::ffff]:42144] Unable to acquire lock "/etc/proxmox-backup/.datastore.lck" - Interrupted system call (os error 4)
Dec 18 16:14:32 pbs005 proxmox-backup-proxy[39143]: GET /api2/json/admin/datastore/XXXXXX/status: 400 Bad Request: [client [::ffff]:47286] Unable to acquire lock "/etc/proxmox-backup/.datastore.lck" - Interrupted system call (os error 4)
Dec 18 16:14:32 pbs005 proxmox-backup-proxy[39143]: GET /api2/json/admin/datastore/XXXXXX/status: 400 Bad Request: [client [::ffff:]:45994] Unable to acquire lock "/etc/proxmox-backup/.datastore.lck" - Interrupted system call (os error 4)

That surprised me, since this is a ’status’ call, which should not need to lock the datastore config.

https://git.proxmox.com/?p=proxmox-backup.git;a=blob;f=src/api2/admin/datastore.rs;h=c611f593624977defc49d6e4de2ab8185cfe09e9;hb=HEAD#l687 does not lock the config, but 

https://git.proxmox.com/?p=proxmox-backup.git;a=blob;f=pbs-datastore/src/datastore.rs;h=0801b4bf6b25eaa6f306c7b39ae2cfe81b4782e1;hb=HEAD#l204 does.

So if I understand this correctly, every ’status’ call (30 per second in our case) locks the datastore config exclusively. And on top of that, every time ’status’ gets called, the whole datastore config gets loaded?
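
To illustrate the pattern I mean, here is a toy sketch (not the actual PBS code, names are made up, and an in-process mutex stands in for the lock file on disk):

    use std::fs;
    use std::sync::Mutex;

    // Stand-in for the lock file /etc/proxmox-backup/.datastore.lck;
    // the real code locks a file on disk, not an in-process mutex.
    static CONFIG_LOCK: Mutex<()> = Mutex::new(());

    // Hypothetical per-request lookup: every call serializes on the lock
    // and re-reads and re-parses the full datastore.cfg, even for a
    // read-only 'status' request.
    fn lookup_datastore(name: &str) -> Option<String> {
        let _guard = CONFIG_LOCK.lock().unwrap();
        let raw = fs::read_to_string("/etc/proxmox-backup/datastore.cfg").ok()?;
        raw.lines()
            .filter_map(|line| line.strip_prefix("datastore: "))
            .find(|n| n.trim() == name)
            .map(str::to_owned)
    }

    fn main() {
        // we see roughly 30 of these calls per second
        println!("{:?}", lookup_datastore("store1"));
    }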

Is that something that could use some performance tuning?

— 
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark@tuxis.nl


------ Original Message ------
From "Shannon Sterz" <s.sterz@proxmox.com>
To "Mark Schouten" <mark@tuxis.nl>
Cc "Proxmox Backup Server development discussion" <pbs-devel@lists.proxmox.com>
Date 16/12/2024 12:51:47
Subject Re: Re[2]: [pbs-devel] Authentication performance

On Mon Dec 16, 2024 at 12:23 PM CET, Mark Schouten wrote:
Hi,
 
>
>would you mind sharing either `authkey.pub` or the output of the
>following commands:
>
>head --lines=1 /etc/proxmox-backup/authkey.key
>cat /etc/proxmox-backup/authkey.key | wc -l
 
-----BEGIN RSA PRIVATE KEY-----
51
 
So that is indeed the legacy method. We are going to upgrade our PBS’es
on Wednesday.
 
>
>The first should give the PEM header of the authkey whereas the second
>provides the amount of lines that the key takes up in the file. Both
>give an indication whether you are using the legacy RSA keys or newer
>Ed25519 keys. The latter should provide better performance; security should
>not be affected much by this change. If the output of the commands looks
>like this:
>
>-----BEGIN PRIVATE KEY-----
>3
>
>Then you are using the newer keys. There currently isn't a recommended
>way to upgrade the keys. However, in theory you should be able to remove
>the old keys, restart PBS, and it should just generate keys in the new
>format. Note that this will log out anyone who is currently
>authenticated and they'll have to re-authenticate.
 
Seems like a good moment to update those keys as well.
 
Sure, just be aware that you have to manually delete the key before
restarting the PBS. Upgrading alone won't affect the key. Ideally you'd
test this before rolling it out, if you can.
 
>In general, tokens should still be faster to authenticate, so we'd
>recommend that you try to get your users to switch to token-based
>authentication where possible. Improving performance there is a bit
>trickier though, as it often comes with a security trade-off (in the
>background we use yescrypt for the authentication there, which
>deliberately adds a work factor). However, we may be able to improve
>performance a bit via caching methods or similar.
 
Yes, that might help. I’m also not sure if it actually is
authentication, or if it is the datastore call that the PVE environments
make. As you can see in your support issue 3153557, it looks like some
requests loop through all datastores, before responding with a limited
set of datastores.
 
I looked at that ticket and yes, that is probably unrelated to
authentication.
 
For instance (and I’m a complete noob wrt Rust), if I understand
correctly, PBS loops through all the datastores, checks mount status and
config, and only starts filtering at line 1347. If so, in our case with
over 1100 datastores, that might cause quite some load?
 
Possible, yes, that would depend on your configuration. Are all of these
datastores defined with a backing device? Because if not, then this
should be fairly fast (as in, this should not actually touch the disks).
If they are, then yes this could be slow as each store would trigger at
least 2 stat calls afaict.
 
In any case, it should be fine to move the `mount_status` check after
the `if allowed || allow_id` check from what I can tell. Not sure why
we'd need to check the mount_status for a datastore we won't include in
the results anyway. Same goes for parsing the store config imo. Sent a
patch for that [1].
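
For illustration, roughly the reordering being suggested; a toy sketch with made-up names, not the actual patch:

    // Toy example: do the cheap permission check first and only then the
    // expensive per-store work (mount-status/stat calls, config parsing).
    struct StoreEntry {
        name: String,
        allowed: bool,
    }

    // Placeholder for the real mount-status check (the part that stats disks).
    fn mount_status(_name: &str) -> bool {
        true
    }

    fn list_datastores(stores: &[StoreEntry]) -> Vec<String> {
        let mut out = Vec::new();
        for store in stores {
            // filter first, so stores the caller may not see cost nothing ...
            if !store.allowed {
                continue;
            }
            // ... and only now touch mount status / store config
            if mount_status(&store.name) {
                out.push(store.name.clone());
            }
        }
        out
    }

    fn main() {
        let stores = vec![
            StoreEntry { name: "store1".into(), allowed: true },
            StoreEntry { name: "store2".into(), allowed: false },
        ];
        println!("{:?}", list_datastores(&stores));
    }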
 
 
 
 
Thanks,
 
Mark Schouten
CTO, Tuxis B.V.
+31 318 200208 / mark@tuxis.nl