Date: Tue, 12 May 2026 10:39:35 +0200
Subject: Re: [RFC datacenter-manager 0/4] add generic, per-remote (and global) cache for remote API responses
From: "Lukas Wagner" <l.wagner@proxmox.com>
To: "Dominik Csapak"
References: <20260508150330.363622-1-l.wagner@proxmox.com> <11d83bb3-d8e7-430f-8af8-64181d78154c@proxmox.com>
In-Reply-To: <11d83bb3-d8e7-430f-8af8-64181d78154c@proxmox.com>

Hey Dominik, thanks for your feedback.

On Mon May 11, 2026 at 2:37 PM CEST, Dominik Csapak wrote:
> hi,
>
> IMO the code looks mostly good to me, as well as the approach, a
> few comments regardless (all here, as the patches themselves are
> not that big ;) )
>
> * using the filesystem can be good, it can reduce the memory footprint
>   since we don't automatically store everything in memory
>
>   this can also be a downside though, since now the IOPS requirements
>   rise, and for frequent update cycles it could have a negative impact
>   on the disks (ssd wearout)
>
>   this could be mitigated by using e.g. /var/run instead of /var/cache
>   but then it's back to using memory all the time, with less
>   clear accounting.
>
> I'm actually not sure which is better, maybe having that configurable
> could make sense? (as in, what's the cache path? this way users can
> decide if they want it on disk or in memory)

FWIW, I did some back-of-the-envelope calculations. I assume that the
cache for the cluster resources would likely be the entry with the most
writes by far, since that one scales linearly with the number of
resources *and* has to be refreshed periodically (at the moment, we
have a max-age of 30 seconds for the dashboard).

A guest resource (e.g. a VM) is roughly 260 bytes of minified JSON.
If we assume a huge setup with, say, 50000 guests, that would be

  260 bytes * 50000 = 13 MB

of writes every 30 seconds, which boils down to

  13 MB * 2 * 60 * 24 = ~37.5 GB

of writes per day (without accounting for write amplification effects).
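Or, as a tiny sanity-check program (the numbers are just the
assumptions from above, not measured values):

    fn main() {
        let bytes_per_guest: u64 = 260;           // minified JSON per guest resource
        let guests: u64 = 50_000;
        let refreshes_per_day: u64 = 2 * 60 * 24; // one refresh every 30 seconds

        let per_refresh = bytes_per_guest * guests;    // 13_000_000 bytes ~= 13 MB
        let per_day = per_refresh * refreshes_per_day; // ~37.4 GB
        println!("{per_refresh} bytes per refresh, {per_day} bytes per day");
    }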
A PDM host for huge setups like these would (hopefully) use
enterprise-grade SSDs, so in reality this amount of writes *should* be
fine, I think. The average PDM setup is likely going to be at least one
order of magnitude smaller than this, I would assume. Also, it's worth
mentioning that this is the worst-case scenario where the dashboard is
open 24/7; if that is not the case, we only request resource data every
couple of minutes.

That being said, since we only have 13 MB of data, we could, as you
said, just put that into /var/run for now and live with the data loss
on reboot.

> I'm actually not sure which is better, maybe having that configurable
> could make sense? (as in, what's the cache path? this way users can
> decide if they want it on disk or in memory)

I think both options are fine. Storing it on disk would have the slight
advantage that more expensive, rarely fetched API responses (e.g.
update status; requesting the list of available updates can, as of now,
trigger an 'apt update' on PVE nodes) survive reboots of the host. In
practice, as reboots should hopefully not be that common, this really
should not matter too much, I think.

With regards to making it configurable: yes, that could make sense, but
I'd only add it once we actually have users requesting it.

I think I'd put the cache in /var/run for the next version; going from
non-persistent to persistent seems to be the easier step, in case we
want to change the location in the future.

> * currently you expire values if the timestamp is in the future,
>   could it be better to return an error here? IMO there had to be one
>   at some point, either the time was wrong in the past, or the time
>   is wrong now (neither of which is good?)

The question is, do we want to put the burden on the caller to handle
this error variant correctly? With the current implementation, a
timestamp from the future simply means that the caller gets a `None`,
which in practice means that the caller will retrieve fresh data from
the remote; that data is then put into the cache, overriding the entry
with the bogus timestamp, so the cache heals itself.
If an error is returned, what would you do with it in the caller? Just
bubble it up to the API level and return a failure? Or handle it in
some way to heal the cache manually?
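To illustrate (hypothetical types and names, not the actual code from
the series), the read path can simply treat a timestamp from the future
like an expired entry:

    // Sketch only -- names and types are made up for illustration.
    struct CacheEntry<T> {
        data: T,
        timestamp: i64, // unix epoch, in seconds
    }

    fn still_valid<T>(entry: &CacheEntry<T>, now: i64, max_age: i64) -> bool {
        // a timestamp in the future counts as invalid, just like an expired one
        entry.timestamp <= now && now - entry.timestamp <= max_age
    }

    fn get<T: Clone>(entry: Option<&CacheEntry<T>>, now: i64, max_age: i64) -> Option<T> {
        match entry {
            Some(e) if still_valid(e, now, max_age) => Some(e.data.clone()),
            // missing, expired, or bogus timestamp: the caller fetches fresh
            // data and overwrites the entry, so the cache heals itself
            _ => None,
        }
    }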
> * not sure if the pdm_cache::instance() abstraction is gaining us
>   anything here, we could simply have freestanding functions?

I guess at the moment we do not gain anything here, true. I'm leaning
towards this style because I still have my proposal from
https://lore.proxmox.com/pdm-devel/20260129134418.307552-1-l.wagner@proxmox.com/T/#t
in mind, which I will hopefully start working on again once I have the
time for it. This cache would also be something that needs to be
injected into the API handlers via some application context object, to
ensure that we can drive API handler code in integration tests.
But since this is 100% PDM-internal code, I can just put it behind
helpers for now and refactor it again once we actually need it.

> * for the subscription info you chose to implement an async wrapper
>   maybe we could/should have that directly in the interface
>   (of pdm_cache for example)? otherwise people might use it
>   without a 'spawn_blocking'. having both an async and sync interface
>   could be beneficial

Yes, I wondered the same actually. Would you add the async interface to
`pdm_cache`, or to the generic implementation in `namespaced_cache`? If
we do the former, we need to add wrappers for the read/write guard
types as well, so that would be quite a bit more code. If we put it in
`namespaced_cache`, we should probably feature-gate it as soon as we
put the cache into a shared crate in proxmox.git.
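Roughly what I have in mind for such a wrapper (hypothetical names, not
the code from the series; assumes a tokio runtime):

    // Hypothetical sync read path, standing in for the real cache accessor.
    fn get_cached_sync(remote: &str, key: &str) -> anyhow::Result<Option<Vec<u8>>> {
        let _ = (remote, key);
        // ... open the cache file, take the lock, read and deserialize ...
        Ok(None)
    }

    // Async wrapper: move the blocking file/lock access onto tokio's blocking
    // thread pool so callers cannot accidentally block the executor by
    // forgetting to use spawn_blocking themselves.
    async fn get_cached(remote: String, key: String) -> anyhow::Result<Option<Vec<u8>>> {
        tokio::task::spawn_blocking(move || get_cached_sync(&remote, &key)).await?
    }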