Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO

From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Lukas Wagner <l.wagner@proxmox.com>,
	Proxmox Datacenter Manager development discussion
	<pdm-devel@lists.proxmox.com>
Subject: Re: [pdm-devel] [PATCH proxmox-datacenter-manager 00/15] change task cache mechanism from time-based to max-size FIFO
Date: Wed, 12 Feb 2025 10:19:42 +0100	[thread overview]
Message-ID: <b5adc613-64a3-41e6-bca5-efc0dfdf63c2@proxmox.com> (raw)
In-Reply-To: <3da58b54-5590-47d3-aa0d-f127209919f0@proxmox.com>

Am 06.02.25 um 11:13 schrieb Lukas Wagner:
> On  2025-02-05 16:34, Thomas Lamprecht wrote:
>> Am 31.01.25 um 10:35 schrieb Lukas Wagner:
>>> Some additional context as to explain the 'why', since @Thomas and @Wolfgang requested it:
>>>
>>> The status quo for the task cache is to fetch a certain time-based range of tasks
>>> (iirc the last seven days) from every remote and cache this for a certain period
>>> of time (max-age). If the cached data is too old, we discard the task data
>>> and fetch the same time range again.
>>> My initial reasoning behind designing it like this was to keep the
>>> 'ground truth' completely on the remote side, so *if* somebody were to mess with
>>
>> Yeah, I agree with Wolfgang here, nothing in our API allows messing with that,
>> neither does a first-class PVE CLI command or the like. If we want to hedge
>> against that we basically cannot cache anything at all, rendering them rather
>> useless, as even RRD metrics could be altered.
>>
>> Using the start-time of the newest available task from the cache as (inclusive)
>> boundary to request newer tasks that happened since the last query would be
>> enough or?
>>
>> Pruning older tasks from the cache after some max-time or max-size limit
>> is overstepped would then make sense though.
> 
> I mean yes, this is pretty much what I implemented in this patch series.

Ack, then I probably was shoved a bit too much towards a misinterpretation to
early on, that could have been certainly avoided on my side too, but with
lots of change sets "in the air" reading subjects to get the basic big picture
becomes inevitable, at least for me. E.g., to not just "complain" and actually
provide one possible example that would have helped me here:

task cache: leverage existing cache for reducing load on updates using a FIFO max-size strategy

But no biggie, I could have checked more closely myself and was a bit put off
in general by something that isn't really you to blame for (see at the end). 

>> IMO it for purging older task log entries to avoid huge cache we should mainly
>> focus on the age of entries limit, e.g. most our dashboards will focus on showing
>> the task results since X hours, where a sensible minimum for X is probably at
>> least 24 hours, probably better more like 3 to 5 days to be able to (quickly)
>> see what happened over a weekend, maybe with some holiday attached to it.
> 
> Sure, if you want, I can change it to use time-based approach with a generous maximum size.
> Should be pretty simple change to incorporate.

One compromise might be using time for the minimum and size for the maximum.
E.g., say keep at least something like 48h independent of size, after that
keep until the size is bigger than a few tens of MiB; but just to put out an
idea.

Most reduces task entries should be below 256 bytes (more like 128, but lets
keep some headroom), and only higher if there's a high failure rate, as then
the state entries will be bigger. With a 10 MiB limit per remote and considering
an average of 256 bytes per task we would still be able to hold all tasks for
a whole week if the remote (cluster) would produce ~4 task logs per minute
(10 * 2^20 / 256 / 24 / 60 / 7 = 4.06) – so that might be enough for starters.

W.r.t. splitting this up or not I can see advantages for both, an alternative
to have a specific file per remote might be to keep all in one but rotate them
per day; that might be then useful for pruning older entries from the cache,
e.g. delete files as long as the total size is above the maximum size but there
are still at least two days (for a 48h minimum) are left over.
This way one could still relatively easily process either global or per-remote
task state, as passing a time limit to the backend is probably the most useful
in practice, as it can be clearly communicated to the user and should be useful
for when looking into problems in practice, where it's normally known when
roughly it happened, but not at which absolute count of tasks it did.

I hope I made some sense here.

>> Yes, but the original idea would act the same in this regard if it did not
>> hedge against things that cannot really happen without manually messing around
>> (at which point all bets are off); or what am I missing here? To be fair: I
>> only skimmed the patches and asked Wolfgang a bit about it as he looked through
>> them more closely, so I might indeed miss something or just be slightly confused
>> about the presentation here.
> 
> The initial approach which was merged too soon had some flaws which were kindly pointed out to me off-list
> with some ideas for improvement.
> Essentially, the key point of *this* series is to transform the original approach in a way which
> does not 'hedge against things that cannot really happen'.

Maybe I should have used s/cannot really happen/we do not care for that to
happen, as that if it happens someone messed around manually/ for clarity,
but thanks for explicitly stating that with this series we effectively go
that route either way its called.

> 
> I'm sorry about any confusion here. I'll try to provide more context from the start if something similar
> happens again. Also, I'll try to communicate my questions/doubts (like I said, I wasn't sure if the initial
> approach from the RFC was good or not, hence the RFC in the first place) more openly if I post an RFC in a
> similar way again, e.g. in a cover letter, to make sure that a proper, on-list discussion takes place before
> anything is merged.

Thank you for your explanations here, just to summarize/clarify:
I was mostly put a bit off with how it was merged without any hint about
follow-up potential on the list (and that pattern was repeated already a
few times), you are certainly not to blame for that, and while another
presentation of this series might have helped, that part really wasn't the
problem for me. And while it might be a bit much to ask for, so mentioning
it mainly for the sake of completeness: I'd appreciate a summary independent
of which side (off-list reviewer or author) it comes from for such off-list
discussion – I do not want to discourage off-list talks, they are indeed
often very useful and more efficient to have, especially as long as the
travel time between offices is in the single-digit minute range; and if all
involved take part of that the summary (mail) of the discussion might not
be needed, but probably still wouldn't hurt. I hope I got enough nuance in
there now.

_______________________________________________
pdm-devel mailing list
pdm-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pdm-devel