From: "Daniel Kral" <d.kral@proxmox.com>
To: "Fiona Ebner" <f.ebner@proxmox.com>,
"Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache
Date: Thu, 16 Oct 2025 17:15:48 +0200 [thread overview]
Message-ID: <DDJUG5YKH96F.2H9FAGCGRML1A@proxmox.com> (raw)
In-Reply-To: <e753b537-762e-4c5a-8f19-cd1cfb56a02b@proxmox.com>
On Thu Oct 16, 2025 at 1:12 PM CEST, Fiona Ebner wrote:
> Am 30.09.25 um 4:21 PM schrieb Daniel Kral:
>> As the HA Manager builds the static load scheduler, it queries the
>> services' static usage by reading and parsing the static guest configs
>> individually, which can take significant time with respect to how many
>> times recompute_online_node_usage(...) is called.
>>
>> PVE::Cluster exposes an efficient interface to gather a set of
>> properties from one or all guest configs [0]. This is used here to build
>> a rather short-lived cache on every (re)initialization of the static
>> load scheduler.
>>
>> The downside to this approach is if there are way more non-HA managed
>> guests in the cluster than HA-managed guests, which causes this to load
>> much more information than necessary. It also doesn't cache the default
>> values for the environment-specific static service usage, which causes
>> quite a bottleneck as well.
>>
>> [0] pve-cluster cf1b19d (add get_guest_config_property IPCC method)
>>
>> Suggested-by: Fiona Ebner <f.ebner@proxmox.com>
>> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
>> ---
>> If the above mentioned downside is too large for some setups, we could
>> also just add a switch to enable/disable the static stats cache or
>> automatically figure out if it brings any benefits.. But I think this
>> will be good enough, especially with the latter patches making way fewer
>> calls to get_service_usage(...).
>
> Parsing the configs is much more costly than
> PVE::Cluster::get_guest_config_properties(), so improving the situation
> for the majority of use cases of the static scheduler still makes sense,
> even if it hurts performance for some more exotic setups. And we would
> like to consider usage of non-HA guests in the future too to make the
> static scheduler more accurate even in such exotic setups. But with
> PSI-based information that is already included in the node usage, no
> extra preparation necessary there.
I fully agree, that's a rather odd case and in hindsight I doubt that
this has any significant performance improvements for the most common
setups.
Not that important, but I'd move this patch after the granular changes
are implemented since only there this patch makes a performance
improvement while at this point in time and with this implementation it
actually degrades performance.
>
>>
>> src/PVE/HA/Env.pm | 12 ++++++++++++
>> src/PVE/HA/Env/PVE2.pm | 21 +++++++++++++++++++++
>> src/PVE/HA/Manager.pm | 1 +
>
> Needs a rebase because of changes in Manager.pm
>
>> @@ -497,6 +499,25 @@ sub get_datacenter_settings {
>> };
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Can't returning '{}' when nothing is there lead to issues down the line?
> If we return undef instead, it's consistent with not having anything
> cached and the caller will fall back to loading the config.
This return value type definitely needs improvement and/or better
documentation, but an undef $self->{static_service_stats}->{$id} value
indicates that it should fallback to the default value as none of the
properties requested by get_guest_config_properties(...) was included in
that particular guest config, e.g. no 'cores', 'sockets', and 'memory'
properties defined in a VM config.
When $self->{static_service_stats} itself is undef, then the static
cache couldn't be queried for some reason.
>
>> @@ -477,6 +467,8 @@ sub new {
>>
>> $self->{service_config} = $self->read_service_config();
>>
>> + $self->{static_service_stats} = undef;
>> +
>> return $self;
>> }
>>
>> @@ -943,6 +935,25 @@ sub watchdog_update {
>> return &$modify_watchog($self, $code);
>> }
>>
>> +sub get_static_service_stats {
>> + my ($self, $id) = @_;
>> +
>> + # undef if update_static_service_stats(...) failed before
>> + return undef if !defined($self->{static_service_stats});
>> +
>> + return $self->{static_service_stats}->{$id} // {};
>
> Same here.
>
>> +}
>> +
>> +sub update_static_service_stats {
>> + my ($self) = @_;
>> +
>> + my $filename = "$self->{statusdir}/static_service_stats";
>> + my $stats = eval { PVE::HA::Tools::read_json_from_file($filename) };
>> + $self->log('warning', "unable to update static service stats cache - $@") if $@;
>> +
>> + $self->{static_service_stats} = $stats;
>> +}
>> +
>> sub get_static_node_stats {
>> my ($self) = @_;
>>
>> diff --git a/src/PVE/HA/Sim/Resources.pm b/src/PVE/HA/Sim/Resources.pm
>> index 72623ee1..7641b1a9 100644
>> --- a/src/PVE/HA/Sim/Resources.pm
>> +++ b/src/PVE/HA/Sim/Resources.pm
>> @@ -143,8 +143,8 @@ sub get_static_stats {
>> my $sid = $class->type() . ":$id";
>> my $hardware = $haenv->hardware();
>>
>> - my $stats = $hardware->read_static_service_stats();
>> - if (my $service_stats = $stats->{$sid}) {
>> + my $service_stats = $hardware->get_static_service_stats($sid);
>> + if (%$service_stats) {
>
> Style nit: with the above change, this could be just $service_stats or a
> definedness check for it. If really checking if there are any keys, I
> prefer being explicit with scalar(keys ...)
ACK will do that, don't know why I didn't see that already :)
>
>> return $service_stats;
>> } elsif ($id =~ /^(\d)(\d\d)/) {
>> # auto assign usage calculated from ID for convenience
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
next prev parent reply other threads:[~2025-10-16 15:15 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-30 14:19 [pve-devel] [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH qemu-server 1/1] config: only fetch necessary default values in get_derived_property helper Daniel Kral
2025-10-15 14:31 ` Fiona Ebner
2025-10-16 9:07 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH proxmox 1/1] resource-scheduling: change score_nodes_to_start_service signature Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH perl-rs 1/1] pve-rs: resource_scheduling: allow granular usage changes Daniel Kral
2025-10-16 10:32 ` Fiona Ebner
2025-10-16 15:34 ` Daniel Kral
2025-10-17 10:55 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 1/9] implement static service stats cache Daniel Kral
2025-10-16 11:12 ` Fiona Ebner
2025-10-16 15:15 ` Daniel Kral [this message]
2025-10-17 10:02 ` Fiona Ebner
2025-10-17 10:08 ` Fiona Ebner
2025-10-17 16:18 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 2/9] manager: remove redundant recompute_online_node_usage from next_state_recovery Daniel Kral
2025-10-16 11:25 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 3/9] manager: remove redundant add_service_usage_to_node " Daniel Kral
2025-10-16 11:33 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 4/9] manager: remove redundant add_service_usage_to_node from next_state_started Daniel Kral
2025-10-16 11:39 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 5/9] rules: resource affinity: decouple get_resource_affinity helper from Usage class Daniel Kral
2025-10-17 11:14 ` Fiona Ebner
2025-10-17 15:46 ` Daniel Kral
2025-10-20 15:18 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 6/9] manager: make recompute_online_node_usage use get_service_nodes helper Daniel Kral
2025-10-17 11:25 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 7/9] usage: allow granular changes to Usage implementations Daniel Kral
2025-10-17 11:57 ` Fiona Ebner
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 8/9] manager: make online node usage computation granular Daniel Kral
2025-10-17 12:32 ` Fiona Ebner
2025-10-17 16:07 ` Daniel Kral
2025-09-30 14:19 ` [pve-devel] [PATCH ha-manager 9/9] manager: make service node usage computation more granular Daniel Kral
2025-10-17 12:42 ` Fiona Ebner
2025-10-17 15:59 ` Daniel Kral
2025-10-20 16:50 ` [pve-devel] superseded: [RFC ha-manager/perl-rs/proxmox/qemu-server 00/12] Granular online_node_usage accounting Daniel Kral
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DDJUG5YKH96F.2H9FAGCGRML1A@proxmox.com \
--to=d.kral@proxmox.com \
--cc=f.ebner@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox