Re: [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer


From: "Daniel Kral" <d.kral@proxmox.com>
To: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
	"pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>
Subject: Re: [RFC PATCH-SERIES many 00/36] dynamic scheduler + load rebalancer
Date: Fri, 13 Mar 2026 10:35:34 +0100
Message-ID: <DH1JWA8I3P7O.32KPTGKF5NUHQ@proxmox.com>
In-Reply-To: <5a2c063e18f8b052e22af94ecd5db74748082019.camel@groupe-cyllene.com>

On Thu Mar 12, 2026 at 5:24 PM CET, Alexandre DERUMIER wrote:
> Hi,
>
> thanks for working on this !
>
> I see another possible rebalancing case: when the pressure of a VM is
> too high (mostly on CPU), like 5~10% of pressure (should be
> configurable).
>
> When a host has a lot of cores, balancing based on the host CPU average
> does not always work well, because some cores can be at 100% while
> others have low usage
> (you can have a global host CPU average of 60~70%, for example, but
> some VMs with CPU pressure).
>
> It could be great to have some kind of trigger to migrate vms with too
> much pressure (reusing topsis for example).

Hi Alexandre, thanks for the input!

Using the pressure stall information (PSI) to reduce resource contention
and make resource usage more efficient is indeed a goal! There are some
nuances to integrating it properly into the load balancing decisions
though.

I want to give some notes about the current design decisions, which I'll
include in the documentation to make them more accessible.

The current implementation focuses on the stabilizing part of the load
balancing system. That is, the system reaches an equilibrium once the
general resource usage is evenly balanced across the cluster nodes.
This reduces the likelihood of individual nodes reaching 100% CPU and/or
memory usage, which will certainly lock up the guests on those nodes.
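
As a rough illustration of what "evenly balanced" can mean, here is a
minimal sketch that scores a cluster by the spread of its per-node usage.
The population standard deviation is only an assumption picked for the
example, not the metric the series uses, and the node names and usage
values are made up.

```python
from statistics import pstdev


def imbalance(node_usage):
    """Spread of per-node usage fractions (0.0 to 1.0); 0.0 means perfectly even."""
    return pstdev(node_usage.values())


# Hypothetical CPU usage fractions before and after rebalancing.
before = {"node1": 0.90, "node2": 0.30, "node3": 0.30}
after = {"node1": 0.50, "node2": 0.50, "node3": 0.50}

print(imbalance(before) > imbalance(after))
```

The total load is the same in both cases (1.5 nodes' worth); only its
distribution changes, and no node is close to 100% afterwards.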

We cannot prevent the nodes from reaching 100% of course, as that would
mean the cluster itself is using up more resources than it can handle
and should be expanded in hardware.

The important part is that the CPU and memory usage of guests is usually
fairly reproducible across nodes. That is, we can more or less assume
that a guest's absolute usage will be the same on another node (not
taking things like KSM into account for now).

This is not the case with pressure stall information, as it depends on
the host and the processes running there. We cannot predict whether the
pressure on one node or for one guest will be reduced by moving the
guest to another node.
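
To make the difference concrete: with absolute usage, the effect of a
migration can be estimated with simple arithmetic before moving anything.
This is a small sketch under the assumption that a guest's CPU usage (in
cores) carries over unchanged to the target node; the function name and
the numbers are hypothetical.

```python
def predicted_cpu_usage(node_used_cores, node_total_cores, guest_delta_cores):
    """Usage fraction of a node after adding (or removing, if negative) a guest."""
    return (node_used_cores + guest_delta_cores) / node_total_cores


# Hypothetical cluster: two 16-core nodes, the guest uses 4 cores.
source_after = predicted_cpu_usage(14.0, 16, -4.0)  # guest leaves the source
target_after = predicted_cpu_usage(6.0, 16, 4.0)    # guest arrives at the target

print(source_after, target_after)
```

There is no comparable formula for pressure: the stall a guest sees on
the target depends on what else is contending for the CPU there.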

The rough idea might be that either the HA Manager (through PSI
information broadcast over the pmxcfs) or the LRMs (by polling the PSI
files locally and signaling the information to the HA Manager) could
give more clues about where guests should go. This needs more care and
thought though, as it's important that the system stabilizes and doesn't
move guests around all the time.
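
For illustration only, this is roughly what local polling with some
damping could look like. The parser follows the line format the kernel
uses for its PSI files ("some avg10=... avg60=... avg300=... total=...").
The trigger class, its two thresholds and the choice of hysteresis as the
damping mechanism are assumptions for this sketch, nothing of that is
part of the series.

```python
def parse_psi(text):
    """Parse PSI file content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


class PressureTrigger:
    """Hypothetical hysteresis: fire at `high`, release only below `low`."""

    def __init__(self, high=10.0, low=5.0):
        self.high = high
        self.low = low
        self.active = False

    def update(self, avg):
        if self.active:
            # Stay active until the pressure clearly dropped again.
            if avg < self.low:
                self.active = False
        elif avg >= self.high:
            self.active = True
        return self.active


# Sample content in the format of a cpu pressure file (values made up).
sample = (
    "some avg10=12.50 avg60=8.00 avg300=3.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)

trigger = PressureTrigger()
states = [trigger.update(v) for v in (4.0, 11.0, 7.0, 4.0)]
print(psi["some"]["avg10"], states)
```

The gap between the two thresholds is what keeps a value hovering around
a single limit from toggling the decision on every poll.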

Hope this gives some more insight into why this design was chosen. It's
important to have the core system in a shape that can be improved on
later ;).

I'd be happy for tests, design critique and review of course!

> Another improvement could be to filter candidate target nodes based on
> resource availability (for example, if the target node has fewer cores
> than the VM, or lacks the storage or the network).

These cases should certainly be easier to implement. I also want to add
an option to include/exclude guests from the load balancing scheme in
one of the next revisions or in a follow-up.

For now, the series implements load balancing for the HA stack only, so
the assumption is that an HA resource has all necessary resources
available on the nodes. If there are nodes where this is not the case,
these should be excluded with HA node affinity rules.
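
Purely as a sketch of the kind of filter you describe, assuming each node
advertises its core count and available storages: the data model and all
names below are hypothetical and do not correspond to anything in the
series or in the HA stack.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cores: int
    storages: frozenset


@dataclass
class Guest:
    cores: int
    storages: frozenset


def eligible_nodes(guest, nodes):
    """Names of nodes with enough cores and all storages the guest needs."""
    return [
        n.name
        for n in nodes
        if n.cores >= guest.cores and guest.storages <= n.storages
    ]


nodes = [
    Node("node1", 8, frozenset({"local", "ceph"})),
    Node("node2", 32, frozenset({"local"})),
    Node("node3", 32, frozenset({"local", "ceph"})),
]
guest = Guest(cores=16, storages=frozenset({"ceph"}))

print(eligible_nodes(guest, nodes))
```

Such a filter would run before any scoring, so that unsuitable nodes
never become migration candidates in the first place.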

The goal is to eventually expand this to the whole cluster, of course,
but this needs some more adaptations and should certainly be handled in
another patch series.

    Daniel
