public inbox for pve-devel@lists.proxmox.com
From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH ha-manager v3 35/40] implement automatic rebalancing
Date: Tue, 31 Mar 2026 11:07:01 +0200	[thread overview]
Message-ID: <DHGUK85D97LK.3I0CO8FIJOCJG@proxmox.com> (raw)
In-Reply-To: <20260330144101.668747-36-d.kral@proxmox.com>

two nits inline

otherwise lgtm

On Mon Mar 30, 2026 at 4:30 PM CEST, Daniel Kral wrote:
> If the automatic load balancing system is enabled, it checks whether the
> cluster node imbalance exceeds a user-defined threshold for a number of
> consecutive HA Manager rounds ("hold duration"). If it does, it chooses
> the best resource motion to improve the cluster node imbalance and
> queues it if the motion improves the imbalance by at least a
> user-defined margin.
>
> This patch introduces resource bundles, which ensure that HA resources
> in strict positive resource affinity rules are treated as a whole
> "bundle" instead of as individual HA resources.
>
> Specifically, active and stationary resource bundles are resource
> bundles that have at least one resource running and all resources
> located on the same node. This distinction is needed because newly
> created strict positive resource affinity rules may still require some
> resource motions to enforce the rule.
>
> Additionally, the migration candidate generation prunes any target
> nodes that do not adhere to the HA rules of these resource bundles
> before scoring the migration candidates.

nice work! also very nice idea to introduce resource bundles

>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> changes v2 -> v3:
> - none
>

[snip]

> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 2576c762..0f8a03a6 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -59,10 +59,17 @@ sub new {
>  
>      my $self = bless {
>          haenv => $haenv,
> -        crs => {},
> +        crs => {
> +            auto_rebalance => {},
> +        },
>          last_rules_digest => '',
>          last_groups_digest => '',
>          last_services_digest => '',
> +        # used to track how many HA rounds the imbalance threshold has been exceeded
> +        #
> +        # this is not persisted for a CRM failover as in the mean time
> +        # the usage statistics might have change quite a bit already

nit:

           # the usage statistics might have change[d] quite a bit already

> +        sustained_imbalance_round => 0,
>          group_migration_round => 3, # wait a little bit
>      }, $class;
>  
> @@ -97,6 +104,13 @@ sub update_crs_scheduler_mode {
>      my $crs_cfg = $dc_cfg->{crs};
>  
>      $self->{crs}->{rebalance_on_request_start} = !!$crs_cfg->{'ha-rebalance-on-start'};
> +    $self->{crs}->{auto_rebalance}->{enable} = !!$crs_cfg->{'ha-auto-rebalance'};
> +    $self->{crs}->{auto_rebalance}->{threshold} = $crs_cfg->{'ha-auto-rebalance-threshold'} // 0.7;
> +    $self->{crs}->{auto_rebalance}->{method} = $crs_cfg->{'ha-auto-rebalance-method'}
> +        // 'bruteforce';
> +    $self->{crs}->{auto_rebalance}->{hold_duration} = $crs_cfg->{'ha-auto-rebalance-hold-duration'}
> +        // 3;
> +    $self->{crs}->{auto_rebalance}->{margin} = $crs_cfg->{'ha-auto-rebalance-margin'} // 0.1;
>  
>      my $old_mode = $self->{crs}->{scheduler};
>      my $new_mode = $crs_cfg->{ha} || 'basic';
> @@ -114,6 +128,150 @@ sub update_crs_scheduler_mode {
>      return;
>  }
>  
> +# Returns a hash of lists, which contain the running, non-moving HA resource
> +# bundles, which are on the same node, implied by the strict positive resource
> +# affinity rules.
> +#
> +# Each resource bundle has a leader, which is the alphabetically first running
> +# HA resource in the resource bundle and also the key of each resource bundle
> +# in the returned hash.
> +sub get_active_stationary_resource_bundles {
> +    my ($ss, $resource_affinity) = @_;
> +
> +    my $resource_bundles = {};
> +OUTER: for my $sid (sort keys %$ss) {
> +        # do not consider non-started resource as 'active' leading resource
> +        next if $ss->{$sid}->{state} ne 'started';
> +
> +        my @resources = ($sid);
> +        my $nodes = { $ss->{$sid}->{node} => 1 };
> +
> +        my ($dependent_resources) = get_affinitive_resources($resource_affinity, $sid);
> +        if (%$dependent_resources) {
> +            for my $csid (keys %$dependent_resources) {
> +                next if !defined($ss->{$csid});
> +                my ($state, $node) = $ss->{$csid}->@{qw(state node)};
> +
> +                # do not consider stationary bundle if a dependent resource moves
> +                next OUTER if $state eq 'migrate' || $state eq 'relocate';
> +                # do not add non-started resource to active bundle
> +                next if $state ne 'started';
> +
> +                $nodes->{$node} = 1;
> +
> +                push @resources, $csid;
> +            }
> +
> +            @resources = sort @resources;
> +        }
> +
> +        # skip resource bundles, which are not on the same node yet
> +        next if keys %$nodes > 1;
> +
> +        my $leader_sid = $resources[0];
> +
> +        $resource_bundles->{$leader_sid} = \@resources;
> +    }
> +
> +    return $resource_bundles;
> +}
> +
> +# Returns a hash of hashes, where each item contains the resource bundle's
> +# leader, the list of HA resources in the resource bundle, and the list of
> +# possible nodes to migrate to.
> +sub get_resource_migration_candidates {
> +    my ($self) = @_;
> +
> +    my ($ss, $compiled_rules, $online_node_usage) =
> +        $self->@{qw(ss compiled_rules online_node_usage)};
> +    my ($node_affinity, $resource_affinity) =
> +        $compiled_rules->@{qw(node-affinity resource-affinity)};
> +
> +    my $resource_bundles = get_active_stationary_resource_bundles($ss, $resource_affinity);
> +
> +    my @compact_migration_candidates = ();
> +    for my $leader_sid (sort keys %$resource_bundles) {
> +        my $current_leader_node = $ss->{$leader_sid}->{node};
> +        my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
> +
> +        my (undef, $target_nodes) = get_node_affinity($node_affinity, $leader_sid, $online_nodes);
> +        my ($together, $separate) =
> +            get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
> +        apply_negative_resource_affinity($separate, $target_nodes);
> +
> +        delete $target_nodes->{$current_leader_node};
> +
> +        next if !%$target_nodes;
> +
> +        push @compact_migration_candidates,
> +            {
> +                leader => $leader_sid,
> +                nodes => [sort keys %$target_nodes],
> +                resources => $resource_bundles->{$leader_sid},
> +            };
> +    }
> +
> +    return \@compact_migration_candidates;
> +}
> +
> +sub load_balance {
> +    my ($self) = @_;
> +
> +    my ($crs, $haenv, $online_node_usage) = $self->@{qw(crs haenv online_node_usage)};
> +    my ($auto_rebalance_opts) = $crs->{auto_rebalance};
> +
> +    return if !$auto_rebalance_opts->{enable};
> +    return if $crs->{scheduler} ne 'static' && $crs->{scheduler} ne 'dynamic';
> +    return if $self->any_resource_motion_queued_or_running();
> +
> +    my ($threshold, $method, $hold_duration, $margin) =
> +        $auto_rebalance_opts->@{qw(threshold method hold_duration margin)};
> +
> +    my $imbalance = $online_node_usage->calculate_node_imbalance();
> +
> +    # do not load balance unless imbalance threshold has been exceeded
> +    # consecutively for $hold_duration calls to load_balance()
> +    if ($imbalance < $threshold) {
> +        $self->{sustained_imbalance_round} = 0;
> +        return;
> +    } else {
> +        $self->{sustained_imbalance_round}++;
> +        return if $self->{sustained_imbalance_round} < $hold_duration;
> +        $self->{sustained_imbalance_round} = 0;
> +    }
> +
> +    my $candidates = $self->get_resource_migration_candidates();
> +
> +    my $result;
> +    if ($method eq 'bruteforce') {
> +        $result = $online_node_usage->select_best_balancing_migration($candidates);
> +    } elsif ($method eq 'topsis') {
> +        $result = $online_node_usage->select_best_balancing_migration_topsis($candidates);
> +    }
> +
> +    # happens if $candidates is empty or $method isn't handled above
> +    return if !$result;
> +
> +    my ($migration, $target_imbalance) = $result->@{qw(migration imbalance)};
> +
> +    my $relative_change = ($imbalance - $target_imbalance) / $imbalance;
> +    return if $relative_change < $margin;
> +
> +    my ($sid, $source, $target) = $migration->@{qw(sid source-node target-node)};
> +
> +    my (undef, $type, $id) = $haenv->parse_sid($sid);
> +    my $task = $type eq 'vm' ? "migrate" : "relocate";
> +    my $cmd = "$task $sid $target";
> +

nit: tbh I racked my brain a bit to work out what this rounding
technique does; so pls at least add a comment saying that you're
rounding to two decimal places, or use sprintf/printf instead

> +    my $target_imbalance_str = int(100 * $target_imbalance + 0.5) / 100;
> +    $haenv->log(
> +        'info',
> +        "auto rebalance - $task $sid to $target (expected target imbalance: $target_imbalance_str)",
> +    );
> +
> +    $self->queue_resource_motion($cmd, $task, $sid, $target);
> +}
> +
>  sub cleanup {
>      my ($self) = @_;
>  
> @@ -466,6 +624,21 @@ sub queue_resource_motion {
>      }
>  }
>  
> +sub any_resource_motion_queued_or_running {
> +    my ($self) = @_;
> +
> +    my ($ss) = $self->@{qw(ss)};
> +
> +    for my $sid (keys %$ss) {
> +        my ($cmd, $state) = $ss->{$sid}->@{qw(cmd state)};
> +
> +        return 1 if $state eq 'migrate' || $state eq 'relocate';
> +        return 1 if defined($cmd) && ($cmd->[0] eq 'migrate' || $cmd->[0] eq 'relocate');
> +    }
> +
> +    return 0;
> +}
> +
>  # read new crm commands and save them into crm master status
>  sub update_crm_commands {
>      my ($self) = @_;
> @@ -902,6 +1075,10 @@ sub manage {
>              return; # disarm active and progressing, skip normal service state machine
>          }
>          # disarm deferred - fall through but only process services in transient states
> +    } else {
> +        # load balance only if disarm is disabled as during a deferred disarm
> +        # the HA Manager should not introduce any new migrations
> +        $self->load_balance();
>      }

very thoughtful, nice

>  

[snip]

Reviewed-by: Dominik Rusovac <d.rusovac@proxmox.com>





Thread overview: 72+ messages
2026-03-30 14:30 [PATCH-SERIES cluster/ha-manager/perl-rs/proxmox v3 00/40] dynamic scheduler + load rebalancer Daniel Kral
2026-03-30 14:30 ` [PATCH proxmox v3 01/40] resource-scheduling: inline add_cpu_usage in score_nodes_to_start_service Daniel Kral
2026-03-31  6:01   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH proxmox v3 02/40] resource-scheduling: move score_nodes_to_start_service to scheduler crate Daniel Kral
2026-03-31  6:01   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH proxmox v3 03/40] resource-scheduling: rename service to resource where appropriate Daniel Kral
2026-03-31  6:02   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH proxmox v3 04/40] resource-scheduling: introduce generic scheduler implementation Daniel Kral
2026-03-31  6:11   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH proxmox v3 05/40] resource-scheduling: implement generic cluster usage implementation Daniel Kral
2026-03-31  7:26   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH proxmox v3 06/40] resource-scheduling: topsis: handle empty criteria without panics Daniel Kral
2026-03-30 14:30 ` [PATCH proxmox v3 07/40] resource-scheduling: compare by nodename in score_nodes_to_start_resource Daniel Kral
2026-03-30 14:30 ` [PATCH proxmox v3 08/40] resource-scheduling: factor out topsis alternative mapping Daniel Kral
2026-03-30 14:30 ` [PATCH proxmox v3 09/40] resource-scheduling: implement rebalancing migration selection Daniel Kral
2026-03-31  7:33   ` Dominik Rusovac
2026-03-31 12:42   ` Michael Köppl
2026-03-31 13:32     ` Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 10/40] pve-rs: resource-scheduling: remove pedantic error handling from remove_node Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 11/40] pve-rs: resource-scheduling: remove pedantic error handling from remove_service_usage Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 12/40] pve-rs: resource-scheduling: move pve_static into resource_scheduling module Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 13/40] pve-rs: resource-scheduling: use generic usage implementation Daniel Kral
2026-03-31  7:40   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH perl-rs v3 14/40] pve-rs: resource-scheduling: static: replace deprecated usage structs Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 15/40] pve-rs: resource-scheduling: implement pve_dynamic bindings Daniel Kral
2026-03-30 14:30 ` [PATCH perl-rs v3 16/40] pve-rs: resource-scheduling: expose auto rebalancing methods Daniel Kral
2026-03-30 14:30 ` [PATCH cluster v3 17/40] datacenter config: restructure verbose description for the ha crs option Daniel Kral
2026-03-30 14:30 ` [PATCH cluster v3 18/40] datacenter config: add dynamic load scheduler option Daniel Kral
2026-03-30 14:30 ` [PATCH cluster v3 19/40] datacenter config: add auto rebalancing options Daniel Kral
2026-03-31  7:52   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 20/40] env: pve2: implement dynamic node and service stats Daniel Kral
2026-03-31 13:25   ` Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 21/40] sim: hardware: pass correct types for static stats Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 22/40] sim: hardware: factor out static stats' default values Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 23/40] sim: hardware: fix static stats guard Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 24/40] sim: hardware: handle dynamic service stats Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 25/40] sim: hardware: add set-dynamic-stats command Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 26/40] sim: hardware: add getters for dynamic {node,service} stats Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 27/40] usage: pass service data to add_service_usage Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 28/40] usage: pass service data to get_used_service_nodes Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 29/40] add running flag to non-HA cluster service stats Daniel Kral
2026-03-31  7:58   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 30/40] usage: use add_service to add service usage to nodes Daniel Kral
2026-03-31  8:12   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 31/40] usage: add dynamic usage scheduler Daniel Kral
2026-03-31  8:15   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 32/40] test: add dynamic usage scheduler test cases Daniel Kral
2026-03-31  8:20   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 33/40] manager: rename execute_migration to queue_resource_motion Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 34/40] manager: update_crs_scheduler_mode: factor out crs config Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 35/40] implement automatic rebalancing Daniel Kral
2026-03-31  9:07   ` Dominik Rusovac [this message]
2026-03-31  9:07   ` Michael Köppl
2026-03-31  9:16     ` Dominik Rusovac
2026-03-31  9:32       ` Daniel Kral
2026-03-31  9:39         ` Dominik Rusovac
2026-03-31 13:55           ` Daniel Kral
2026-03-31  9:42     ` Daniel Kral
2026-03-31 11:01       ` Michael Köppl
2026-03-31 13:50   ` Daniel Kral
2026-03-30 14:30 ` [PATCH ha-manager v3 36/40] test: add resource bundle generation test cases Daniel Kral
2026-03-31  9:09   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 37/40] test: add dynamic automatic rebalancing system " Daniel Kral
2026-03-31  9:33   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 38/40] test: add static " Daniel Kral
2026-03-31  9:44   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 39/40] test: add automatic rebalancing system test cases with TOPSIS method Daniel Kral
2026-03-31  9:48   ` Dominik Rusovac
2026-03-30 14:30 ` [PATCH ha-manager v3 40/40] test: add automatic rebalancing system test cases with affinity rules Daniel Kral
2026-03-31 10:06   ` Dominik Rusovac
2026-03-31 20:44 ` partially-applied: [PATCH-SERIES cluster/ha-manager/perl-rs/proxmox v3 00/40] dynamic scheduler + load rebalancer Thomas Lamprecht
2026-04-02 12:55 ` superseded: " Daniel Kral
