From: "Dominik Rusovac" <d.rusovac@proxmox.com>
To: "Daniel Kral" <d.kral@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH ha-manager v3 35/40] implement automatic rebalancing
Date: Tue, 31 Mar 2026 11:07:01 +0200	[thread overview]
Message-ID: <DHGUK85D97LK.3I0CO8FIJOCJG@proxmox.com> (raw)
In-Reply-To: <20260330144101.668747-36-d.kral@proxmox.com>

one nit inline below

otherwise lgtm, please consider the Reviewed-by trailer at the end

On Mon Mar 30, 2026 at 4:30 PM CEST, Daniel Kral wrote:
> If the automatic load balancing system is enabled, it checks whether the
> cluster node imbalance exceeds some user-defined threshold for some HA
> Manager rounds ("hold duration"). If it does exceed on consecutive HA
> Manager rounds, it will choose the best resource motion to improve the
> cluster node imbalance and queue it if it significantly improves it by
> some user-defined imbalance improvement ("margin").
>
> This patch introduces resource bundles, which ensure that HA resources
> in strict positive resource affinity rules are considered as a whole
> "bundle" instead of individual HA resources.
>
> Specifically, active and stationary resource bundles are resource
> bundles, that have at least one resource running and all resources
> located on the same node. This distinction is needed as newly created
> strict positive resource affinity rules may still require some resource
> motions to enforce the rule.
>
> Additionally, the migration candidate generation prunes any target
> nodes, which do not adhere to the HA rules of these resource bundles
> before scoring these migration candidates.

nice work! also very nice idea to introduce resource bundles

>
> Signed-off-by: Daniel Kral <d.kral@proxmox.com>
> ---
> changes v2 -> v3:
> - none
>

[snip]

> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 2576c762..0f8a03a6 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -59,10 +59,17 @@ sub new {
>  
>      my $self = bless {
>          haenv => $haenv,
> -        crs => {},
> +        crs => {
> +            auto_rebalance => {},
> +        },
>          last_rules_digest => '',
>          last_groups_digest => '',
>          last_services_digest => '',
> +        # used to track how many HA rounds the imbalance threshold has been exceeded
> +        #
> +        # this is not persisted for a CRM failover as in the mean time
> +        # the usage statistics might have change quite a bit already

nit:

           # the usage statistics might have change[d] quite a bit already

> +        sustained_imbalance_round => 0,
>          group_migration_round => 3, # wait a little bit
>      }, $class;
>  
> @@ -97,6 +104,13 @@ sub update_crs_scheduler_mode {
>      my $crs_cfg = $dc_cfg->{crs};
>  
>      $self->{crs}->{rebalance_on_request_start} = !!$crs_cfg->{'ha-rebalance-on-start'};
> +    $self->{crs}->{auto_rebalance}->{enable} = !!$crs_cfg->{'ha-auto-rebalance'};
> +    $self->{crs}->{auto_rebalance}->{threshold} = $crs_cfg->{'ha-auto-rebalance-threshold'} // 0.7;
> +    $self->{crs}->{auto_rebalance}->{method} = $crs_cfg->{'ha-auto-rebalance-method'}
> +        // 'bruteforce';
> +    $self->{crs}->{auto_rebalance}->{hold_duration} = $crs_cfg->{'ha-auto-rebalance-hold-duration'}
> +        // 3;
> +    $self->{crs}->{auto_rebalance}->{margin} = $crs_cfg->{'ha-auto-rebalance-margin'} // 0.1;
>  
>      my $old_mode = $self->{crs}->{scheduler};
>      my $new_mode = $crs_cfg->{ha} || 'basic';
> @@ -114,6 +128,150 @@ sub update_crs_scheduler_mode {
>      return;
>  }
>  
> +# Returns a hash of lists, which contain the running, non-moving HA resource
> +# bundles, which are on the same node, implied by the strict positive resource
> +# affinity rules.
> +#
> +# Each resource bundle has a leader, which is the alphabetically first running
> +# HA resource in the resource bundle and also the key of each resource bundle
> +# in the returned hash.
> +sub get_active_stationary_resource_bundles {
> +    my ($ss, $resource_affinity) = @_;
> +
> +    my $resource_bundles = {};
> +OUTER: for my $sid (sort keys %$ss) {
> +        # do not consider non-started resource as 'active' leading resource
> +        next if $ss->{$sid}->{state} ne 'started';
> +
> +        my @resources = ($sid);
> +        my $nodes = { $ss->{$sid}->{node} => 1 };
> +
> +        my ($dependent_resources) = get_affinitive_resources($resource_affinity, $sid);
> +        if (%$dependent_resources) {
> +            for my $csid (keys %$dependent_resources) {
> +                next if !defined($ss->{$csid});
> +                my ($state, $node) = $ss->{$csid}->@{qw(state node)};
> +
> +                # do not consider stationary bundle if a dependent resource moves
> +                next OUTER if $state eq 'migrate' || $state eq 'relocate';
> +                # do not add non-started resource to active bundle
> +                next if $state ne 'started';
> +
> +                $nodes->{$node} = 1;
> +
> +                push @resources, $csid;
> +            }
> +
> +            @resources = sort @resources;
> +        }
> +
> +        # skip resource bundles, which are not on the same node yet
> +        next if keys %$nodes > 1;
> +
> +        my $leader_sid = $resources[0];
> +
> +        $resource_bundles->{$leader_sid} = \@resources;
> +    }
> +
> +    return $resource_bundles;
> +}
> +
> +# Returns a hash of hashes, where each item contains the resource bundle's
> +# leader, the list of HA resources in the resource bundle, and the list of
> +# possible nodes to migrate to.
> +sub get_resource_migration_candidates {
> +    my ($self) = @_;
> +
> +    my ($ss, $compiled_rules, $online_node_usage) =
> +        $self->@{qw(ss compiled_rules online_node_usage)};
> +    my ($node_affinity, $resource_affinity) =
> +        $compiled_rules->@{qw(node-affinity resource-affinity)};
> +
> +    my $resource_bundles = get_active_stationary_resource_bundles($ss, $resource_affinity);
> +
> +    my @compact_migration_candidates = ();
> +    for my $leader_sid (sort keys %$resource_bundles) {
> +        my $current_leader_node = $ss->{$leader_sid}->{node};
> +        my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
> +
> +        my (undef, $target_nodes) = get_node_affinity($node_affinity, $leader_sid, $online_nodes);
> +        my ($together, $separate) =
> +            get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
> +        apply_negative_resource_affinity($separate, $target_nodes);
> +
> +        delete $target_nodes->{$current_leader_node};
> +
> +        next if !%$target_nodes;
> +
> +        push @compact_migration_candidates,
> +            {
> +                leader => $leader_sid,
> +                nodes => [sort keys %$target_nodes],
> +                resources => $resource_bundles->{$leader_sid},
> +            };
> +    }
> +
> +    return \@compact_migration_candidates;
> +}
> +
> +sub load_balance {
> +    my ($self) = @_;
> +
> +    my ($crs, $haenv, $online_node_usage) = $self->@{qw(crs haenv online_node_usage)};
> +    my ($auto_rebalance_opts) = $crs->{auto_rebalance};
> +
> +    return if !$auto_rebalance_opts->{enable};
> +    return if $crs->{scheduler} ne 'static' && $crs->{scheduler} ne 'dynamic';
> +    return if $self->any_resource_motion_queued_or_running();
> +
> +    my ($threshold, $method, $hold_duration, $margin) =
> +        $auto_rebalance_opts->@{qw(threshold method hold_duration margin)};
> +
> +    my $imbalance = $online_node_usage->calculate_node_imbalance();
> +
> +    # do not load balance unless imbalance threshold has been exceeded
> +    # consecutively for $hold_duration calls to load_balance()
> +    if ($imbalance < $threshold) {
> +        $self->{sustained_imbalance_round} = 0;
> +        return;
> +    } else {
> +        $self->{sustained_imbalance_round}++;
> +        return if $self->{sustained_imbalance_round} < $hold_duration;
> +        $self->{sustained_imbalance_round} = 0;
> +    }
> +
> +    my $candidates = $self->get_resource_migration_candidates();
> +
> +    my $result;
> +    if ($method eq 'bruteforce') {
> +        $result = $online_node_usage->select_best_balancing_migration($candidates);
> +    } elsif ($method eq 'topsis') {
> +        $result = $online_node_usage->select_best_balancing_migration_topsis($candidates);
> +    }
> +
> +    # happens if $candidates is empty or $method isn't handled above
> +    return if !$result;
> +
> +    my ($migration, $target_imbalance) = $result->@{qw(migration imbalance)};
> +
> +    my $relative_change = ($imbalance - $target_imbalance) / $imbalance;
> +    return if $relative_change < $margin;
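
just to sanity-check the margin semantics for myself (numbers are made up, margin is the 0.1 default from above): an imbalance of 0.8 improving to 0.7 is a relative change of 0.125 and passes, while 0.8 to 0.75 is only 0.0625 and gets skipped:

```perl
use strict;
use warnings;

my $margin = 0.1;    # default 'ha-auto-rebalance-margin'
for my $case ([0.8, 0.7], [0.8, 0.75]) {
    my ($imbalance, $target) = @$case;
    my $relative_change = ($imbalance - $target) / $imbalance;
    my $verdict = $relative_change < $margin ? 'skip' : 'migrate';
    printf "%.2f -> %.2f: relative change %.4f => %s\n",
        $imbalance, $target, $relative_change, $verdict;
}
```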
> +
> +    my ($sid, $source, $target) = $migration->@{qw(sid source-node target-node)};
> +
> +    my (undef, $type, $id) = $haenv->parse_sid($sid);
> +    my $task = $type eq 'vm' ? "migrate" : "relocate";
> +    my $cmd = "$task $sid $target";
> +

nit: tbh I racked my brain a bit to figure out how this rounding
technique works; please at least add a comment saying that you're
rounding here, or use (s)printf instead
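
e.g. something like this would both round to two decimals and make the intent obvious (for the positive values we have here, sprintf rounds to nearest just like the `int(100 * $x + 0.5) / 100` idiom):

```perl
use strict;
use warnings;

my $target_imbalance = 0.123456;  # made-up value for illustration
my $target_imbalance_str = sprintf("%.2f", $target_imbalance);
print "$target_imbalance_str\n";  # "0.12"
```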

> +    my $target_imbalance_str = int(100 * $target_imbalance + 0.5) / 100;
> +    $haenv->log(
> +        'info',
> +        "auto rebalance - $task $sid to $target (expected target imbalance: $target_imbalance_str)",
> +    );
> +
> +    $self->queue_resource_motion($cmd, $task, $sid, $target);
> +}
> +
>  sub cleanup {
>      my ($self) = @_;
>  
> @@ -466,6 +624,21 @@ sub queue_resource_motion {
>      }
>  }
>  
> +sub any_resource_motion_queued_or_running {
> +    my ($self) = @_;
> +
> +    my ($ss) = $self->@{qw(ss)};
> +
> +    for my $sid (keys %$ss) {
> +        my ($cmd, $state) = $ss->{$sid}->@{qw(cmd state)};
> +
> +        return 1 if $state eq 'migrate' || $state eq 'relocate';
> +        return 1 if defined($cmd) && ($cmd->[0] eq 'migrate' || $cmd->[0] eq 'relocate');
> +    }
> +
> +    return 0;
> +}
> +
>  # read new crm commands and save them into crm master status
>  sub update_crm_commands {
>      my ($self) = @_;
> @@ -902,6 +1075,10 @@ sub manage {
>              return; # disarm active and progressing, skip normal service state machine
>          }
>          # disarm deferred - fall through but only process services in transient states
> +    } else {
> +        # load balance only if disarm is disabled as during a deferred disarm
> +        # the HA Manager should not introduce any new migrations
> +        $self->load_balance();
>      }

very thoughtful, nice

>  

[snip]

Reviewed-by: Dominik Rusovac <d.rusovac@proxmox.com>




