public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Daniel Kral <d.kral@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH ha-manager v4 19/28] implement automatic rebalancing
Date: Thu,  2 Apr 2026 14:44:13 +0200	[thread overview]
Message-ID: <20260402124817.416232-20-d.kral@proxmox.com> (raw)
In-Reply-To: <20260402124817.416232-1-d.kral@proxmox.com>

If the automatic load balancing system is enabled, it checks whether the
cluster node imbalance exceeds some user-defined threshold for some HA
Manager rounds ("hold duration"). If it does exceed on consecutive HA
Manager rounds, it will choose the best resource motion to improve the
cluster node imbalance and queue it if it significantly improves it by
some user-defined imbalance improvement ("margin").

This patch introduces resource bundles, which ensure that HA resources
in strict positive resource affinity rules are considered as a whole
"bundle" instead of individual HA resources.

Specifically, active and stationary resource bundles are resource
bundles, that have at least one resource running and all resources
located on the same node. This distinction is needed as newly created
strict positive resource affinity rules may still require some resource
motions to enforce the rule.

Additionally, the migration candidate generation prunes any target
nodes, which do not adhere to the HA rules of these resource bundles
before scoring these migration candidates.

Signed-off-by: Daniel Kral <d.kral@proxmox.com>
---
changes v3 -> v4:
- change imbalance threshold default value from 0.7 to 0.3
- use sprintf() for float number printing instead of perly rounding
  logic
- print before and expected after values
- implement PVE::HA::Usage::Basic rebalancing methods as well with
  sensible return values, but which are only used to not throw errors if
  a failback from 'dynamic'/'static' to 'basic' happens in
  recompute_online_node_usage()

 src/PVE/HA/Manager.pm       | 178 +++++++++++++++++++++++++++++++++++-
 src/PVE/HA/Usage.pm         |  34 +++++++
 src/PVE/HA/Usage/Basic.pm   |  18 ++++
 src/PVE/HA/Usage/Dynamic.pm |  33 +++++++
 src/PVE/HA/Usage/Static.pm  |  33 +++++++
 5 files changed, 295 insertions(+), 1 deletion(-)

diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 2576c762..b69a6bba 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -59,10 +59,17 @@ sub new {
 
     my $self = bless {
         haenv => $haenv,
-        crs => {},
+        crs => {
+            auto_rebalance => {},
+        },
         last_rules_digest => '',
         last_groups_digest => '',
         last_services_digest => '',
+        # used to track how many HA rounds the imbalance threshold has been exceeded
+        #
+        # this is not persisted for a CRM failover as in the mean time
+        # the usage statistics might have change quite a bit already
+        sustained_imbalance_round => 0,
         group_migration_round => 3, # wait a little bit
     }, $class;
 
@@ -97,6 +104,13 @@ sub update_crs_scheduler_mode {
     my $crs_cfg = $dc_cfg->{crs};
 
     $self->{crs}->{rebalance_on_request_start} = !!$crs_cfg->{'ha-rebalance-on-start'};
+    $self->{crs}->{auto_rebalance}->{enable} = !!$crs_cfg->{'ha-auto-rebalance'};
+    $self->{crs}->{auto_rebalance}->{threshold} = $crs_cfg->{'ha-auto-rebalance-threshold'} // 0.3;
+    $self->{crs}->{auto_rebalance}->{method} = $crs_cfg->{'ha-auto-rebalance-method'}
+        // 'bruteforce';
+    $self->{crs}->{auto_rebalance}->{hold_duration} = $crs_cfg->{'ha-auto-rebalance-hold-duration'}
+        // 3;
+    $self->{crs}->{auto_rebalance}->{margin} = $crs_cfg->{'ha-auto-rebalance-margin'} // 0.1;
 
     my $old_mode = $self->{crs}->{scheduler};
     my $new_mode = $crs_cfg->{ha} || 'basic';
@@ -114,6 +128,149 @@ sub update_crs_scheduler_mode {
     return;
 }
 
+# Returns a hash of lists, which contain the running, non-moving HA resource
+# bundles, which are on the same node, implied by the strict positive resource
+# affinity rules.
+#
+# Each resource bundle has a leader, which is the alphabetically first running
+# HA resource in the resource bundle and also the key of each resource bundle
+# in the returned hash.
+sub get_active_stationary_resource_bundles {
+    my ($ss, $resource_affinity) = @_;
+
+    my $resource_bundles = {};
+OUTER: for my $sid (sort keys %$ss) {
+        # do not consider non-started resource as 'active' leading resource
+        next if $ss->{$sid}->{state} ne 'started';
+
+        my @resources = ($sid);
+        my $nodes = { $ss->{$sid}->{node} => 1 };
+
+        my ($dependent_resources) = get_affinitive_resources($resource_affinity, $sid);
+        if (%$dependent_resources) {
+            for my $csid (keys %$dependent_resources) {
+                next if !defined($ss->{$csid});
+                my ($state, $node) = $ss->{$csid}->@{qw(state node)};
+
+                # do not consider stationary bundle if a dependent resource moves
+                next OUTER if $state eq 'migrate' || $state eq 'relocate';
+                # do not add non-started resource to active bundle
+                next if $state ne 'started';
+
+                $nodes->{$node} = 1;
+
+                push @resources, $csid;
+            }
+
+            @resources = sort @resources;
+        }
+
+        # skip resource bundles, which are not on the same node yet
+        next if keys %$nodes > 1;
+
+        my $leader_sid = $resources[0];
+
+        $resource_bundles->{$leader_sid} = \@resources;
+    }
+
+    return $resource_bundles;
+}
+
+# Returns a hash of hashes, where each item contains the resource bundle's
+# leader, the list of HA resources in the resource bundle, and the list of
+# possible nodes to migrate to.
+sub get_resource_migration_candidates {
+    my ($self) = @_;
+
+    my ($ss, $compiled_rules, $online_node_usage) =
+        $self->@{qw(ss compiled_rules online_node_usage)};
+    my ($node_affinity, $resource_affinity) =
+        $compiled_rules->@{qw(node-affinity resource-affinity)};
+
+    my $resource_bundles = get_active_stationary_resource_bundles($ss, $resource_affinity);
+
+    my @compact_migration_candidates = ();
+    for my $leader_sid (sort keys %$resource_bundles) {
+        my $current_leader_node = $ss->{$leader_sid}->{node};
+        my $online_nodes = { map { $_ => 1 } $online_node_usage->list_nodes() };
+
+        my (undef, $target_nodes) = get_node_affinity($node_affinity, $leader_sid, $online_nodes);
+        my ($together, $separate) =
+            get_resource_affinity($resource_affinity, $leader_sid, $ss, $online_nodes);
+        apply_negative_resource_affinity($separate, $target_nodes);
+
+        delete $target_nodes->{$current_leader_node};
+
+        next if !%$target_nodes;
+
+        push @compact_migration_candidates,
+            {
+                leader => $leader_sid,
+                nodes => [sort keys %$target_nodes],
+                resources => $resource_bundles->{$leader_sid},
+            };
+    }
+
+    return \@compact_migration_candidates;
+}
+
+sub load_balance {
+    my ($self) = @_;
+
+    my ($crs, $haenv, $online_node_usage) = $self->@{qw(crs haenv online_node_usage)};
+    my ($auto_rebalance_opts) = $crs->{auto_rebalance};
+
+    return if !$auto_rebalance_opts->{enable};
+    return if $crs->{scheduler} ne 'static' && $crs->{scheduler} ne 'dynamic';
+    return if $self->any_resource_motion_queued_or_running();
+
+    my ($threshold, $method, $hold_duration, $margin) =
+        $auto_rebalance_opts->@{qw(threshold method hold_duration margin)};
+
+    my $imbalance = $online_node_usage->calculate_node_imbalance();
+
+    # do not load balance unless imbalance threshold has been exceeded
+    # consecutively for $hold_duration calls to load_balance();
+    # the <= relation prevents load balancing from triggering for $imbalance = 0.0
+    if ($imbalance <= $threshold) {
+        $self->{sustained_imbalance_round} = 0;
+        return;
+    } else {
+        $self->{sustained_imbalance_round}++;
+        return if $self->{sustained_imbalance_round} < $hold_duration;
+        $self->{sustained_imbalance_round} = 0;
+    }
+
+    my $candidates = $self->get_resource_migration_candidates();
+
+    my $result;
+    if ($method eq 'bruteforce') {
+        $result = $online_node_usage->select_best_balancing_migration($candidates);
+    } elsif ($method eq 'topsis') {
+        $result = $online_node_usage->select_best_balancing_migration_topsis($candidates);
+    }
+
+    # happens if $candidates is empty or $method isn't handled above
+    return if !$result;
+
+    my ($migration, $target_imbalance) = $result->@{qw(migration imbalance)};
+
+    my $relative_change = ($imbalance - $target_imbalance) / $imbalance;
+    return if $relative_change < $margin;
+
+    my ($sid, $source, $target) = $migration->@{qw(sid source-node target-node)};
+
+    my (undef, $type, $id) = $haenv->parse_sid($sid);
+    my $task = $type eq 'vm' ? "migrate" : "relocate";
+    my $cmd = "$task $sid $target";
+
+    my $imbalance_change_str =
+        sprintf("expected change for imbalance from %.2f to %.2f", $imbalance, $target_imbalance);
+    $haenv->log('info', "auto rebalance - $task $sid to $target ($imbalance_change_str)");
+
+    $self->queue_resource_motion($cmd, $task, $sid, $target);
+}
+
 sub cleanup {
     my ($self) = @_;
 
@@ -466,6 +623,21 @@ sub queue_resource_motion {
     }
 }
 
+sub any_resource_motion_queued_or_running {
+    my ($self) = @_;
+
+    my ($ss) = $self->@{qw(ss)};
+
+    for my $sid (keys %$ss) {
+        my ($cmd, $state) = $ss->{$sid}->@{qw(cmd state)};
+
+        return 1 if $state eq 'migrate' || $state eq 'relocate';
+        return 1 if defined($cmd) && ($cmd->[0] eq 'migrate' || $cmd->[0] eq 'relocate');
+    }
+
+    return 0;
+}
+
 # read new crm commands and save them into crm master status
 sub update_crm_commands {
     my ($self) = @_;
@@ -902,6 +1074,10 @@ sub manage {
             return; # disarm active and progressing, skip normal service state machine
         }
         # disarm deferred - fall through but only process services in transient states
+    } else {
+        # load balance only if disarm is disabled as during a deferred disarm
+        # the HA Manager should not introduce any new migrations
+        $self->load_balance();
     }
 
     $self->{all_lrms_disarmed} = 0;
diff --git a/src/PVE/HA/Usage.pm b/src/PVE/HA/Usage.pm
index 43feb041..659ab30a 100644
--- a/src/PVE/HA/Usage.pm
+++ b/src/PVE/HA/Usage.pm
@@ -60,6 +60,40 @@ sub remove_service_usage {
     die "implement in subclass";
 }
 
+sub calculate_node_imbalance {
+    my ($self) = @_;
+
+    die "implement in subclass";
+}
+
+sub score_best_balancing_migrations {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    die "implement in subclass";
+}
+
+sub select_best_balancing_migration {
+    my ($self, $migration_candidates) = @_;
+
+    my $migrations = $self->score_best_balancing_migrations($migration_candidates, 1);
+
+    return $migrations->[0];
+}
+
+sub score_best_balancing_migrations_topsis {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    die "implement in subclass";
+}
+
+sub select_best_balancing_migration_topsis {
+    my ($self, $migration_candidates) = @_;
+
+    my $migrations = $self->score_best_balancing_migrations_topsis($migration_candidates, 1);
+
+    return $migrations->[0];
+}
+
 # Returns a hash with $nodename => $score pairs. A lower $score is better.
 sub score_nodes_to_start_service {
     my ($self, $sid) = @_;
diff --git a/src/PVE/HA/Usage/Basic.pm b/src/PVE/HA/Usage/Basic.pm
index 5aa3ac05..4dce9e17 100644
--- a/src/PVE/HA/Usage/Basic.pm
+++ b/src/PVE/HA/Usage/Basic.pm
@@ -66,6 +66,24 @@ sub remove_service_usage {
     }
 }
 
+sub calculate_node_imbalance {
+    my ($self) = @_;
+
+    return 0.0;
+}
+
+sub score_best_balancing_migrations {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    return [];
+}
+
+sub score_best_balancing_migrations_topsis {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    return [];
+}
+
 sub score_nodes_to_start_service {
     my ($self, $sid) = @_;
 
diff --git a/src/PVE/HA/Usage/Dynamic.pm b/src/PVE/HA/Usage/Dynamic.pm
index 24c85a41..76d0feaa 100644
--- a/src/PVE/HA/Usage/Dynamic.pm
+++ b/src/PVE/HA/Usage/Dynamic.pm
@@ -104,6 +104,39 @@ sub remove_service_usage {
     $self->{haenv}->log('warning', "unable to remove service '$sid' usage - $@") if $@;
 }
 
+sub calculate_node_imbalance {
+    my ($self) = @_;
+
+    my $node_imbalance = eval { $self->{scheduler}->calculate_node_imbalance() };
+    $self->{haenv}->log('warning', "unable to calculate dynamic node imbalance - $@") if $@;
+
+    return $node_imbalance // 0.0;
+}
+
+sub score_best_balancing_migrations {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    my $migrations = eval {
+        $self->{scheduler}
+            ->score_best_balancing_migration_candidates($migration_candidates, $limit);
+    };
+    $self->{haenv}->log('warning', "unable to score best balancing migration - $@") if $@;
+
+    return $migrations;
+}
+
+sub score_best_balancing_migrations_topsis {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    my $migrations = eval {
+        $self->{scheduler}
+            ->score_best_balancing_migration_candidates_topsis($migration_candidates, $limit);
+    };
+    $self->{haenv}->log('warning', "unable to score best balancing migration - $@") if $@;
+
+    return $migrations;
+}
+
 sub score_nodes_to_start_service {
     my ($self, $sid) = @_;
 
diff --git a/src/PVE/HA/Usage/Static.pm b/src/PVE/HA/Usage/Static.pm
index 8c7a614b..e67d5f5b 100644
--- a/src/PVE/HA/Usage/Static.pm
+++ b/src/PVE/HA/Usage/Static.pm
@@ -111,6 +111,39 @@ sub remove_service_usage {
     $self->{haenv}->log('warning', "unable to remove service '$sid' usage - $@") if $@;
 }
 
+sub calculate_node_imbalance {
+    my ($self) = @_;
+
+    my $node_imbalance = eval { $self->{scheduler}->calculate_node_imbalance() };
+    $self->{haenv}->log('warning', "unable to calculate static node imbalance - $@") if $@;
+
+    return $node_imbalance // 0.0;
+}
+
+sub score_best_balancing_migrations {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    my $migrations = eval {
+        $self->{scheduler}
+            ->score_best_balancing_migration_candidates($migration_candidates, $limit);
+    };
+    $self->{haenv}->log('warning', "unable to score best balancing migration - $@") if $@;
+
+    return $migrations;
+}
+
+sub score_best_balancing_migrations_topsis {
+    my ($self, $migration_candidates, $limit) = @_;
+
+    my $migrations = eval {
+        $self->{scheduler}
+            ->score_best_balancing_migration_candidates_topsis($migration_candidates, $limit);
+    };
+    $self->{haenv}->log('warning', "unable to score best balancing migration - $@") if $@;
+
+    return $migrations;
+}
+
 sub score_nodes_to_start_service {
     my ($self, $sid) = @_;
 
-- 
2.47.3





  parent reply	other threads:[~2026-04-02 12:51 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-02 12:43 [PATCH cluster/ha-manager/manager v4 00/28] dynamic scheduler + load rebalancer Daniel Kral
2026-04-02 12:43 ` [PATCH cluster v4 01/28] datacenter config: restructure verbose description for the ha crs option Daniel Kral
2026-04-02 12:43 ` [PATCH cluster v4 02/28] datacenter config: add dynamic load scheduler option Daniel Kral
2026-04-02 12:43 ` [PATCH cluster v4 03/28] datacenter config: add auto rebalancing options Daniel Kral
2026-04-02 13:07   ` Dominik Rusovac
2026-04-02 12:43 ` [PATCH ha-manager v4 04/28] env: pve2: implement dynamic node and service stats Daniel Kral
2026-04-02 13:40   ` Dominik Rusovac
2026-04-02 12:43 ` [PATCH ha-manager v4 05/28] sim: hardware: pass correct types for static stats Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 06/28] sim: hardware: factor out static stats' default values Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 07/28] sim: hardware: fix static stats guard Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 08/28] sim: hardware: handle dynamic service stats Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 09/28] sim: hardware: add set-dynamic-stats command Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 10/28] sim: hardware: add getters for dynamic {node,service} stats Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 11/28] usage: pass service data to add_service_usage Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 12/28] usage: pass service data to get_used_service_nodes Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 13/28] add running flag to non-HA cluster service stats Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 14/28] usage: use add_service to add service usage to nodes Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 15/28] usage: add dynamic usage scheduler Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 16/28] test: add dynamic usage scheduler test cases Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 17/28] manager: rename execute_migration to queue_resource_motion Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 18/28] manager: update_crs_scheduler_mode: factor out crs config Daniel Kral
2026-04-02 12:44 ` Daniel Kral [this message]
2026-04-02 13:14   ` [PATCH ha-manager v4 19/28] implement automatic rebalancing Dominik Rusovac
2026-04-02 12:44 ` [PATCH ha-manager v4 20/28] test: add resource bundle generation test cases Daniel Kral
2026-04-02 12:44 ` [PATCH ha-manager v4 21/28] test: add dynamic automatic rebalancing system " Daniel Kral
2026-04-02 13:21   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH ha-manager v4 22/28] test: add static " Daniel Kral
2026-04-02 13:23   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH ha-manager v4 23/28] test: add automatic rebalancing system test cases with TOPSIS method Daniel Kral
2026-04-02 13:29   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH ha-manager v4 24/28] test: add automatic rebalancing system test cases with affinity rules Daniel Kral
2026-04-02 12:44 ` [PATCH manager v4 25/28] ui: dc/options: make the ha crs strings translatable Daniel Kral
2026-04-02 13:33   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH manager v4 26/28] ui: dc/options: add dynamic load scheduler option for ha crs Daniel Kral
2026-04-02 13:33   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH manager v4 27/28] ui: move cluster resource scheduling from dc/options into separate component Daniel Kral
2026-04-02 13:35   ` Dominik Rusovac
2026-04-02 12:44 ` [PATCH manager v4 28/28] ui: form: add crs auto rebalancing options Daniel Kral
2026-04-02 13:38   ` Dominik Rusovac
2026-04-02 14:24 ` [PATCH cluster/ha-manager/manager v4 00/28] dynamic scheduler + load rebalancer Dominik Rusovac
2026-04-02 16:07 ` applied: " Thomas Lamprecht

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260402124817.416232-20-d.kral@proxmox.com \
    --to=d.kral@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal