From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) by lore.proxmox.com (Postfix) with ESMTPS id 27C9A1FF137 for ; Tue, 31 Mar 2026 11:06:40 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 1081114C07; Tue, 31 Mar 2026 11:07:07 +0200 (CEST) Mime-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Tue, 31 Mar 2026 11:07:01 +0200 Message-Id: Subject: Re: [PATCH ha-manager v3 35/40] implement automatic rebalancing From: "Dominik Rusovac" To: "Daniel Kral" , X-Mailer: aerc 0.20.0 References: <20260330144101.668747-1-d.kral@proxmox.com> <20260330144101.668747-36-d.kral@proxmox.com> In-Reply-To: <20260330144101.668747-36-d.kral@proxmox.com> X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1774947965663 X-SPAM-LEVEL: Spam detection results: 0 AWL -1.024 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_VALIDITY_CERTIFIED_BLOCKED 1 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_RPBL_BLOCKED 1 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_SAFE_BLOCKED 1 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: ICN2JHXJOKOS4J76EDE6JQEEZFTJE6MP X-Message-ID-Hash: ICN2JHXJOKOS4J76EDE6JQEEZFTJE6MP X-MailFrom: d.rusovac@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: one nit inline lgtm, consider this On Mon Mar 30, 2026 at 4:30 PM CEST, Daniel Kral wrote: > If the automatic load balancing system is enabled, it checks whether the > cluster node imbalance exceeds some user-defined threshold for some HA > Manager rounds ("hold duration"). If it does exceed on consecutive HA > Manager rounds, it will choose the best resource motion to improve the > cluster node imbalance and queue it if it significantly improves it by > some user-defined imbalance improvement ("margin"). > > This patch introduces resource bundles, which ensure that HA resources > in strict positive resource affinity rules are considered as a whole > "bundle" instead of individual HA resources. > > Specifically, active and stationary resource bundles are resource > bundles, that have at least one resource running and all resources > located on the same node. This distinction is needed as newly created > strict positive resource affinity rules may still require some resource > motions to enforce the rule. > > Additionally, the migration candidate generation prunes any target > nodes, which do not adhere to the HA rules of these resource bundles > before scoring these migration candidates. nice work! also very nice idea to introduce resource bundles > > Signed-off-by: Daniel Kral > --- > changes v2 -> v3: > - none > [snip] > diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm > index 2576c762..0f8a03a6 100644 > --- a/src/PVE/HA/Manager.pm > +++ b/src/PVE/HA/Manager.pm > @@ -59,10 +59,17 @@ sub new { > =20 > my $self =3D bless { > haenv =3D> $haenv, > - crs =3D> {}, > + crs =3D> { > + auto_rebalance =3D> {}, > + }, > last_rules_digest =3D> '', > last_groups_digest =3D> '', > last_services_digest =3D> '', > + # used to track how many HA rounds the imbalance threshold has b= een exceeded > + # > + # this is not persisted for a CRM failover as in the mean time > + # the usage statistics might have change quite a bit already nit: # the usage statistics might have change[d] quite a bit already > + sustained_imbalance_round =3D> 0, > group_migration_round =3D> 3, # wait a little bit > }, $class; > =20 > @@ -97,6 +104,13 @@ sub update_crs_scheduler_mode { > my $crs_cfg =3D $dc_cfg->{crs}; > =20 > $self->{crs}->{rebalance_on_request_start} =3D !!$crs_cfg->{'ha-reba= lance-on-start'}; > + $self->{crs}->{auto_rebalance}->{enable} =3D !!$crs_cfg->{'ha-auto-r= ebalance'}; > + $self->{crs}->{auto_rebalance}->{threshold} =3D $crs_cfg->{'ha-auto-= rebalance-threshold'} // 0.7; > + $self->{crs}->{auto_rebalance}->{method} =3D $crs_cfg->{'ha-auto-reb= alance-method'} > + // 'bruteforce'; > + $self->{crs}->{auto_rebalance}->{hold_duration} =3D $crs_cfg->{'ha-a= uto-rebalance-hold-duration'} > + // 3; > + $self->{crs}->{auto_rebalance}->{margin} =3D $crs_cfg->{'ha-auto-reb= alance-margin'} // 0.1; > =20 > my $old_mode =3D $self->{crs}->{scheduler}; > my $new_mode =3D $crs_cfg->{ha} || 'basic'; > @@ -114,6 +128,150 @@ sub update_crs_scheduler_mode { > return; > } > =20 > +# Returns a hash of lists, which contain the running, non-moving HA reso= urce > +# bundles, which are on the same node, implied by the strict positive re= source > +# affinity rules. > +# > +# Each resource bundle has a leader, which is the alphabetically first r= unning > +# HA resource in the resource bundle and also the key of each resource b= undle > +# in the returned hash. > +sub get_active_stationary_resource_bundles { > + my ($ss, $resource_affinity) =3D @_; > + > + my $resource_bundles =3D {}; > +OUTER: for my $sid (sort keys %$ss) { > + # do not consider non-started resource as 'active' leading resou= rce > + next if $ss->{$sid}->{state} ne 'started'; > + > + my @resources =3D ($sid); > + my $nodes =3D { $ss->{$sid}->{node} =3D> 1 }; > + > + my ($dependent_resources) =3D get_affinitive_resources($resource= _affinity, $sid); > + if (%$dependent_resources) { > + for my $csid (keys %$dependent_resources) { > + next if !defined($ss->{$csid}); > + my ($state, $node) =3D $ss->{$csid}->@{qw(state node)}; > + > + # do not consider stationary bundle if a dependent resou= rce moves > + next OUTER if $state eq 'migrate' || $state eq 'relocate= '; > + # do not add non-started resource to active bundle > + next if $state ne 'started'; > + > + $nodes->{$node} =3D 1; > + > + push @resources, $csid; > + } > + > + @resources =3D sort @resources; > + } > + > + # skip resource bundles, which are not on the same node yet > + next if keys %$nodes > 1; > + > + my $leader_sid =3D $resources[0]; > + > + $resource_bundles->{$leader_sid} =3D \@resources; > + } > + > + return $resource_bundles; > +} > + > +# Returns a hash of hashes, where each item contains the resource bundle= 's > +# leader, the list of HA resources in the resource bundle, and the list = of > +# possible nodes to migrate to. > +sub get_resource_migration_candidates { > + my ($self) =3D @_; > + > + my ($ss, $compiled_rules, $online_node_usage) =3D > + $self->@{qw(ss compiled_rules online_node_usage)}; > + my ($node_affinity, $resource_affinity) =3D > + $compiled_rules->@{qw(node-affinity resource-affinity)}; > + > + my $resource_bundles =3D get_active_stationary_resource_bundles($ss,= $resource_affinity); > + > + my @compact_migration_candidates =3D (); > + for my $leader_sid (sort keys %$resource_bundles) { > + my $current_leader_node =3D $ss->{$leader_sid}->{node}; > + my $online_nodes =3D { map { $_ =3D> 1 } $online_node_usage->lis= t_nodes() }; > + > + my (undef, $target_nodes) =3D get_node_affinity($node_affinity, = $leader_sid, $online_nodes); > + my ($together, $separate) =3D > + get_resource_affinity($resource_affinity, $leader_sid, $ss, = $online_nodes); > + apply_negative_resource_affinity($separate, $target_nodes); > + > + delete $target_nodes->{$current_leader_node}; > + > + next if !%$target_nodes; > + > + push @compact_migration_candidates, > + { > + leader =3D> $leader_sid, > + nodes =3D> [sort keys %$target_nodes], > + resources =3D> $resource_bundles->{$leader_sid}, > + }; > + } > + > + return \@compact_migration_candidates; > +} > + > +sub load_balance { > + my ($self) =3D @_; > + > + my ($crs, $haenv, $online_node_usage) =3D $self->@{qw(crs haenv onli= ne_node_usage)}; > + my ($auto_rebalance_opts) =3D $crs->{auto_rebalance}; > + > + return if !$auto_rebalance_opts->{enable}; > + return if $crs->{scheduler} ne 'static' && $crs->{scheduler} ne 'dyn= amic'; > + return if $self->any_resource_motion_queued_or_running(); > + > + my ($threshold, $method, $hold_duration, $margin) =3D > + $auto_rebalance_opts->@{qw(threshold method hold_duration margin= )}; > + > + my $imbalance =3D $online_node_usage->calculate_node_imbalance(); > + > + # do not load balance unless imbalance threshold has been exceeded > + # consecutively for $hold_duration calls to load_balance() > + if ($imbalance < $threshold) { > + $self->{sustained_imbalance_round} =3D 0; > + return; > + } else { > + $self->{sustained_imbalance_round}++; > + return if $self->{sustained_imbalance_round} < $hold_duration; > + $self->{sustained_imbalance_round} =3D 0; > + } > + > + my $candidates =3D $self->get_resource_migration_candidates(); > + > + my $result; > + if ($method eq 'bruteforce') { > + $result =3D $online_node_usage->select_best_balancing_migration(= $candidates); > + } elsif ($method eq 'topsis') { > + $result =3D $online_node_usage->select_best_balancing_migration_= topsis($candidates); > + } > + > + # happens if $candidates is empty or $method isn't handled above > + return if !$result; > + > + my ($migration, $target_imbalance) =3D $result->@{qw(migration imbal= ance)}; > + > + my $relative_change =3D ($imbalance - $target_imbalance) / $imbalanc= e; > + return if $relative_change < $margin; > + > + my ($sid, $source, $target) =3D $migration->@{qw(sid source-node tar= get-node)}; > + > + my (undef, $type, $id) =3D $haenv->parse_sid($sid); > + my $task =3D $type eq 'vm' ? "migrate" : "relocate"; > + my $cmd =3D "$task $sid $target"; > + nit: tbh I racked my brain a bit to get how this rounding technique worked; so pls at least add a comment to say that you're rounding or use printf > + my $target_imbalance_str =3D int(100 * $target_imbalance + 0.5) / 10= 0; > + $haenv->log( > + 'info', > + "auto rebalance - $task $sid to $target (expected target imbalan= ce: $target_imbalance_str)", > + ); > + > + $self->queue_resource_motion($cmd, $task, $sid, $target); > +} > + > sub cleanup { > my ($self) =3D @_; > =20 > @@ -466,6 +624,21 @@ sub queue_resource_motion { > } > } > =20 > +sub any_resource_motion_queued_or_running { > + my ($self) =3D @_; > + > + my ($ss) =3D $self->@{qw(ss)}; > + > + for my $sid (keys %$ss) { > + my ($cmd, $state) =3D $ss->{$sid}->@{qw(cmd state)}; > + > + return 1 if $state eq 'migrate' || $state eq 'relocate'; > + return 1 if defined($cmd) && ($cmd->[0] eq 'migrate' || $cmd->[0= ] eq 'relocate'); > + } > + > + return 0; > +} > + > # read new crm commands and save them into crm master status > sub update_crm_commands { > my ($self) =3D @_; > @@ -902,6 +1075,10 @@ sub manage { > return; # disarm active and progressing, skip normal service= state machine > } > # disarm deferred - fall through but only process services in tr= ansient states > + } else { > + # load balance only if disarm is disabled as during a deferred d= isarm > + # the HA Manager should not introduce any new migrations > + $self->load_balance(); > } very thoughtful, nice > =20 [snip] Reviewed-by: Dominik Rusovac