public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager
@ 2021-12-13  7:43 Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

Hi,

this is a proof of concept to implement resource-aware HA.

The current implementation is really basic:
it simply balances the number of services on each node.

I have had some real production cases where a node fails, and the restarted VMs
impact other nodes because of too much CPU/RAM usage.


This new implementation uses a best-fit vector bin-packing heuristic with constraint support.


- We compute node memory/CPU and VM memory/CPU average stats over the last 20 minutes

For each resource:
- First, we order services pending in the recovery state by memory, then by CPU usage.
  Memory is more important here, because a VM can't start if the target node doesn't have enough memory

- Then, we check the possible target nodes' constraints (storage availability, node has enough CPU/RAM, node has enough cores, ...).
  (This could be extended with other constraints like VM affinity/anti-affinity, CPU compatibility, ...)

- Then we compute a node weight as the euclidean distance between the VM's CPU/RAM usage vector and the node's available-resource vector.
  We then choose the node with the lowest euclidean distance weight.
  (Example: if a VM uses 1 GB RAM/1% CPU, node1 has 2 GB RAM/2% CPU free, and node2 has 4 GB RAM/4% CPU free, node1 will be chosen because it is the closest to the VM's usage.)
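  (Concretely: sqrt((2-1)^2 + (2-1)^2) = ~1.41 for node1 vs sqrt((4-1)^2 + (4-1)^2) = ~4.24 for node2.)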

- We add the recovered VM's CPU/RAM to the target node's stats. (This is only a best-effort estimation, as the VM start is async on the target LRM and could fail, ...)
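
A rough sketch of the weight computation, simplified from the actual patch below
(same vectors and percent/GB rounding as add_node_prio):

    use POSIX qw(ceil);

    # weight = euclidean distance between the vm usage vector and the node
    # free-resource vector, on a (cpu percent, memory GB) plane;
    # lower weight = closer fit
    sub node_weight {
        my ($vm_stats, $node_stats) = @_;
        my @vec_vm = (ceil($vm_stats->{totalcpu}), ceil($vm_stats->{mem} / 1024**3));
        my @vec_node = (ceil($node_stats->{totalfreecpu}), ceil($node_stats->{freemem} / 1024**3));
        my $sum = 0;
        $sum += ($vec_node[$_] - $vec_vm[$_])**2 for 0 .. $#vec_vm;
        return sqrt($sum);
    }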


I have kept the HA group node priority and the other ordering,
so this doesn't break the current tests, and we can easily add a datacenter option to enable/disable it

It would be easy to later implement some kind of automatic VM migration when a node uses too much CPU/RAM,
reusing the same node selection algorithm

I have added a basic test; I'll add more tests later if this patch series is OK for you.



Some good literature about heuristics:

Microsoft Hyper-V implementation:
 - http://kunaltalwar.org/papers/VBPacking.pdf
 - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
Variable size vector bin packing heuristics:
 - https://hal.archives-ouvertes.fr/hal-00868016v2/document


Alexandre Derumier (3):
  add ressource awareness manager
  tests: add support for ressources
  add test-basic0

 src/PVE/HA/Env.pm                    |  24 +++
 src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
 src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
 src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
 src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
 src/test/test-basic0/README          |   1 +
 src/test/test-basic0/cmdlist         |   4 +
 src/test/test-basic0/hardware_status |   5 +
 src/test/test-basic0/log.expect      |  52 ++++++
 src/test/test-basic0/manager_status  |   1 +
 src/test/test-basic0/node_stats      |   5 +
 src/test/test-basic0/service_config  |   5 +
 src/test/test-basic0/service_stats   |   5 +
 13 files changed, 528 insertions(+), 7 deletions(-)
 create mode 100644 src/test/test-basic0/README
 create mode 100644 src/test/test-basic0/cmdlist
 create mode 100644 src/test/test-basic0/hardware_status
 create mode 100644 src/test/test-basic0/log.expect
 create mode 100644 src/test/test-basic0/manager_status
 create mode 100644 src/test/test-basic0/node_stats
 create mode 100644 src/test/test-basic0/service_config
 create mode 100644 src/test/test-basic0/service_stats

-- 
2.30.2




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager
  2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
@ 2021-12-13  7:43 ` Alexandre Derumier
  2021-12-13 10:04   ` Thomas Lamprecht
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources Alexandre Derumier
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

---
 src/PVE/HA/Env.pm         |  24 ++++
 src/PVE/HA/Env/PVE2.pm    |  90 ++++++++++++++
 src/PVE/HA/Manager.pm     | 246 ++++++++++++++++++++++++++++++++++++--
 src/PVE/HA/Sim/TestEnv.pm |  27 +++++
 4 files changed, 380 insertions(+), 7 deletions(-)

diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
index ac569a9..73b6407 100644
--- a/src/PVE/HA/Env.pm
+++ b/src/PVE/HA/Env.pm
@@ -269,4 +269,28 @@ sub get_ha_settings {
     return $self->{plug}->get_ha_settings();
 }
 
+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+
+    return $self->{plug}->get_node_rrd_stats($node);
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    return $self->{plug}->get_vm_rrd_stats($vmid, $percentile);
+}
+
+sub read_vm_config {
+    my ($self, $vmid) = @_;
+
+    return $self->{plug}->read_vm_config($vmid);
+}
+
+sub read_ct_config {
+    my ($self, $vmid) = @_;
+
+    return $self->{plug}->read_ct_config($vmid);
+}
+
 1;
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index 5e0a683..2e1585c 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -9,9 +9,14 @@ use IO::Socket::UNIX;
 use PVE::SafeSyslog;
 use PVE::Tools;
 use PVE::Cluster qw(cfs_register_file cfs_read_file cfs_write_file cfs_lock_file);
+use PVE::Cluster;
 use PVE::DataCenterConfig;
 use PVE::INotify;
 use PVE::RPCEnvironment;
+use PVE::API2Tools;
+use PVE::QemuConfig;
+use PVE::LXC::Config;
+use RRDs;
 
 use PVE::HA::Tools ':exit_codes';
 use PVE::HA::Env;
@@ -459,4 +464,89 @@ sub get_ha_settings {
     return $datacenterconfig->{ha};
 }
 
+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+
+    my $rrd = PVE::Cluster::rrd_dump();
+    my $members = PVE::Cluster::get_members();
+
+    my $stats = PVE::API2Tools::extract_node_stats($node, $members, $rrd);
+
+    return $stats;
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    my $rrdname = "pve2-vm/$vmid";
+    my $rrddir = "/var/lib/rrdcached/db";
+
+    my $rrd = "$rrddir/$rrdname";
+
+    my $cf = "AVERAGE";
+
+    my $reso = 60;
+    my $ctime  = $reso*int(time()/$reso);
+
+    #last 20 minutes
+    my $req_start = $ctime - $reso*20;
+    my $req_end = $ctime - $reso*1;
+
+    my @args = (
+        "-s" => $req_start,
+        "-e" => $req_end,
+        "-r" => $reso,
+        );
+
+    my $socket = "/var/run/rrdcached.sock";
+    push @args, "--daemon" => "unix:$socket" if -S $socket;
+
+    my ($start, $step, $names, $data) = RRDs::fetch($rrd, $cf, @args);
+
+    my @cpu = ();
+    my @mem = ();
+    my @maxmem = ();
+    my @maxcpu = ();
+
+    foreach my $rec (@$data) {
+        my $maxcpu = @$rec[0] || 0;
+        my $cpu = @$rec[1] || 0;
+        my $maxmem = @$rec[2] || 0;
+        my $mem = @$rec[3] || 0;
+        #skip zero values if vm is down
+        push @cpu, $cpu*$maxcpu if $cpu > 0;
+        push @mem, $mem if $mem > 0;
+        push @maxcpu, $maxcpu if $maxcpu > 0;
+        push @maxmem, $maxmem if $maxmem > 0;
+    }
+
+    my $stats = {};
+
+    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
+    $stats->{mem} = percentile($percentile, \@mem) || 0;
+    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
+    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;
+
+    return $stats;
+}
+
+sub percentile {
+    my ($p, $aref) = @_;
+    my $percentile = int($p * $#{$aref}/100);
+    return (sort { $a <=> $b } @$aref)[$percentile]; # numeric sort, not string sort
+}
+
+sub read_vm_config {
+    my ($self, $vmid) = @_;
+
+    return PVE::QemuConfig->load_config($vmid);
+}
+
+sub read_ct_config {
+    my ($self, $vmid) = @_;
+
+    return PVE::LXC::Config->load_config($vmid);
+}
+
 1;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1c66b43..ae5fbcb 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1,8 +1,13 @@
+
 package PVE::HA::Manager;
 
 use strict;
 use warnings;
 use Digest::MD5 qw(md5_base64);
+use RRDs;
+use POSIX qw/ceil/;
+use PVE::API2Tools;
+use PVE::Storage;
 
 use PVE::Tools;
 use PVE::HA::Tools ':exit_codes';
@@ -394,8 +399,16 @@ sub manage {
 	my $repeat = 0;
 
 	$self->recompute_online_node_usage();
+	$self->recompute_online_node_stats();
 
-	foreach my $sid (sort keys %$ss) {
+	$self->get_service_stats($ss);
+
+	foreach my $sid (
+		#ordering vm by size, bigger mem first then bigger cpu
+		#could be improved with bubblesearch heuristic
+		#https://www.cs.tufts.edu/~nr/cs257/archive/michael-mitzenmacher/bubblesearch.pdf
+		sort { $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg} || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround} || $ss->{$a}->{type} cmp $ss->{$b}->{type}}
+		keys %$ss) {
 	    my $sd = $ss->{$sid};
 	    my $cd = $sc->{$sid} || { state => 'disabled' };
 
@@ -802,12 +815,8 @@ sub next_state_recovery {
 
     $self->recompute_online_node_usage(); # we want the most current node state
 
-    my $recovery_node = select_service_node(
-	$self->{groups},
-	$self->{online_node_usage},
-	$cd,
-	$sd->{node},
-    );
+    my $storecfg = PVE::Storage::config();
+    my $recovery_node = find_bestfit_node_target($haenv, $sid, $cd , $sd->{node}, $sd->{stats}, $self->{online_node_usage}, $self->{online_node_stats}, $self->{groups}, $storecfg);
 
     if ($recovery_node) {
 	my $msg = "recover service '$sid' from fenced node '$fenced_node' to node '$recovery_node'";
@@ -822,6 +831,14 @@ sub next_state_recovery {
 	$haenv->steal_service($sid, $sd->{node}, $recovery_node);
 	$self->{online_node_usage}->{$recovery_node}++;
 
+	#add vm cpu/mem to current node stats (this is an estimation based on last 20min vm stats)
+	my $node_stats = $self->{online_node_stats}->{$recovery_node}->{stats};
+	$node_stats->{totalcpu} += $sd->{stats}->{totalcpu};
+	$node_stats->{mem} += $sd->{stats}->{mem};
+	$node_stats->{totalfreecpu} = (100 * $node_stats->{maxcpu}) - $node_stats->{totalcpu};
+	$node_stats->{freemem} =  $node_stats->{maxmem} - $node_stats->{mem};
+
+
 	# NOTE: $sd *is normally read-only*, fencing is the exception
 	$cd->{node} = $sd->{node} = $recovery_node;
 	my $new_state = ($cd->{state} eq 'started') ? 'started' : 'request_stop';
@@ -839,4 +856,219 @@ sub next_state_recovery {
     }
 }
 
+
+sub dotprod {
+    my($vec_a, $vec_b, $mode) = @_;
+    die "they must have the same size\n" unless @$vec_a == @$vec_b;
+    $mode = "" if !$mode;
+    my $sum = 0;
+    my $norm_a = 0;
+    my $norm_b = 0;
+
+    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
+	my $a = @{$vec_a}[$i];
+	my $b = @{$vec_b}[$i];
+
+	$sum += $a * $b;
+	$norm_a += $a * $a;
+	$norm_b += $b * $b;
+    }
+
+    if($mode eq 'normR') {
+	return $sum / (sqrt($norm_a) * sqrt($norm_b))
+    } elsif ($mode eq 'normC') {
+	return $sum / $norm_b;
+    }
+    return $sum;
+}
+
+sub euclidean_distance {
+    my($vec_a, $vec_b) = @_;
+
+    my $sum = 0;
+
+    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
+        my $a = @{$vec_a}[$i];
+        my $b = @{$vec_b}[$i];
+        $sum += ($b - $a)**2;
+    }
+
+    return sqrt($sum);
+}
+
+sub find_bestfit_node_target {
+    my($haenv, $sid, $cd, $nodename, $vm_stats, $online_node_usage, $online_nodes, $groups, $storecfg) = @_;
+
+    my (undef, $vmid) = split(/:/, $sid);
+
+    my $hagroup = get_service_group($groups, $online_nodes, $cd);
+    my ($pri_groups, $group_members_prio) = get_node_priority_groups($hagroup, $online_nodes);
+
+    my $target_nodes = {};
+    foreach my $nodename (keys %$online_nodes) {
+	my $node_stats = $online_nodes->{$nodename}->{stats};
+
+        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
+	next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);
+
+	#### ADD prio and euclidean_distance weight
+	$target_nodes->{$nodename} = add_node_prio($nodename, 'distance', $node_stats, $vm_stats, $group_members_prio, $online_node_usage);
+    }
+
+    #order by soft_constraint_prio, hagroup prio, weight (Best fit algorithm, lower distance first), number of services, and nodename
+    my @target_array = sort { 
+				$target_nodes->{$b}->{prio} <=> $target_nodes->{$a}->{prio} || 
+				$target_nodes->{$a}->{soft_constraint_prio} <=> $target_nodes->{$b}->{soft_constraint_prio} || 
+				$target_nodes->{$a}->{weight} <=> $target_nodes->{$b}->{weight} || 
+				$target_nodes->{$a}->{online_node_usage} <=> $target_nodes->{$b}->{online_node_usage} || 
+				$target_nodes->{$a}->{name} cmp $target_nodes->{$b}->{name}
+			} keys %$target_nodes;
+
+    my $target = $target_array[0];
+
+    return $target;
+}
+
+
+sub check_hard_constraints {
+    my ($haenv, $vmid, $cd, $node, $node_stats, $vm_stats, $storecfg, $group_members_prio) = @_;
+
+    #node needs to have a prio (restricted group)
+    return if !defined($group_members_prio->{$node});
+
+    #vm can't start if host has fewer cores
+    return if $node_stats->{maxcpu} < $vm_stats->{maxcpu};
+    #vm can't start if node doesn't have enough mem to handle vm max mem
+    return if $node_stats->{freemem} < $vm_stats->{maxmem};
+
+    #max 95% cpu/ram
+    my $mem_threshold = 0.95;
+    my $cpu_threshold = 0.95;
+
+    #check if target node has enough mem resources under threshold
+    return if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
+
+    #check if target node has enough cpu resources under threshold
+    return if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
+
+    #check storage availability
+    if ($cd->{type} eq 'vm') {
+	my $conf = undef;
+	eval { $conf = $haenv->read_vm_config($vmid); };
+	if (!$@) {
+	    eval { PVE::QemuServer::check_storage_availability($storecfg, $conf, $node) };
+	    return if $@;
+	}
+
+    } elsif ($cd->{type} eq 'ct') {
+	my $conf = undef;
+	eval { $conf = $haenv->read_ct_config($vmid); };
+	#fixme : check storage for lxc too
+    }
+
+    # fixme: check bridge availability
+    # fixme: vm: add a check for cpumodel compatibility ?
+    return 1;
+}
+
+sub compute_soft_constraints {
+    my ($node_stats, $vm_stats) = @_;
+
+    #try to reach 80% max cpu/ram
+    my $mem_threshold = 0.8;
+    my $cpu_threshold = 0.8;
+
+    my $count = 0;
+    #check if target node has enough mem resources under threshold
+    $count++ if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
+
+    #check if target node has enough cpu resources under threshold
+    $count++ if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
+
+    #fixme : add antiaffinity
+
+    return $count;
+}
+
+sub add_node_prio {
+    my ($nodename, $method, $node_stats, $vm_stats, $group_members_prio, $online_node_usage) = @_;
+
+    #rounded values to compute vectors (cpu 0-100 , mem 0G-->XG)
+    my $vm_totalcpu = ceil($vm_stats->{totalcpu});
+    my $vm_mem = ceil($vm_stats->{mem}/1024/1024/1024);
+    my $node_freecpu = ceil($node_stats->{totalfreecpu});
+    my $node_freemem = ceil($node_stats->{freemem}/1024/1024/1024);
+
+    my @vec_vm = ($vm_totalcpu, $vm_mem);  #? add network usage dimension ?
+    my @vec_node = ($node_freecpu, $node_freemem); #? add network usage dimension ?
+    my $weight = 0;
+    if ($method eq 'distance') {
+	$weight = euclidean_distance(\@vec_vm,\@vec_node);
+    } elsif ($method eq 'dotprod') {
+	$weight = dotprod(\@vec_vm,\@vec_node);
+    }
+
+    my $node = {};
+    $node->{weight} = $weight;
+    $node->{soft_constraint_prio} = compute_soft_constraints($node_stats, $vm_stats);
+    $node->{prio} = $group_members_prio->{$nodename};
+    $node->{online_node_usage} = $online_node_usage->{$nodename};
+    $node->{name} = $nodename;
+
+    return $node;
+}
+
+sub get_service_stats {
+   my ($self, $ss) = @_;
+
+	foreach my $sid (sort keys %$ss) {
+
+	    if ($sid =~ m/^(vm|ct|fa):(\d+)$/) {
+		$ss->{$sid}->{type} = $1;
+		$ss->{$sid}->{name} = $2;
+	    }
+
+	    my $stats = {};
+	    $stats->{cpu} = 0;
+	    $stats->{maxcpu} = 0;
+	    $stats->{mem} = 0;
+	    $stats->{maxmem} = 0;
+
+	    #avoid computing all stats, as currently we only support recovery
+	    if ($ss->{$sid}->{state} eq 'recovery') {
+
+		#get vm/ct stats averaged over the last 20 minutes
+		$stats = $self->{haenv}->get_vm_rrd_stats($ss->{$sid}->{name}, 95);
+	    }
+	    #fixme: windows vm fills memory with zeros at boot, so mem = maxmem
+
+	    #rounded values for ordering
+	    $stats->{totalcpuround} = ceil($stats->{cpu} * 100 * $stats->{maxcpu});
+	    $stats->{memg} = ceil( $stats->{mem} /1024 /1024 /1024);
+
+	    $ss->{$sid}->{stats} = $stats;
+	}
+}
+
+sub recompute_online_node_stats {
+    my ($self) = @_;
+
+    my $online_node_stats = {};
+    my $online_nodes = $self->{ns}->list_online_nodes();
+
+    foreach my $node (@$online_nodes) {
+	my $stats = $self->{haenv}->get_node_rrd_stats($node);
+        $stats->{cpu} = 0 if !defined($stats->{cpu});
+        $stats->{maxcpu} = 0 if !defined($stats->{maxcpu});
+        $stats->{mem} = 0 if !defined($stats->{mem});
+        $stats->{maxmem} = 0 if !defined($stats->{maxmem});
+        $stats->{totalcpu} = $stats->{cpu} * 100 * $stats->{maxcpu}; #how to handle different cpu model power ? bogomips ?
+        $stats->{totalfreecpu} = (100 * $stats->{maxcpu}) - $stats->{totalcpu};
+        $stats->{freemem} = $stats->{maxmem} - $stats->{mem};
+        $online_node_stats->{$node}->{stats} = $stats;
+    }
+
+    $self->{online_node_stats} = $online_node_stats;
+}
+
 1;
diff --git a/src/PVE/HA/Sim/TestEnv.pm b/src/PVE/HA/Sim/TestEnv.pm
index 6718d8c..08f27c7 100644
--- a/src/PVE/HA/Sim/TestEnv.pm
+++ b/src/PVE/HA/Sim/TestEnv.pm
@@ -118,4 +118,31 @@ sub get_max_workers {
     return 0;
 }
 
+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+
+    my $stats = {};
+    $stats->{cpu} = 0;
+    $stats->{maxcpu} = 0;
+    $stats->{mem} = 0;
+    $stats->{maxmem} = 0;
+
+    return $stats;
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    my $stats = {};
+
+    $stats->{cpu} = 0;
+    $stats->{mem} = 0;
+    $stats->{maxmem} = 0;
+    $stats->{maxcpu} = 0;
+    $stats->{cpu} = $stats->{cpu} * 100;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu};
+
+    return $stats;
+}
+
 1;
-- 
2.30.2




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources
  2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
@ 2021-12-13  7:43 ` Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0 Alexandre Derumier
  2021-12-13  9:02 ` [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Thomas Lamprecht
  3 siblings, 0 replies; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

---
 src/PVE/HA/Sim/Hardware.pm | 61 ++++++++++++++++++++++++++++++++++++++
 src/PVE/HA/Sim/TestEnv.pm  | 33 +++++++++++++--------
 2 files changed, 82 insertions(+), 12 deletions(-)

diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
index ba731e5..396e8bd 100644
--- a/src/PVE/HA/Sim/Hardware.pm
+++ b/src/PVE/HA/Sim/Hardware.pm
@@ -109,6 +109,22 @@ sub read_service_config {
     return $conf;
 }
 
+sub read_service_stats {
+    my ($self) = @_;
+
+    my $filename = "$self->{statusdir}/service_stats";
+    my $conf = PVE::HA::Tools::read_json_from_file($filename);
+    return $conf;
+}
+
+sub read_node_stats {
+    my ($self) = @_;
+
+    my $filename = "$self->{statusdir}/node_stats";
+    my $conf = PVE::HA::Tools::read_json_from_file($filename);
+    return $conf;
+}
+
 sub update_service_config {
     my ($self, $sid, $param) = @_;
 
@@ -132,6 +148,24 @@ sub write_service_config {
     return PVE::HA::Tools::write_json_to_file($filename, $conf);
 }
 
+sub write_service_stats {
+    my ($self, $conf) = @_;
+
+    $self->{service_stats} = $conf;
+
+    my $filename = "$self->{statusdir}/service_stats";
+    return PVE::HA::Tools::write_json_to_file($filename, $conf);
+}
+
+sub write_node_stats {
+    my ($self, $conf) = @_;
+
+    $self->{node_stats} = $conf;
+
+    my $filename = "$self->{statusdir}/node_stats";
+    return PVE::HA::Tools::write_json_to_file($filename, $conf);
+}
+
 sub read_fence_config {
     my ($self) = @_;
 
@@ -382,6 +416,31 @@ sub new {
 	$self->write_service_config($conf);
     }
 
+    if (-f "$testdir/service_stats") {
+	copy("$testdir/service_stats", "$statusdir/service_stats");
+    } else {
+	my $conf = {
+	    '101' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    '102' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    '103' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    '104' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    '105' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    '106' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	};
+	$self->write_service_stats($conf);
+    }
+
+    if (-f "$testdir/node_stats") {
+	copy("$testdir/node_stats", "$statusdir/node_stats");
+    } else {
+	my $conf = {
+	    'node1' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    'node2' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	    'node3' => { cpu => 0, maxcpu => 0, mem => 0, maxmem => 0 },
+	};
+	$self->write_node_stats($conf);
+    }
+
     if (-f "$testdir/hardware_status") {
 	copy("$testdir/hardware_status", "$statusdir/hardware_status") ||
 	    die "Copy failed: $!\n";
@@ -415,6 +474,8 @@ sub new {
     }
 
     $self->{service_config} = $self->read_service_config();
+    $self->{service_stats} = $self->read_service_stats();
+    $self->{node_stats} = $self->read_node_stats();
 
     return $self;
 }
diff --git a/src/PVE/HA/Sim/TestEnv.pm b/src/PVE/HA/Sim/TestEnv.pm
index 08f27c7..d5e85ad 100644
--- a/src/PVE/HA/Sim/TestEnv.pm
+++ b/src/PVE/HA/Sim/TestEnv.pm
@@ -121,11 +121,8 @@ sub get_max_workers {
 sub get_node_rrd_stats {
     my ($self, $node) = @_;
 
-    my $stats = {};
-    $stats->{cpu} = 0;
-    $stats->{maxcpu} = 0;
-    $stats->{mem} = 0;
-    $stats->{maxmem} = 0;
+    my $nodestats = $self->{hardware}->{node_stats};
+    my $stats = $nodestats->{$node};
 
     return $stats;
 }
@@ -133,16 +130,28 @@ sub get_node_rrd_stats {
 sub get_vm_rrd_stats {
     my ($self, $vmid, $percentile) = @_;
 
-    my $stats = {};
+    my $vmstats = $self->{hardware}->{service_stats};
+    my $stats = $vmstats->{$vmid};
 
-    $stats->{cpu} = 0;
-    $stats->{mem} = 0;
-    $stats->{maxmem} = 0;
-    $stats->{maxcpu} = 0;
-    $stats->{cpu} = $stats->{cpu} * 100;
-    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu};
+    $stats->{cpu} = $stats->{cpu} || 0;
+    $stats->{mem} = $stats->{mem} || 0;
+    $stats->{maxmem} = $stats->{maxmem} || 0;
+    $stats->{maxcpu} = $stats->{maxcpu} || 0;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;
 
     return $stats;
 }
 
+sub read_vm_config {
+    my ($self, $vmid) = @_;
+    
+    die "not yet implemented";
+}
+
+sub read_ct_config {
+    my ($self, $vmid) = @_;
+
+    die "not yet implemented";
+}
+
 1;
-- 
2.30.2




^ permalink raw reply	[flat|nested] 8+ messages in thread

* [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0
  2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources Alexandre Derumier
@ 2021-12-13  7:43 ` Alexandre Derumier
  2021-12-13  9:02 ` [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Thomas Lamprecht
  3 siblings, 0 replies; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

---
 src/test/test-basic0/README          |  1 +
 src/test/test-basic0/cmdlist         |  4 +++
 src/test/test-basic0/hardware_status |  5 +++
 src/test/test-basic0/log.expect      | 52 ++++++++++++++++++++++++++++
 src/test/test-basic0/manager_status  |  1 +
 src/test/test-basic0/node_stats      |  5 +++
 src/test/test-basic0/service_config  |  5 +++
 src/test/test-basic0/service_stats   |  5 +++
 8 files changed, 78 insertions(+)
 create mode 100644 src/test/test-basic0/README
 create mode 100644 src/test/test-basic0/cmdlist
 create mode 100644 src/test/test-basic0/hardware_status
 create mode 100644 src/test/test-basic0/log.expect
 create mode 100644 src/test/test-basic0/manager_status
 create mode 100644 src/test/test-basic0/node_stats
 create mode 100644 src/test/test-basic0/service_config
 create mode 100644 src/test/test-basic0/service_stats

diff --git a/src/test/test-basic0/README b/src/test/test-basic0/README
new file mode 100644
index 0000000..223c9dc
--- /dev/null
+++ b/src/test/test-basic0/README
@@ -0,0 +1 @@
+Test failover after single node network failure.
\ No newline at end of file
diff --git a/src/test/test-basic0/cmdlist b/src/test/test-basic0/cmdlist
new file mode 100644
index 0000000..eee0e40
--- /dev/null
+++ b/src/test/test-basic0/cmdlist
@@ -0,0 +1,4 @@
+[
+    [ "power node1 on", "power node2 on", "power node3 on"],
+    [ "network node3 off" ]
+]
diff --git a/src/test/test-basic0/hardware_status b/src/test/test-basic0/hardware_status
new file mode 100644
index 0000000..119b81c
--- /dev/null
+++ b/src/test/test-basic0/hardware_status
@@ -0,0 +1,5 @@
+{ 
+  "node1": { "power": "off", "network": "off" },
+  "node2": { "power": "off", "network": "off" },
+  "node3": { "power": "off", "network": "off" }
+}
\ No newline at end of file
diff --git a/src/test/test-basic0/log.expect b/src/test/test-basic0/log.expect
new file mode 100644
index 0000000..c61850d
--- /dev/null
+++ b/src/test/test-basic0/log.expect
@@ -0,0 +1,52 @@
+info      0     hardware: starting simulation
+info     20      cmdlist: execute power node1 on
+info     20    node1/crm: status change startup => wait_for_quorum
+info     20    node1/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node2 on
+info     20    node2/crm: status change startup => wait_for_quorum
+info     20    node2/lrm: status change startup => wait_for_agent_lock
+info     20      cmdlist: execute power node3 on
+info     20    node3/crm: status change startup => wait_for_quorum
+info     20    node3/lrm: status change startup => wait_for_agent_lock
+info     20    node1/crm: got lock 'ha_manager_lock'
+info     20    node1/crm: status change wait_for_quorum => master
+info     20    node1/crm: node 'node1': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node2': state changed from 'unknown' => 'online'
+info     20    node1/crm: node 'node3': state changed from 'unknown' => 'online'
+info     20    node1/crm: adding new service 'vm:101' on node 'node1'
+info     20    node1/crm: adding new service 'vm:102' on node 'node2'
+info     20    node1/crm: adding new service 'vm:103' on node 'node3'
+info     21    node1/lrm: got lock 'ha_agent_node1_lock'
+info     21    node1/lrm: status change wait_for_agent_lock => active
+info     21    node1/lrm: starting service vm:101
+info     21    node1/lrm: service status vm:101 started
+info     22    node2/crm: status change wait_for_quorum => slave
+info     23    node2/lrm: got lock 'ha_agent_node2_lock'
+info     23    node2/lrm: status change wait_for_agent_lock => active
+info     24    node3/crm: status change wait_for_quorum => slave
+info     25    node3/lrm: got lock 'ha_agent_node3_lock'
+info     25    node3/lrm: status change wait_for_agent_lock => active
+info     25    node3/lrm: starting service vm:103
+info     25    node3/lrm: service status vm:103 started
+info     40    node1/crm: service 'vm:102': state changed from 'request_stop' to 'stopped'
+info    120      cmdlist: execute network node3 off
+info    120    node1/crm: node 'node3': state changed from 'online' => 'unknown'
+info    124    node3/crm: status change slave => wait_for_quorum
+info    125    node3/lrm: status change active => lost_agent_lock
+info    160    node1/crm: service 'vm:103': state changed from 'started' to 'fence'
+info    160    node1/crm: node 'node3': state changed from 'unknown' => 'fence'
+emai    160    node1/crm: FENCE: Try to fence node 'node3'
+info    166     watchdog: execute power node3 off
+info    165    node3/crm: killed by poweroff
+info    166    node3/lrm: killed by poweroff
+info    166     hardware: server 'node3' stopped by poweroff (watchdog)
+info    240    node1/crm: got lock 'ha_agent_node3_lock'
+info    240    node1/crm: fencing: acknowledged - got agent lock for node 'node3'
+info    240    node1/crm: node 'node3': state changed from 'fence' => 'unknown'
+emai    240    node1/crm: SUCCEED: fencing: acknowledged - got agent lock for node 'node3'
+info    240    node1/crm: service 'vm:103': state changed from 'fence' to 'recovery'
+info    240    node1/crm: recover service 'vm:103' from fenced node 'node3' to node 'node2'
+info    240    node1/crm: service 'vm:103': state changed from 'recovery' to 'started'  (node = node2)
+info    243    node2/lrm: starting service vm:103
+info    243    node2/lrm: service status vm:103 started
+info    720     hardware: exit simulation - done
diff --git a/src/test/test-basic0/manager_status b/src/test/test-basic0/manager_status
new file mode 100644
index 0000000..9e26dfe
--- /dev/null
+++ b/src/test/test-basic0/manager_status
@@ -0,0 +1 @@
+{}
\ No newline at end of file
diff --git a/src/test/test-basic0/node_stats b/src/test/test-basic0/node_stats
new file mode 100644
index 0000000..2e20c53
--- /dev/null
+++ b/src/test/test-basic0/node_stats
@@ -0,0 +1,5 @@
+{ 
+  "node1": { "cpu": 0.1, "maxcpu": 1,"mem": 30737418240,"maxmem": 107374182400 },
+  "node2": { "cpu": 0.1, "maxcpu": 1,"mem": 60737418240,"maxmem": 107374182400 },
+  "node3": { "cpu": 1,   "maxcpu": 1,"mem": 10737418240,"maxmem": 107374182400 }
+}
\ No newline at end of file
diff --git a/src/test/test-basic0/service_config b/src/test/test-basic0/service_config
new file mode 100644
index 0000000..0e05ab4
--- /dev/null
+++ b/src/test/test-basic0/service_config
@@ -0,0 +1,5 @@
+{
+    "vm:101": { "node": "node1", "state": "enabled" },
+    "vm:102": { "node": "node2" },
+    "vm:103": { "node": "node3", "state": "enabled" }
+}
\ No newline at end of file
diff --git a/src/test/test-basic0/service_stats b/src/test/test-basic0/service_stats
new file mode 100644
index 0000000..52f506d
--- /dev/null
+++ b/src/test/test-basic0/service_stats
@@ -0,0 +1,5 @@
+{ 
+  "101": { "cpu": 0.1, "maxcpu": 1,"mem": 1073741824,"maxmem": 1073741824 },
+  "102": { "cpu": 0.1, "maxcpu": 1,"mem": 1073741824,"maxmem": 1073741824 },
+  "103": { "cpu": 0.1, "maxcpu": 1,"mem": 1073741824,"maxmem": 1073741824 }
+}
\ No newline at end of file
-- 
2.30.2




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager
  2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
                   ` (2 preceding siblings ...)
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0 Alexandre Derumier
@ 2021-12-13  9:02 ` Thomas Lamprecht
  3 siblings, 0 replies; 8+ messages in thread
From: Thomas Lamprecht @ 2021-12-13  9:02 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre Derumier

Hi,

On 13.12.21 08:43, Alexandre Derumier wrote:
> Hi,
> 
> this is a proof of concept to implement ressource aware HA.

nice! I'll try to give it a quick view now so that I don't stall you too much
on this long-wished feature.

> The current implementation is really basic:
> it simply balances the number of services on each node.
> 
> I have had some real production cases where a node fails, and the restarted VMs
> impact other nodes because of too much CPU/RAM usage.
> 
> This new implementation uses a best-fit vector bin-packing heuristic with constraint support.
> 
> 
> - We compute node memory/CPU and VM memory/CPU average stats over the last 20 minutes
> 
> For each resource:
> - First, we order services pending in the recovery state by memory, then by CPU usage.
>   Memory is more important here, because a VM can't start if the target node doesn't have enough memory

agreed

> 
> - Then, we check the possible target nodes' constraints (storage availability, node has enough CPU/RAM, node has enough cores, ...).
>   (This could be extended with other constraints like VM affinity/anti-affinity, CPU compatibility, ...)
> 
> - Then we compute a node weight as the euclidean distance between the VM's CPU/RAM usage vector and the node's available-resource vector.
>   We then choose the node with the lowest euclidean distance weight.
>   (Example: if a VM uses 1 GB RAM/1% CPU, node1 has 2 GB RAM/2% CPU free, and node2 has 4 GB RAM/4% CPU free, node1 will be chosen because it is the closest to the VM's usage.)

sounds like an OK approach to me, I had relatively similar in mind

> 
> - We add the recovered VM's CPU/RAM to the target node's stats. (This is only a best-effort estimation, as the VM start is async on the target LRM and could fail, ...)
> 
> 
> I have kept the HA group node priority and the other ordering,
> so this doesn't break the current tests

that is great, the regression tests from HA are among the best we have to
test and simulate behavior, so keeping those unchanged can give quite a bit
of confidence in any implementation. Albeit with your change it's mostly because
it's side-stepping the balancer, as no usage data is there?

IMO it would be good to have most tests such that they can get affected by
the balancer, at least if we make it opt-out

> and we can easily add a datacenter option to enable/disable it

As a starter we could also only do the compute-node-by-resource-usage on recovery
and first-start transition, as especially for the latter it's quite important to
get the service recovered to a node with a low(er) load to avoid a domino effect.

Doing re-computation then for started VMs would be easy to add once we're sure
the algorithm works out.

But yeah, for some admins it would surely be welcomed to make it configurable, like:

[ ] move to lowest used node on start and recovery of service
[ ] auto-balance started services periodically

> 
> It would be easy to later implement some kind of automatic VM migration when a node uses too much CPU/RAM,
> reusing the same node selection algorithm
> 
> I have added a basic test; I'll add more tests later if this patch series is OK for you.

I'd add commands to sim_hardware_cmd for simulating cpu/memory increase,
it's nicer to have that controllable by the cmd list.

For the test system it could also be interesting if we could annotate the
services with some basic resource usage, e.g. memory and core count, and possibly
also some low (0.33), mid (0.66) and high (1.0) load-factor (that is controllable
by command); that could help to simulate reality while keeping it somewhat simple.
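
For instance, a hypothetical cmdlist entry could look like this (the command name
is just a placeholder, nothing like it is implemented yet):

    [ "service vm:103 load 0.66" ]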

> Some good literature about heuristics:
> 
> Microsoft Hyper-V implementation:
>  - http://kunaltalwar.org/papers/VBPacking.pdf
>  - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
> Variable size vector bin packing heuristics:
>  - https://hal.archives-ouvertes.fr/hal-00868016v2/document
> 
> 
> Alexandre Derumier (3):
>   add ressource awareness manager
>   tests: add support for ressources
>   add test-basic0
> 
>  src/PVE/HA/Env.pm                    |  24 +++
>  src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
>  src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
>  src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
>  src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
>  src/test/test-basic0/README          |   1 +
>  src/test/test-basic0/cmdlist         |   4 +
>  src/test/test-basic0/hardware_status |   5 +
>  src/test/test-basic0/log.expect      |  52 ++++++
>  src/test/test-basic0/manager_status  |   1 +
>  src/test/test-basic0/node_stats      |   5 +
>  src/test/test-basic0/service_config  |   5 +
>  src/test/test-basic0/service_stats   |   5 +
>  13 files changed, 528 insertions(+), 7 deletions(-)
>  create mode 100644 src/test/test-basic0/README
>  create mode 100644 src/test/test-basic0/cmdlist
>  create mode 100644 src/test/test-basic0/hardware_status
>  create mode 100644 src/test/test-basic0/log.expect
>  create mode 100644 src/test/test-basic0/manager_status
>  create mode 100644 src/test/test-basic0/node_stats
>  create mode 100644 src/test/test-basic0/service_config
>  create mode 100644 src/test/test-basic0/service_stats
> 





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
@ 2021-12-13 10:04   ` Thomas Lamprecht
  2021-12-13 10:58     ` DERUMIER, Alexandre
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Lamprecht @ 2021-12-13 10:04 UTC (permalink / raw)
  To: Proxmox VE development discussion, Alexandre Derumier

some parts of your cover letter (basic approach/decisions) should go here so that
they can be found more easily when working with git.

Some comments inline; disclaimer: I pretty much read it top to bottom from here, did not
try out too much, and I may just have overlooked something.

On 13.12.21 08:43, Alexandre Derumier wrote:
> ---
>  src/PVE/HA/Env.pm         |  24 ++++
>  src/PVE/HA/Env/PVE2.pm    |  90 ++++++++++++++
>  src/PVE/HA/Manager.pm     | 246 ++++++++++++++++++++++++++++++++++++--
>  src/PVE/HA/Sim/TestEnv.pm |  27 +++++
>  4 files changed, 380 insertions(+), 7 deletions(-)
> 
> diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
> index ac569a9..73b6407 100644
> --- a/src/PVE/HA/Env.pm
> +++ b/src/PVE/HA/Env.pm
> @@ -269,4 +269,28 @@ sub get_ha_settings {
>      return $self->{plug}->get_ha_settings();
>  }
>  
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +
> +    return $self->{plug}->get_node_rrd_stats($node);
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    return $self->{plug}->get_vm_rrd_stats($vmid, $percentile);
> +}
> +
> +sub read_vm_config {
> +    my ($self, $vmid) = @_;
> +
> +    return $self->{plug}->read_vm_config($vmid);

I'd like to avoid returning the whole guest config but rather an abstract format
that only sets some keys like "memory" and "cpu" that we actually require, that'd
make it more defined and easier to see what possible impact factors are - the test
or sim system is also slightly easier to add that way.

As that's the basic job of the Env plugin system, transform the mixed/complex
PVE metrics and controls in some simpler abstraction to use internally, making
it easy to simulate/test them.
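
E.g., a hypothetical sketch (the name and exact key set are just placeholders to
illustrate the abstraction) of what the PVE2 implementation could return:

    sub get_guest_resources {
        my ($self, $vmid) = @_;
        my $conf = PVE::QemuConfig->load_config($vmid);
        # abstract view: only the keys the manager actually needs
        return {
            memory => $conf->{memory}, # MiB
            cores => $conf->{cores},
        };
    }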


> +}
> +
> +sub read_ct_config {
> +    my ($self, $vmid) = @_;
> +
> +    return $self->{plug}->read_ct_config($vmid);
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
> index 5e0a683..2e1585c 100644
> --- a/src/PVE/HA/Env/PVE2.pm
> +++ b/src/PVE/HA/Env/PVE2.pm
> @@ -9,9 +9,14 @@ use IO::Socket::UNIX;
>  use PVE::SafeSyslog;
>  use PVE::Tools;
>  use PVE::Cluster qw(cfs_register_file cfs_read_file cfs_write_file cfs_lock_file);
> +use PVE::Cluster;

Cluster is already imported a line above

>  use PVE::DataCenterConfig;
>  use PVE::INotify;
>  use PVE::RPCEnvironment;
> +use PVE::API2Tools;
> +use PVE::QemuConfig;
> +use PVE::LXC::Config;
> +use RRDs;
>  
>  use PVE::HA::Tools ':exit_codes';
>  use PVE::HA::Env;
> @@ -459,4 +464,89 @@ sub get_ha_settings {
>      return $datacenterconfig->{ha};
>  }
>  
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +
> +    my $rrd = PVE::Cluster::rrd_dump();
> +    my $members = PVE::Cluster::get_members();
> +
> +    my $stats = PVE::API2Tools::extract_node_stats($node, $members, $rrd);
> +
> +    return $stats;
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    my $rrdname = "pve2-vm/$vmid";
> +    my $rrddir = "/var/lib/rrdcached/db";
> +
> +    my $rrd = "$rrddir/$rrdname";
> +
> +    my $cf = "AVERAGE";
> +
> +    my $reso = 60;
> +    my $ctime  = $reso*int(time()/$reso);
> +
> +    #last 20 minutes
> +    my $req_start = $ctime - $reso*20;
> +    my $req_end = $ctime - $reso*1;
> +
> +    my @args = (
> +        "-s" => $req_start,
> +        "-e" => $req_end,
> +        "-r" => $reso,
> +        );
> +
> +    my $socket = "/var/run/rrdcached.sock";
> +    push @args, "--daemon" => "unix:$socket" if -S $socket;
> +
> +    my ($start, $step, $names, $data) = RRDs::fetch($rrd, $cf, @args);

semi-related, Dietmar and I talked about adapting the RRD format and higher resolution
from PBS also for PVE, dropping the dependency on rrdcached, but PVE with the cluster
and broadcasting is quite a bit more complex compared to PBS, so the effort stalled.

If we make more use of those metrics here it would add further coupling to the existing
interface so maybe we'd rethink that through - but nothing that needs to stall this
completely, changing RRD would be better done at a major release anyway where we
can handle this somewhat cleanly.

> +
> +    my @cpu = ();
> +    my @mem = ();
> +    my @maxmem = ();
> +    my @maxcpu = ();
> +
> +    foreach my $rec (@$data) {
> +        my $maxcpu = @$rec[0] || 0;
> +        my $cpu = @$rec[1] || 0;
> +        my $maxmem = @$rec[2] || 0;
> +        my $mem = @$rec[3] || 0;
> +        #skip zero values if vm is down
> +        push @cpu, $cpu*$maxcpu if $cpu > 0;
> +        push @mem, $mem if $mem > 0;
> +        push @maxcpu, $maxcpu if $maxcpu > 0;
> +        push @maxmem, $maxmem if $maxmem > 0;
> +    }
> +
> +    my $stats = {};


maybe create this nested

my $stats = {
  cpu => {
    avg => percentile($percentile, \@cpu) || 0,
    max => percentile($percentile, \@maxcpu) || 0,
    total => ...,
  },
};

> +    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
> +    $stats->{mem} = percentile($percentile, \@mem) || 0;
> +    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
> +    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
> +    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;

what is cpu/maxcpu exactly, the multiplication of both strikes me as a bit weird?


> +
> +    return $stats;
> +}
> +
> +sub percentile {
> +    my ($p, $aref) = @_;
> +    my $percentile = int($p * $#{$aref}/100);
> +    return (sort { $a <=> $b } @$aref)[$percentile]; # numeric sort, not string sort
> +}
> +
> +sub read_vm_config {
> +    my ($self, $vmid) = @_;
> +
> +    return PVE::QemuConfig->load_config($vmid);
> +}
> +
> +sub read_ct_config {
> +    my ($self, $vmid) = @_;
> +
> +    return PVE::LXC::Config->load_config($vmid);
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 1c66b43..ae5fbcb 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -1,8 +1,13 @@
> +
>  package PVE::HA::Manager;
>  
>  use strict;
>  use warnings;
>  use Digest::MD5 qw(md5_base64);
> +use RRDs;
> +use POSIX qw/ceil/;
> +use PVE::API2Tools;
> +use PVE::Storage;

The above two are not necessarily available in the test/sim environment; whatever needs
them needs to be encapsulated in the PVE2 Env module.

>  
>  use PVE::Tools;
>  use PVE::HA::Tools ':exit_codes';
> @@ -394,8 +399,16 @@ sub manage {
>  	my $repeat = 0;
>  
>  	$self->recompute_online_node_usage();
> +	$self->recompute_online_node_stats();
>  
> -	foreach my $sid (sort keys %$ss) {
> +	$self->get_service_stats($ss);
> +
> +	foreach my $sid (
> +		#ordering vm by size, bigger mem first then bigger cpu
> +		#could be improved with bubblesearch heuristic
> +		#https://www.cs.tufts.edu/~nr/cs257/archive/michael-mitzenmacher/bubblesearch.pdf
> +		sort { $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg} || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround} || $ss->{$a}->{type} cmp $ss->{$b}->{type}}
> +		keys %$ss) {

maybe we can move that sort into a sub with some line breaks?
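
E.g. something like (untested sketch, same ordering as in the patch):

    sub sort_services_by_size {
        my ($ss) = @_;
        # order services for the manage loop: memory (GB), then cpu, then type
        return sort {
            $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg}
            || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround}
            || $ss->{$a}->{type} cmp $ss->{$b}->{type}
        } keys %$ss;
    }

with the loop then becoming: foreach my $sid (sort_services_by_size($ss)) { ... }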

>  	    my $sd = $ss->{$sid};
>  	    my $cd = $sc->{$sid} || { state => 'disabled' };
>  
> @@ -802,12 +815,8 @@ sub next_state_recovery {
>  
>      $self->recompute_online_node_usage(); # we want the most current node state
>  
> -    my $recovery_node = select_service_node(
> -	$self->{groups},
> -	$self->{online_node_usage},
> -	$cd,
> -	$sd->{node},
> -    );
> +    my $storecfg = PVE::Storage::config();

that dependency cannot live here directly in manager; off the top of my head I'd suggest adding
a check_service_constraints($sid, $node) plugin method to env


> +    my $recovery_node = find_bestfit_node_target($haenv, $sid, $cd , $sd->{node}, $sd->{stats}, $self->{online_node_usage}, $self->{online_node_stats}, $self->{groups}, $storecfg);

maybe just pass $self to avoid most of those parameters, iow. make it a blessed method of this
module, something like:

$self->find_bestfit_node_target($sid, $cd , $sd->{node}, $sd->{stats});

seems to me like a good balance: avoid passing every helper module/hash explicitly again while keeping
it a bit decoupled from the blessed module info.

>  
>      if ($recovery_node) {
>  	my $msg = "recover service '$sid' from fenced node '$fenced_node' to node '$recovery_node'";
> @@ -822,6 +831,14 @@ sub next_state_recovery {
>  	$haenv->steal_service($sid, $sd->{node}, $recovery_node);
>  	$self->{online_node_usage}->{$recovery_node}++;
>  
> +	#add vm cpu/mem to current node stats (this is an estimation based on last 20min vm stats)
> +	my $node_stats = $self->{online_node_stats}->{$recovery_node}->{stats};
> +	$node_stats->{totalcpu} += $sd->{stats}->{totalcpu};
> +	$node_stats->{mem} += $sd->{stats}->{mem};
> +	$node_stats->{totalfreecpu} = (100 * $node_stats->{maxcpu}) - $node_stats->{totalcpu};
> +	$node_stats->{freemem} =  $node_stats->{maxmem} - $node_stats->{mem};



> +
> +
>  	# NOTE: $sd *is normally read-only*, fencing is the exception
>  	$cd->{node} = $sd->{node} = $recovery_node;
>  	my $new_state = ($cd->{state} eq 'started') ? 'started' : 'request_stop';
> @@ -839,4 +856,219 @@ sub next_state_recovery {
>      }
>  }
>  
> +
> +sub dotprod {
> +    my($vec_a, $vec_b, $mode) = @_;
> +    die "vectors must have the same size\n" unless @$vec_a == @$vec_b;
> +    $mode = "" if !$mode;
> +    my $sum = 0;
> +    my $norm_a = 0;
> +    my $norm_b = 0;
> +
> +    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
> +	my $a = @{$vec_a}[$i];
> +	my $b = @{$vec_b}[$i];
> +
> +	$sum += $a * $b;
> +	$norm_a += $a * $a;
> +	$norm_b += $b * $b;
> +    }
> +
> +    if($mode eq 'normR') {
> +	return $sum / (sqrt($norm_a) * sqrt($norm_b))
> +    } elsif ($mode eq 'normC') {
> +	return $sum / $norm_b;
> +    }
> +    return $sum;
> +}
> +
> +sub euclidean_distance {
> +    my($vec_a, $vec_b) = @_;
> +
> +    my $sum = 0;
> +
> +    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
> +        my $a = @{$vec_a}[$i];
> +        my $b = @{$vec_b}[$i];
> +        $sum += ($b - $a)**2;
> +    }
> +
> +    return sqrt($sum);
> +}

We can add a few helper methods to a new module in proxmox-perl-rs for above, could make
this quite a bit faster.

Also, above, and actually also below should live in a separate module in the future, but
I have nothing against keeping it here for development for now.

> +
> +sub find_bestfit_node_target {
> +    my($haenv, $sid, $cd, $nodename, $vm_stats, $online_node_usage, $online_nodes, $groups, $storecfg) = @_;
> +
> +    my (undef, $vmid) = split(/:/, $sid);

we have a helper for that IIRC

> +
> +    my $hagroup = get_service_group($groups, $online_nodes, $cd);
> +    my ($pri_groups, $group_members_prio) = get_node_priority_groups($hagroup, $online_nodes);
> +
> +    my $target_nodes = {};
> +    foreach my $nodename (keys %$online_nodes) {
> +	my $node_stats = $online_nodes->{$nodename}->{stats};
> +
> +        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
> +	next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);

I mean, most of those should be covered by the grouping already?

> +
> +	#### ADD prio and euclidean_distance weight
> +	$target_nodes->{$nodename} = add_node_prio($nodename, 'distance', $node_stats, $vm_stats, $group_members_prio, $online_node_usage);
> +    }
> +
> +    #order by soft_constraint_prio, hagroup prio, weight (Best fit algorithm, lower distance first), number of services, and nodename
> +    my @target_array = sort { 
> +				$target_nodes->{$b}->{prio} <=> $target_nodes->{$a}->{prio} || 
> +				$target_nodes->{$a}->{soft_constraint_prio} <=> $target_nodes->{$b}->{soft_constraint_prio} || 
> +				$target_nodes->{$a}->{weight} <=> $target_nodes->{$b}->{weight} || 
> +				$target_nodes->{$a}->{online_node_usage} <=> $target_nodes->{$b}->{online_node_usage} || 
> +				$target_nodes->{$a}->{name} cmp $target_nodes->{$b}->{name}
> +			} keys %$target_nodes;
> +
> +    my $target = $target_array[0];
> +
> +    return $target;
> +}
> +
> +
> +sub check_hard_constraints {
> +    my ($haenv, $vmid, $cd, $node, $node_stats, $vm_stats, $storecfg, $group_members_prio) = @_;
> +
> +    #node needs to have a prio (restricted group)
> +    return if !defined($group_members_prio->{$node});
> +
> +    #vm can't start if host has fewer cores
> +    return if $node_stats->{maxcpu} < $vm_stats->{maxcpu};
> +    #vm can't start if node doesn't have enough mem to handle vm max mem
> +    return if $node_stats->{freemem} < $vm_stats->{maxmem};
> +
> +    #max 95% cpu/ram
> +    my $mem_threshold = 0.95;
> +    my $cpu_threshold = 0.95;
> +
> +    #check if target node has enough mem resources under threshold
> +    return if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
> +
> +    #check if target node has enough cpu resources under threshold
> +    return if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
> +
> +    #check storage availability
> +    if ($cd->{type} eq 'vm') {
> +	my $conf = undef;
> +	eval { $conf = $haenv->read_vm_config($vmid); };
> +	if (!$@) {
> +	    eval { PVE::QemuServer::check_storage_availability($storecfg, $conf, $node) };

non-guarded pve-env call cannot stay here in manager

> +	    return if $@;
> +	}
> +
> +    } elsif ($cd->{type} eq 'ct') {
> +	my $conf = undef;
> +	eval { $conf = $haenv->read_ct_config($vmid); };
> +	#fixme : check storage for lxc too
> +    }
> +
> +    # fixme: check bridge availability
> +    # fixme: vm: add a check for cpumodel compatibility ?
> +    return 1;
> +}
> +
> +sub compute_soft_constraints {
> +    my ($node_stats, $vm_stats) = @_;
> +
> +    #try to reach 80% max cpu/ram
> +    my $mem_threshold = 0.8;
> +    my $cpu_threshold = 0.8;
> +
> +    my $count = 0;
> +    #check if target node has enough mem resources under threshold
> +    $count++ if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
> +
> +    #check if target node has enough cpu resources under threshold
> +    $count++ if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
> +
> +    #fixme : add antiaffinity
> +
> +    return $count;
> +}
> +
> +sub add_node_prio {
> +    my ($nodename, $method, $node_stats, $vm_stats, $group_members_prio, $online_node_usage) = @_;
> +
> +    #rounded values to compute vectors (cpu 0-100 , mem 0G-->XG)
> +    my $vm_totalcpu = ceil($vm_stats->{totalcpu});
> +    my $vm_mem = ceil($vm_stats->{mem}/1024/1024/1024);
> +    my $node_freecpu = ceil($node_stats->{totalfreecpu});
> +    my $node_freemem = ceil($node_stats->{freemem}/1024/1024/1024);
> +
> +    my @vec_vm = ($vm_totalcpu, $vm_mem);  #? add network usage dimension ?
> +    my @vec_node = ($node_freecpu, $node_freemem); #? add network usage dimension ?
> +    my $weight = 0;
> +    if ($method eq 'distance') {
> +	$weight = euclidean_distance(\@vec_vm,\@vec_node);
> +    } elsif ($method eq 'dotprod') {
> +	$weight = dotprod(\@vec_vm,\@vec_node);
> +    }
> +
> +    my $node = {};
> +    $node->{weight} = $weight;
> +    $node->{soft_constraint_prio} = compute_soft_constraints($node_stats, $vm_stats);
> +    $node->{prio} = $group_members_prio->{$nodename};
> +    $node->{online_node_usage} = $online_node_usage->{$nodename};
> +    $node->{name} = $nodename;
> +
> +    return $node;
> +}
> +
> +sub get_service_stats {
> +   my ($self, $ss) = @_;
> +
> +	foreach my $sid (sort keys %$ss) {
> +
> +	    if ($sid =~ m/^(vm|ct|fa):(\d+)$/) {

Could use PVE::HA::Config::parse_sid

> +		$ss->{$sid}->{type} = $1;
> +		$ss->{$sid}->{name} = $2;
> +	    }
> +
> +	    my $stats = {};
> +	    $stats->{cpu} = 0;
> +	    $stats->{maxcpu} = 0;
> +	    $stats->{mem} = 0;
> +	    $stats->{maxmem} = 0;

same here, maybe nested and (nit) I find it easier to read to directly declare the hash
in such cases:

my $stats = {
    cpu => ...
};

> +
> +	    #avoid computing all stats, as currently we only support recovery

ah okay, you already do it only on recovery - overlooked that when reading the cover
letter

> +	    if ($ss->{$sid}->{state} eq 'recovery') {
> +
> +		#get vm/ct stats averaged over the last 20 minutes
> +		$stats = $self->{haenv}->get_vm_rrd_stats($ss->{$sid}->{name}, 95);
> +	    }
> +	    #fixme: windows vm fills memory with zeros at boot, so mem = maxmem
> +
> +	    #rounded values for ordering

nit: ceil != round

> +	    $stats->{totalcpuround} = ceil($stats->{cpu} * 100 * $stats->{maxcpu});
> +	    $stats->{memg} = ceil( $stats->{mem} /1024 /1024 /1024);
> +
> +	    $ss->{$sid}->{stats} = $stats;
> +	}
> +}
> +
> +sub recompute_online_node_stats {
> +    my ($self) = @_;
> +
> +    my $online_node_stats = {};
> +    my $online_nodes = $self->{ns}->list_online_nodes();
> +
> +    foreach my $node (@$online_nodes) {
> +	my $stats = $self->{haenv}->get_node_rrd_stats($node);
> +        $stats->{cpu} = 0 if !defined($stats->{cpu});
> +        $stats->{maxcpu} = 0 if !defined($stats->{maxcpu});
> +        $stats->{mem} = 0 if !defined($stats->{mem});
> +        $stats->{maxmem} = 0 if !defined($stats->{maxmem});
> +        $stats->{totalcpu} = $stats->{cpu} * 100 * $stats->{maxcpu}; #how to handle different cpu model power ? bogomips ?

BogoMips is for a single core loop IIRC, so it'd be `cores x bogo-mips` I guess.

> +        $stats->{totalfreecpu} = (100 * $stats->{maxcpu}) - $stats->{totalcpu};
> +        $stats->{freemem} = $stats->{maxmem} - $stats->{mem};
> +        $online_node_stats->{$node}->{stats} = $stats;
> +    }
> +
> +    $self->{online_node_stats} = $online_node_stats;
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Sim/TestEnv.pm b/src/PVE/HA/Sim/TestEnv.pm
> index 6718d8c..08f27c7 100644
> --- a/src/PVE/HA/Sim/TestEnv.pm
> +++ b/src/PVE/HA/Sim/TestEnv.pm
> @@ -118,4 +118,31 @@ sub get_max_workers {
>      return 0;
>  }
>  
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +

It gets added in the next patch, but (else) it'd def. need a #FIXME comment ^^

That makes me think: from a patch/commit separation perspective it could be nice
to split this one into two,

1. add usage accounting for the specific recovery case
2. use that to actually find a low(er)-loaded node on recovery.

> +    my $stats = {};
> +    $stats->{cpu} = 0;
> +    $stats->{maxcpu} = 0;
> +    $stats->{mem} = 0;
> +    $stats->{maxmem} = 0;
> +
> +    return $stats;
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    my $stats = {};
> +    $stats->{cpu} = 0;
> +    $stats->{mem} = 0;
> +    $stats->{maxmem} = 0;
> +    $stats->{maxcpu} = 0;
> +    $stats->{cpu} = $stats->{cpu} * 100;
> +    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu};
> +
> +    return $stats;
> +}
> +
>  1;





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager
  2021-12-13 10:04   ` Thomas Lamprecht
@ 2021-12-13 10:58     ` DERUMIER, Alexandre
  2021-12-13 11:29       ` Thomas Lamprecht
  0 siblings, 1 reply; 8+ messages in thread
From: DERUMIER, Alexandre @ 2021-12-13 10:58 UTC (permalink / raw)
  To: pve-devel, aderumier

Hi Thomas,
Thanks for the fast review !

I'll rework the patches with your comments.
Do you prefer separate follow-up patches for each of your remarks for easier review, or can I send a v2 directly with the changes already merged?




> semi-related, Dietmar and I talked about adapting the RRD format and higher resolution
> from PBS also for PVE, dropping the dependency on rrdcached, but PVE with the cluster
> and broadcasting is quite a bit more complex compared to PBS, so the effort stalled.
> 
> If we make more use of those metrics here it would add further coupling to the existing
> interface so maybe we'd rethink that through - but nothing that needs to stall this
> completely, changing RRD would be better done at a major release anyway where we
> can handle this somewhat cleanly.

I think we had talked last year about a single RRD file per metric, but I'm not sure that's the best way
(given the number of files it could generate).
I would like to have other metrics in RRD, like cpu/mem pressure, or the true VM cgroup mem/cpu values.
I think we already stream them, but the RRD part is not done yet.


+    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
+    $stats->{mem} = percentile($percentile, \@mem) || 0;
+    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
+    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;

what is cpu/maxcpu exactly, the multiplication of both strikes me as a bit weird?

maxcpu is the number of cores, so totalcpu is the number of cores * current percent usage * 100.
(This is to be able to compare VMs/hosts with different numbers of cores.)
But indeed, maybe I'm wrong with my maths ;) I need to check the ordering and compare code a bit more.
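
A quick worked example of that intent, with made-up numbers:

# VM: maxcpu=4 cores at cpu=0.25 average load
#   totalcpu = 0.25 * 4 * 100 = 100        "core-percent"
# node: maxcpu=32 cores at cpu=0.10 load
#   totalcpu     = 0.10 * 100 * 32 = 320
#   totalfreecpu = 100 * 32 - 320  = 2880  -> easily fits the VM's 100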



+        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
+       next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);

I mean, most of those should be covered by the grouping already?

Do you mean manual grouping by the user? If yes, the point here is to get it done automatically (for storage availability, number of host cores, maybe vmbr availability, ...), because this is a pain to set up.
For example, currently, if you forget it, a VM with local storage could be migrated to another node by HA (with an error). Then the user doesn't know how to migrate it back to the original node.

But it's also checking dynamic values like available cpu/ram on the host.




On Monday, 13 December 2021 at 11:04 +0100, Thomas Lamprecht wrote:
some parts of your cover-letter (basic approach/decisions) should go here so that
it can be found more easily when working with git.

Some comments inline; disclaimer: I pretty much read it top to bottom from here, did
not try out too much, and I may just have overlooked something.

On 13.12.21 08:43, Alexandre Derumier wrote:
---
 src/PVE/HA/Env.pm         |  24 ++++
 src/PVE/HA/Env/PVE2.pm    |  90 ++++++++++++++
 src/PVE/HA/Manager.pm     | 246 ++++++++++++++++++++++++++++++++++++--
 src/PVE/HA/Sim/TestEnv.pm |  27 +++++
 4 files changed, 380 insertions(+), 7 deletions(-)

diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
index ac569a9..73b6407 100644
--- a/src/PVE/HA/Env.pm
+++ b/src/PVE/HA/Env.pm
@@ -269,4 +269,28 @@ sub get_ha_settings {
     return $self->{plug}->get_ha_settings();
 }

+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+
+    return $self->{plug}->get_node_rrd_stats($node);
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    return $self->{plug}->get_vm_rrd_stats($vmid, $percentile);
+}
+
+sub read_vm_config {
+    my ($self, $vmid) = @_;
+
+    return $self->{plug}->read_vm_config($vmid);

I'd like to avoid returning the whole guest config and rather use an abstract format
that only sets some keys like "memory" and "cpu" that we actually require; that'd
make it more defined and easier to see what the possible impact factors are - support
in the test/sim system is also slightly easier to add that way.

As that's the basic job of the Env plugin system, transform the mixed/complex
PVE metrics and controls in some simpler abstraction to use internally, making
it easy to simulate/test them.
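
Something like this, perhaps (a purely illustrative shape - the method and key
names are not from the patch):

sub read_service_usage_info {    # hypothetical abstract accessor
    my ($self, $sid) = @_;
    return {
        memory => 4096,  # configured memory in MiB
        cpu    => 4,     # configured cores
    };
}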


+}
+
+sub read_ct_config {
+    my ($self, $vmid) = @_;
+
+    return $self->{plug}->read_ct_config($vmid);
+}
+
 1;
diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
index 5e0a683..2e1585c 100644
--- a/src/PVE/HA/Env/PVE2.pm
+++ b/src/PVE/HA/Env/PVE2.pm
@@ -9,9 +9,14 @@ use IO::Socket::UNIX;
 use PVE::SafeSyslog;
 use PVE::Tools;
 use PVE::Cluster qw(cfs_register_file cfs_read_file cfs_write_file cfs_lock_file);
+use PVE::Cluster;

Cluster is already imported a line above

 use PVE::DataCenterConfig;
 use PVE::INotify;
 use PVE::RPCEnvironment;
+use PVE::API2Tools;
+use PVE::QemuConfig;
+use PVE::LXC::Config;
+use RRDs;

 use PVE::HA::Tools ':exit_codes';
 use PVE::HA::Env;
@@ -459,4 +464,89 @@ sub get_ha_settings {
     return $datacenterconfig->{ha};
 }

+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+
+    my $rrd = PVE::Cluster::rrd_dump();
+    my $members = PVE::Cluster::get_members();
+
+    my $stats = PVE::API2Tools::extract_node_stats($node, $members, $rrd);
+
+    return $stats;
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    my $rrdname = "pve2-vm/$vmid";
+    my $rrddir = "/var/lib/rrdcached/db";
+
+    my $rrd = "$rrddir/$rrdname";
+
+    my $cf = "AVERAGE";
+
+    my $reso = 60;
+    my $ctime  = $reso*int(time()/$reso);
+
+    #last 20minutes
+    my $req_start = $ctime - $reso*20;
+    my $req_end = $ctime - $reso*1;
+
+    my @args = (
+        "-s" => $req_start,
+        "-e" => $req_end,
+        "-r" => $reso,
+        );
+
+    my $socket = "/var/run/rrdcached.sock";
+    push @args, "--daemon" => "unix:$socket" if -S $socket;
+
+    my ($start, $step, $names, $data) = RRDs::fetch($rrd, $cf, @args);

semi-related, Dietmar and I talked about adapting the RRD format and higher resolution
from PBS also for PVE, dropping the dependency on rrdcached, but PVE with the cluster
and broadcasting is quite a bit more complex compared to PBS, so the effort stalled.

If we make more use of those metrics here it would add further coupling to the existing
interface so maybe we'd rethink that through - but nothing that needs to stall this
completely, changing RRD would be better done at a major release anyway where we
can handle this somewhat cleanly.

+
+    my @cpu = ();
+    my @mem = ();
+    my @maxmem = ();
+    my @maxcpu = ();
+
+    foreach my $rec (@$data) {
+        my $maxcpu = @$rec[0] || 0;
+        my $cpu = @$rec[1] || 0;
+        my $maxmem = @$rec[2] || 0;
+        my $mem = @$rec[3] || 0;
+        #skip zeros values if vm is down
+        push @cpu, $cpu*$maxcpu if $cpu > 0;
+        push @mem, $mem if $mem > 0;
+        push @maxcpu, $maxcpu if $maxcpu > 0;
+        push @maxmem, $maxmem if $maxmem > 0;
+    }
+
+    my $stats = {};


maybe create this nested

my $stats = {
  cpu => {
    avg => percentile($percentile, \@cpu) || 0,
    max => percentile($percentile, \@maxcpu) || 0,
    total => ...,
  },
};

+    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
+    $stats->{mem} = percentile($percentile, \@mem) || 0;
+    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
+    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;

what is cpu/maxcpu exactly, the multiplication of both strikes me as a bit weird?


+
+    return $stats;
+}
+
+sub percentile {
+    my ($p, $aref) = @_;
+    my $percentile = int($p * $#{$aref}/100);
+    return (sort @$aref)[$percentile];
+}
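
Side note: a plain `sort` compares as strings, so for numeric samples a numeric
comparator (and an empty-list guard) is likely wanted - a possible sketch:

sub percentile {
    my ($p, $aref) = @_;
    return undef if !@$aref;                 # guard against an empty sample set
    my @sorted = sort { $a <=> $b } @$aref;  # numeric, not lexical, ordering
    return $sorted[int($p * $#sorted / 100)];
}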
+
+sub read_vm_config {
+    my ($self, $vmid) = @_;
+
+    return PVE::QemuConfig->load_config($vmid);
+}
+
+sub read_ct_config {
+    my ($self, $vmid) = @_;
+
+    return PVE::LXC::Config->load_config($vmid);
+}
+
 1;
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index 1c66b43..ae5fbcb 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -1,8 +1,13 @@
+
 package PVE::HA::Manager;

 use strict;
 use warnings;
 use Digest::MD5 qw(md5_base64);
+use RRDs;
+use POSIX qw/ceil/;
+use PVE::API2Tools;
+use PVE::Storage;

The above two are not necessarily available in the test/sim environment; whatever needs
them has to be encapsulated in the PVE2 Env module.


 use PVE::Tools;
 use PVE::HA::Tools ':exit_codes';
@@ -394,8 +399,16 @@ sub manage {
        my $repeat = 0;

        $self->recompute_online_node_usage();
+       $self->recompute_online_node_stats();

-       foreach my $sid (sort keys %$ss) {
+       $self->get_service_stats($ss);
+
+       foreach my $sid (
+               #ordering vm by size, bigger mem first then bigger cpu
+               #could be improved with bubblesearch heuristic
+               #https://www.cs.tufts.edu/~nr/cs257/archive/michael-mitzenmacher/bubblesearch.pdf
+               sort { $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg} || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround} || $ss->{$a}->{type} cmp $ss->{$b}->{type}}
+               keys %$ss) {

maybe we can move that sort into a sub with some line breaks?
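
Roughly like this, maybe (a sketch keeping the ordering of the inline sort above):

sub sort_services_by_size {
    my ($ss) = @_;
    return sort {
        $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg}
            || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround}
            || $ss->{$a}->{type} cmp $ss->{$b}->{type}
    } keys %$ss;
}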

            my $sd = $ss->{$sid};
            my $cd = $sc->{$sid} || { state => 'disabled' };

@@ -802,12 +815,8 @@ sub next_state_recovery {

     $self->recompute_online_node_usage(); # we want the most current node state

-    my $recovery_node = select_service_node(
-       $self->{groups},
-       $self->{online_node_usage},
-       $cd,
-       $sd->{node},
-    );
+    my $storecfg = PVE::Storage::config();

that dependency cannot live here directly in the manager; off the top of my head I'd suggest
adding a check_service_constraints($sid, $node) plugin method to the env
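
Something along these lines (a sketch - the method name is from the suggestion
above, the body is hypothetical):

# in Manager.pm, without a direct PVE::Storage dependency:
return if !$haenv->check_service_constraints($sid, $node);

# in Env/PVE2.pm, behind the plugin boundary:
sub check_service_constraints {
    my ($self, $sid, $node) = @_;
    my $storecfg = PVE::Storage::config();
    # ... storage/bridge/CPU-model availability checks for $sid on $node ...
    return 1;
}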


+    my $recovery_node = find_bestfit_node_target($haenv, $sid, $cd , $sd->{node}, $sd->{stats}, $self->{online_node_usage}, $self->{online_node_stats}, $self->{groups}, $storecfg);

maybe just pass $self to avoid most of those parameters, iow. make it a blessed method of this
module, something like:

$self->find_bestfit_node_target($sid, $cd , $sd->{node}, $sd->{stats});

seems to me like a good balance between not passing every helper module/hash explicitly
again and keeping things a bit decoupled from the blessed module info.


     if ($recovery_node) {
        my $msg = "recover service '$sid' from fenced node '$fenced_node' to node '$recovery_node'";
@@ -822,6 +831,14 @@ sub next_state_recovery {
        $haenv->steal_service($sid, $sd->{node}, $recovery_node);
        $self->{online_node_usage}->{$recovery_node}++;

+       #add vm cpu/mem to current node stats (this is an estimation based on last 20min vm stats)
+       my $node_stats = $self->{online_node_stats}->{$recovery_node}->{stats};
+       $node_stats->{totalcpu} += $sd->{stats}->{totalcpu};
+       $node_stats->{mem} += $sd->{stats}->{mem};
+       $node_stats->{totalfreecpu} = (100 * $node_stats->{maxcpu}) - $node_stats->{totalcpu};
+       $node_stats->{freemem} =  $node_stats->{maxmem} - $node_stats->{mem};



+
+
        # NOTE: $sd *is normally read-only*, fencing is the exception
        $cd->{node} = $sd->{node} = $recovery_node;
        my $new_state = ($cd->{state} eq 'started') ? 'started' : 'request_stop';
@@ -839,4 +856,219 @@ sub next_state_recovery {
     }
 }

+
+sub dotprod {
+    my($vec_a, $vec_b, $mode) = @_;
+    die "they must have the same size\n" unless @$vec_a == @$vec_b;
+    $mode = "" if !$mode;
+    my $sum = 0;
+    my $norm_a = 0;
+    my $norm_b = 0;
+
+    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
+       my $a = @{$vec_a}[$i];
+       my $b = @{$vec_b}[$i];
+
+       $sum += $a * $b;
+       $norm_a += $a * $a;
+       $norm_b += $b * $b;
+    }
+
+    if($mode eq 'normR') {
+       return $sum / (sqrt($norm_a) * sqrt($norm_b))
+    } elsif ($mode eq 'normC') {
+       return $sum / $norm_b;
+    }
+    return $sum;
+}
+
+sub euclidean_distance {
+    my($vec_a, $vec_b) = @_;
+
+    my $sum = 0;
+
+    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
+        my $a = @{$vec_a}[$i];
+        my $b = @{$vec_b}[$i];
+        $sum += ($b - $a)**2;
+    }
+
+    return sqrt($sum);
+}

We can add a few helper methods to a new module in proxmox-perl-rs for above, could make
this quite a bit faster.

Also, above, and actually also below should live in a separate module in the future, but
I have nothing against keeping it here for development for now.
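
To make the best-fit selection concrete (made-up numbers, not from the patch):
for a service needing roughly (cpu=10, mem=2), with node A having (20, 4) free
and node B having (80, 32) free:

printf "%.1f\n", euclidean_distance([10, 2], [20, 4]);   # ~10.2
printf "%.1f\n", euclidean_distance([10, 2], [80, 32]);  # ~76.2 -> A wins (lower weight)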

+
+sub find_bestfit_node_target {
+    my($haenv, $sid, $cd, $nodename, $vm_stats, $online_node_usage, $online_nodes, $groups, $storecfg) = @_;
+
+    my (undef, $vmid) = split(/:/, $sid);

we have a helper for that IIRC

+
+    my $hagroup = get_service_group($groups, $online_nodes, $cd);
+    my ($pri_groups, $group_members_prio) = get_node_priority_groups($hagroup, $online_nodes);
+
+    my $target_nodes = {};
+    foreach my $nodename (keys %$online_nodes) {
+       my $node_stats = $online_nodes->{$nodename}->{stats};
+
+        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
+       next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);

I mean, most of those should be covered by the grouping already?

+
+       #### ADD prio and euclidean_distance weight
+       $target_nodes->{$nodename} = add_node_prio($nodename, 'distance', $node_stats, $vm_stats, $group_members_prio, $online_node_usage);
+    }
+
+    #order by soft_constraint_prio, hagroup prio, weight (Best fit algorithm, lower distance first), number of services, and nodename
+    my @target_array = sort {
+                               $target_nodes->{$b}->{prio} <=> $target_nodes->{$a}->{prio} ||
+                               $target_nodes->{$a}->{soft_constraint_prio} <=> $target_nodes->{$b}->{soft_constraint_prio} ||
+                               $target_nodes->{$a}->{weight} <=> $target_nodes->{$b}->{weight} ||
+                               $target_nodes->{$a}->{online_node_usage} <=> $target_nodes->{$b}->{online_node_usage} ||
+                               $target_nodes->{$a}->{name} cmp $target_nodes->{$b}->{name}
+                       } keys %$target_nodes;
+
+    my $target = $target_array[0];
+
+    return $target;
+}
+
+
+sub check_hard_constraints {
+    my ($haenv, $vmid, $cd, $node, $node_stats, $vm_stats, $storecfg, $group_members_prio) = @_;
+
+    #node need to have a prio(restricted group)
+    return if !defined($group_members_prio->{$node});
+
+    #vm can't start if host have less core
+    return if $node_stats->{maxcpu} < $vm_stats->{maxcpu};
+    #vm can't start if node don't have enough mem to handle vm max mem
+    return if $node_stats->{freemem} < $vm_stats->{maxmem};
+
+    #max 95% cpu/ram
+    my $mem_threshold = 0.95;
+    my $cpu_threshold = 0.95;
+
+    #check if target node have enough mem ressources under threshold
+    return if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
+
+    #check if target node have enough cpu ressources under threshold
+    return if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
+
+    #check storage availability
+    if ($cd->{type} eq 'vm') {
+       my $conf = undef;
+       eval { $conf = $haenv->read_vm_config($vmid); };
+       if (!$@) {
+           eval { PVE::QemuServer::check_storage_availability($storecfg, $conf, $node) };

a non-guarded PVE-env call cannot stay here in the manager

+           return if $@;
+       }
+
+    } elsif ($cd->{type} eq 'ct') {
+       my $conf = undef;
+       eval { $conf = $haenv->read_ct_config($vmid); };
+       #fixme : check storage for lxc too
+    }
+
+    # fixme: check bridge availability
+    # fixme: vm: add a check for cpumodel compatibility ?
+    return 1;
+}
+
+sub compute_soft_constraints {
+    my ($node_stats, $vm_stats) = @_;
+
+    #try to reach 80% max cpu/ram
+    my $mem_threshold = 0.8;
+    my $cpu_threshold = 0.8;
+
+    my $count = 0;
+    #check if target node have enough mem ressources under threshold
+    $count++ if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
+
+    #check if target node have enough cpu ressources under threshold
+    $count++ if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
+
+    #fixme : add antiaffinity
+
+    return $count;
+}
+
+sub add_node_prio {
+    my ($nodename, $method, $node_stats, $vm_stats, $group_members_prio, $online_node_usage) = @_;
+
+    #rounded values to compute vectors (cpu 0-100 , mem 0G-->XG)
+    my $vm_totalcpu = ceil($vm_stats->{totalcpu});
+    my $vm_mem = ceil($vm_stats->{mem}/1024/1024/1024);
+    my $node_freecpu = ceil($node_stats->{totalfreecpu});
+    my $node_freemem = ceil($node_stats->{freemem}/1024/1024/1024);
+
+    my @vec_vm = ($vm_totalcpu, $vm_mem);  #? add network usage dimension ?
+    my @vec_node = ($node_freecpu, $node_freemem); #? add network usage dimension ?
+    my $weight = 0;
+    if ($method eq 'distance') {
+       $weight = euclidean_distance(\@vec_vm,\@vec_node);
+    } elsif ($method eq 'dotprod') {
+       $weight = dotprod(\@vec_vm,\@vec_node);
+    }
+
+    my $node = {};
+    $node->{weight} = $weight;
+    $node->{soft_constraint_prio} = compute_soft_constraints($node_stats, $vm_stats);
+    $node->{prio} = $group_members_prio->{$nodename};
+    $node->{online_node_usage} = $online_node_usage->{$nodename};
+    $node->{name} = $nodename;
+
+    return $node;
+}
+
+sub get_service_stats {
+   my ($self, $ss) = @_;
+
+       foreach my $sid (sort keys %$ss) {
+
+           if ($sid =~ m/^(vm|ct|fa):(\d+)$/) {

Could use PVE::HA::Config::parse_sid
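
For reference, roughly (assuming parse_sid returns the sid, type and name in
list context):

my (undef, $type, $name) = PVE::HA::Config::parse_sid($sid);
$ss->{$sid}->{type} = $type;
$ss->{$sid}->{name} = $name;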

+               $ss->{$sid}->{type} = $1;
+               $ss->{$sid}->{name} = $2;
+           }
+
+           my $stats = {};
+           $stats->{cpu} = 0;
+           $stats->{maxcpu} = 0;
+           $stats->{mem} = 0;
+           $stats->{maxmem} = 0;

same here, maybe nested and (nit) I find it easier to read to directly declare the hash
in such cases:

my $stats = {
    cpu => ...
};

+
+           #avoid to compute all stats, as currently we only support recovery

ah okay, you already do it only on recovery - overlooked that when reading the cover
letter

+           if ($ss->{$sid}->{state} eq 'recovery') {
+
+               #get vm/ct stats 5min before on last 20min
+               $stats = $self->{haenv}->get_vm_rrd_stats($ss->{$sid}->{name}, 95);
+           }
+           #fixme: windows vm fill memory with zero at boot, so mem = maxmem
+
+           #rounded values for ordering

nit: ceil != round

+           $stats->{totalcpuround} = ceil($stats->{cpu} * 100 * $stats->{maxcpu});
+           $stats->{memg} = ceil( $stats->{mem} /1024 /1024 /1024);
+
+           $ss->{$sid}->{stats} = $stats;
+       }
+}
+
+sub recompute_online_node_stats {
+    my ($self) = @_;
+
+    my $online_node_stats = {};
+    my $online_nodes = $self->{ns}->list_online_nodes();
+
+    foreach my $node (@$online_nodes) {
+       my $stats = $self->{haenv}->get_node_rrd_stats($node);
+        $stats->{cpu} = 0 if !defined($stats->{cpu});
+        $stats->{maxcpu} = 0 if !defined($stats->{maxcpu});
+        $stats->{mem} = 0 if !defined($stats->{mem});
+        $stats->{maxmem} = 0 if !defined($stats->{maxmem});
+        $stats->{totalcpu} = $stats->{cpu} * 100 * $stats->{maxcpu}; #how to handle different cpu model power ? bogomips ?

BogoMips is for a single core loop IIRC, so it'd be `cores x bogo-mips` I guess.

+        $stats->{totalfreecpu} = (100 * $stats->{maxcpu}) - $stats->{totalcpu};
+        $stats->{freemem} = $stats->{maxmem} - $stats->{mem};
+        $online_node_stats->{$node}->{stats} = $stats;
+    }
+
+    $self->{online_node_stats} = $online_node_stats;
+}
+
 1;
diff --git a/src/PVE/HA/Sim/TestEnv.pm b/src/PVE/HA/Sim/TestEnv.pm
index 6718d8c..08f27c7 100644
--- a/src/PVE/HA/Sim/TestEnv.pm
+++ b/src/PVE/HA/Sim/TestEnv.pm
@@ -118,4 +118,31 @@ sub get_max_workers {
     return 0;
 }

+sub get_node_rrd_stats {
+    my ($self, $node) = @_;
+

It gets added in the next patch, but (else) it'd def. need a #FIXME comment ^^

That makes me think: from a patch/commit separation perspective it could be nice
to split this one into two,

1. add usage accounting for the specific recovery case
2. use that to actually find a low(er)-loaded node on recovery.

+    my $stats = {};
+    $stats->{cpu} = 0;
+    $stats->{maxcpu} = 0;
+    $stats->{mem} = 0;
+    $stats->{maxmem} = 0;
+
+    return $stats;
+}
+
+sub get_vm_rrd_stats {
+    my ($self, $vmid, $percentile) = @_;
+
+    my $stats = {};
+    $stats->{cpu} = 0;
+    $stats->{mem} = 0;
+    $stats->{maxmem} = 0;
+    $stats->{maxcpu} = 0;
+    $stats->{cpu} = $stats->{cpu} * 100;
+    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu};
+
+    return $stats;
+}
+
 1;



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager
  2021-12-13 10:58     ` DERUMIER, Alexandre
@ 2021-12-13 11:29       ` Thomas Lamprecht
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Lamprecht @ 2021-12-13 11:29 UTC (permalink / raw)
  To: Proxmox VE development discussion, DERUMIER, Alexandre, aderumier

Hi Alexandre,

On 13.12.21 11:58, DERUMIER, Alexandre wrote:
> Thanks for the fast review!
> 
> I'll rework the patches with your comments.
> Do you prefer separate commits/patches for each of your remarks, for easier review? Or can I send a v2 directly with the changes already merged?

A v2 with the fixes squashed in works fine for me; if you also think the commit split
I suggested is OK and do that one, great - else it's OK too.

> 
>> semi-related, Dietmar and I talked about adapting the RRD format and higher resolution
>> from PBS also for PVE, dropping the dependency on rrdcached, but PVE with the cluster
>> and broadcasting is quite a bit more complex compared to PBS, so the effort stalled.
>>
>> If we make more use of those metrics here it would add further coupling to the existing
>> interface so maybe we'd rethink that through - but nothing that needs to stall this
>> completely, changing RRD would be better done at a major release anyway where we
>> can handle this somewhat cleanly.
> 
> I think we had talked last year about a single RRD file per metric, but I'm not sure that's the best way
> (given the number of files it could generate).
> I would like to have other metrics in RRD, like cpu/mem pressure, or the true VM cgroup mem/cpu values.
> I think we already stream them, but the RRD part is not done yet.
> 

I'd like to benchmark and evaluate the in-memory key-value store of pmxcfs
to see if it's OK to further expand usage of that one. For PSI info it could be
OK to only keep the last value, as those are already averaged by the kernel,
so we could start out by just broadcasting the latest one outside of RRD - but
as said, I do not know if that really scales (better).

> 
> +    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
> +    $stats->{mem} = percentile($percentile, \@mem) || 0;
> +    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
> +    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
> +    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;
> 
>> what is cpu/maxcpu exactly, the multiplication of both strikes me as a bit weird?
> 
> maxcpu is the number of cores, so totalcpu is the number of cores * current percent usage * 100.
> (This is to be able to compare VMs/hosts with different numbers of cores.)
> But indeed, maybe I'm wrong with my maths ;) I need to check the ordering and compare code a bit more.
> 
> 

Ok, I was more confused because maxcpu and cpu had different units (count vs. load-percentage).

> 
> +        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
> +       next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);
> 
>> I mean, most of those should be covered by the grouping already?
> 
> Do you mean manual grouping by the user? If yes, the point here is to get it done automatically (for storage availability, number of host cores, maybe vmbr availability, ...), because this is a pain to set up.

Hmm, OK, groups can still make sense if the admin wants to partition the cluster,
e.g., if it's not homogeneous (e.g., hyper-converged with storage and compute nodes)
or the like.

> For example, currently, if you forget it, a VM with local storage could be migrated to another node by HA (with an error). Then the user doesn't know how to migrate it back to the original node.

I mean, it's definitely a pain point that users new to our HA stack run into,
so IMO good to improve.

Maybe the suggested `check_service_constraints` should then rather be a
`get_runnable_nodes($sid)` method in Env; not too sure yet though, they're both
roughly the same.

> 
> But it's also checking dynamic values like available cpu/ram on the host.
> 
> 

hmm, sounds like a third patch we could split out

1. commit: add basic constraint checking (storage and such) automatically
2. commit: add resource/node usage tracking code-infrastructure
3. commit: integrate that tracking into the constraint checker for dynamic
   target decisions
4. tests and such

As said, the split is not a must for me, but I think it could be a bit easier
to digest that way on review.

thx!




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-13 11:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
2021-12-13 10:04   ` Thomas Lamprecht
2021-12-13 10:58     ` DERUMIER, Alexandre
2021-12-13 11:29       ` Thomas Lamprecht
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0 Alexandre Derumier
2021-12-13  9:02 ` [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Thomas Lamprecht
