Date: Mon, 13 Dec 2021 11:04:28 +0100
From: Thomas Lamprecht
To: Proxmox VE development discussion, Alexandre Derumier
Subject: Re: [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager
Message-ID: <6402c0c7-20ac-82d5-ecf8-a0b20899e154@proxmox.com>
In-Reply-To: <20211213074316.2565139-2-aderumier@odiso.com>
References: <20211213074316.2565139-1-aderumier@odiso.com> <20211213074316.2565139-2-aderumier@odiso.com>

Some parts of your cover letter (basic approach/decisions) should go here, so
that they can be found more easily when working with git.

Some comments inline. Disclaimer: I pretty much read it straight down from
here, did not try things out too much, and I may well have overlooked
something.

On 13.12.21 08:43, Alexandre Derumier wrote:
> ---
>  src/PVE/HA/Env.pm         |  24 ++++
>  src/PVE/HA/Env/PVE2.pm    |  90 ++++++++++++++
>  src/PVE/HA/Manager.pm     | 246 ++++++++++++++++++++++++++++++++++++--
>  src/PVE/HA/Sim/TestEnv.pm |  27 +++++
>  4 files changed, 380 insertions(+), 7 deletions(-)
>
> diff --git a/src/PVE/HA/Env.pm b/src/PVE/HA/Env.pm
> index ac569a9..73b6407 100644
> --- a/src/PVE/HA/Env.pm
> +++ b/src/PVE/HA/Env.pm
> @@ -269,4 +269,28 @@ sub get_ha_settings {
>      return $self->{plug}->get_ha_settings();
>  }
>
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +
> +    return $self->{plug}->get_node_rrd_stats($node);
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    return $self->{plug}->get_vm_rrd_stats($vmid, $percentile);
> +}
> +
> +sub read_vm_config {
> +    my ($self, $vmid) = @_;
> +
> +    return $self->{plug}->read_vm_config($vmid);

I'd like to avoid returning the whole guest config; rather, return an abstract
format that only sets the keys we actually require, like "memory" and "cpu".
That'd make it better defined and easier to see what the possible impact
factors are - and it's also slightly easier to add to the test/sim system that
way.
That is, after all, the basic job of the Env plugin system: transform the
mixed/complex PVE metrics and controls into a simpler abstraction for internal
use, making them easy to simulate/test.

> +}
> +
> +sub read_ct_config {
> +    my ($self, $vmid) = @_;
> +
> +    return $self->{plug}->read_ct_config($vmid);
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Env/PVE2.pm b/src/PVE/HA/Env/PVE2.pm
> index 5e0a683..2e1585c 100644
> --- a/src/PVE/HA/Env/PVE2.pm
> +++ b/src/PVE/HA/Env/PVE2.pm
> @@ -9,9 +9,14 @@ use IO::Socket::UNIX;
>  use PVE::SafeSyslog;
>  use PVE::Tools;
>  use PVE::Cluster qw(cfs_register_file cfs_read_file cfs_write_file cfs_lock_file);
> +use PVE::Cluster;

PVE::Cluster is already imported one line above.

>  use PVE::DataCenterConfig;
>  use PVE::INotify;
>  use PVE::RPCEnvironment;
> +use PVE::API2Tools;
> +use PVE::QemuConfig;
> +use PVE::LXC::Config;
> +use RRDs;
>
>  use PVE::HA::Tools ':exit_codes';
>  use PVE::HA::Env;
> @@ -459,4 +464,89 @@ sub get_ha_settings {
>      return $datacenterconfig->{ha};
>  }
>
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +
> +    my $rrd = PVE::Cluster::rrd_dump();
> +    my $members = PVE::Cluster::get_members();
> +
> +    my $stats = PVE::API2Tools::extract_node_stats($node, $members, $rrd);
> +
> +    return $stats;
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    my $rrdname = "pve2-vm/$vmid";
> +    my $rrddir = "/var/lib/rrdcached/db";
> +
> +    my $rrd = "$rrddir/$rrdname";
> +
> +    my $cf = "AVERAGE";
> +
> +    my $reso = 60;
> +    my $ctime = $reso*int(time()/$reso);
> +
> +    #last 20minutes
> +    my $req_start = $ctime - $reso*20;
> +    my $req_end = $ctime - $reso*1;
> +
> +    my @args = (
> +        "-s" => $req_start,
> +        "-e" => $req_end,
> +        "-r" => $reso,
> +    );
> +
> +    my $socket = "/var/run/rrdcached.sock";
> +    push @args, "--daemon" => "unix:$socket" if -S $socket;
> +
> +    my ($start, $step, $names, $data) = RRDs::fetch($rrd, $cf, @args);

Semi-related: Dietmar and I talked about adapting the RRD format and higher
resolution from PBS also for PVE, dropping the dependency on rrdcached; but
PVE, with the cluster and broadcasting, is quite a bit more complex than PBS,
so that effort stalled. If we make more use of those metrics here, it would
add further coupling to the existing interface, so maybe we'd want to think
that through again - but nothing that needs to stall this completely; changing
RRD would be better done at a major release anyway, where we can handle it
somewhat cleanly.

> +
> +    my @cpu = ();
> +    my @mem = ();
> +    my @maxmem = ();
> +    my @maxcpu = ();
> +
> +    foreach my $rec (@$data) {
> +        my $maxcpu = @$rec[0] || 0;
> +        my $cpu = @$rec[1] || 0;
> +        my $maxmem = @$rec[2] || 0;
> +        my $mem = @$rec[3] || 0;
> +        #skip zeros values if vm is down
> +        push @cpu, $cpu*$maxcpu if $cpu > 0;
> +        push @mem, $mem if $mem > 0;
> +        push @maxcpu, $maxcpu if $maxcpu > 0;
> +        push @maxmem, $maxmem if $maxmem > 0;
> +    }
> +
> +    my $stats = {};

Maybe create this nested:

    my $stats = {
        cpu => {
            avg => percentile($percentile, \@cpu) || 0,
            max => percentile($percentile, \@maxcpu) || 0,
            total => ...,
        },
    };

> +    $stats->{cpu} = percentile($percentile, \@cpu) || 0;
> +    $stats->{mem} = percentile($percentile, \@mem) || 0;
> +    $stats->{maxmem} = percentile($percentile, \@maxmem) || 0;
> +    $stats->{maxcpu} = percentile($percentile, \@maxcpu) || 0;
> +    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu} * 100;

What is cpu/maxcpu exactly? The multiplication of both strikes me as a bit
weird.
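To spell out why the multiplication looks odd, here's a small standalone sketch with made-up numbers; the assumption (not stated in the patch) is that $cpu is a 0..1 usage fraction from the RRD and $maxcpu is the guest's core count:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical numbers, assuming $cpu is a 0..1 usage fraction and
# $maxcpu the core count (an assumption the question above is probing).
my ($cpu, $maxcpu) = (0.5, 4);    # 50% usage on a 4-core guest

# The fetch loop already pushes $cpu * $maxcpu into @cpu ...
my $sample = $cpu * $maxcpu;              # 2

# ... and totalcpu then multiplies the percentile of that by $maxcpu
# again, so the core count enters twice:
my $totalcpu = $sample * $maxcpu * 100;   # 800

# versus applying the core count only once:
my $once = $cpu * $maxcpu * 100;          # 200
```

Under that reading the core count would end up squared in totalcpu, which may or may not be intended.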
> +
> +    return $stats;
> +}
> +
> +sub percentile {
> +    my ($p, $aref) = @_;
> +    my $percentile = int($p * $#{$aref}/100);
> +    return (sort @$aref)[$percentile];
> +}
> +
> +sub read_vm_config {
> +    my ($self, $vmid) = @_;
> +
> +    return PVE::QemuConfig->load_config($vmid);
> +}
> +
> +sub read_ct_config {
> +    my ($self, $vmid) = @_;
> +
> +    return PVE::LXC::Config->load_config($vmid);
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
> index 1c66b43..ae5fbcb 100644
> --- a/src/PVE/HA/Manager.pm
> +++ b/src/PVE/HA/Manager.pm
> @@ -1,8 +1,13 @@
> +
>  package PVE::HA::Manager;
>
>  use strict;
>  use warnings;
>  use Digest::MD5 qw(md5_base64);
> +use RRDs;
> +use POSIX qw/ceil/;
> +use PVE::API2Tools;
> +use PVE::Storage;

The above two are not necessarily available in the test/sim environment;
whatever needs them must be encapsulated in the PVE2 Env module.

>
>  use PVE::Tools;
>  use PVE::HA::Tools ':exit_codes';
> @@ -394,8 +399,16 @@ sub manage {
>      my $repeat = 0;
>
>      $self->recompute_online_node_usage();
> +    $self->recompute_online_node_stats();
>
> -    foreach my $sid (sort keys %$ss) {
> +    $self->get_service_stats($ss);
> +
> +    foreach my $sid (
> +        #ordering vm by size, bigger mem first then bigger cpu
> +        #could be improved with bubblesearch heuristic
> +        #https://www.cs.tufts.edu/~nr/cs257/archive/michael-mitzenmacher/bubblesearch.pdf
> +        sort { $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg} || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround} || $ss->{$a}->{type} cmp $ss->{$b}->{type}}
> +        keys %$ss) {

Maybe we can move that sort into a sub, with some line breaks?
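For illustration, the criterion could be pulled out along these lines (sub name and placement made up here, the comparisons are copied verbatim from the patch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: named sub wrapping the placement ordering, with the
# comparator chain from the patch broken over several lines.
sub sort_services_for_placement {
    my ($ss) = @_;

    return sort {
        $ss->{$a}->{stats}->{memg} <=> $ss->{$b}->{stats}->{memg}
            || $ss->{$a}->{stats}->{totalcpuround} <=> $ss->{$b}->{stats}->{totalcpuround}
            || $ss->{$a}->{type} cmp $ss->{$b}->{type}
    } keys %$ss;
}

# Tiny usage example with made-up service stats:
my $ss = {
    'vm:100' => { type => 'vm', stats => { memg => 4, totalcpuround => 50 } },
    'vm:101' => { type => 'vm', stats => { memg => 1, totalcpuround => 300 } },
};
my @ordered = sort_services_for_placement($ss);    # ('vm:101', 'vm:100')
```

Note that, as written in the patch, the numeric comparisons are ascending, so the smaller-memory service sorts first, despite what the "bigger mem first" comment says.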
>      my $sd = $ss->{$sid};
>      my $cd = $sc->{$sid} || { state => 'disabled' };
>
> @@ -802,12 +815,8 @@ sub next_state_recovery {
>
>      $self->recompute_online_node_usage();   # we want the most current node state
>
> -    my $recovery_node = select_service_node(
> -        $self->{groups},
> -        $self->{online_node_usage},
> -        $cd,
> -        $sd->{node},
> -    );
> +    my $storecfg = PVE::Storage::config();

That dependency cannot live here directly in the manager; off the top of my
head, I'd suggest adding a check_service_constraints($sid, $node) plugin
method to the env instead.

> +    my $recovery_node = find_bestfit_node_target($haenv, $sid, $cd , $sd->{node}, $sd->{stats}, $self->{online_node_usage}, $self->{online_node_stats}, $self->{groups}, $storecfg);

Maybe just pass $self to avoid most of those parameters, in other words, make
it a blessed method of this module, something like:

    $self->find_bestfit_node_target($sid, $cd, $sd->{node}, $sd->{stats});

That seems like a good balance between not passing every helper module/hash
explicitly again and keeping it a bit decoupled from the blessed module info.

>
>      if ($recovery_node) {
>          my $msg = "recover service '$sid' from fenced node '$fenced_node' to node '$recovery_node'";
> @@ -822,6 +831,14 @@ sub next_state_recovery {
>          $haenv->steal_service($sid, $sd->{node}, $recovery_node);
>          $self->{online_node_usage}->{$recovery_node}++;
>
> +        #add vm cpu/mem to current node stats (this is an estimation based on last 20min vm stats)
> +        my $node_stats = $self->{online_node_stats}->{$recovery_node}->{stats};
> +        $node_stats->{totalcpu} += $sd->{stats}->{totalcpu};
> +        $node_stats->{mem} += $sd->{stats}->{mem};
> +        $node_stats->{totalfreecpu} = (100 * $node_stats->{maxcpu}) - $node_stats->{totalcpu};
> +        $node_stats->{freemem} = $node_stats->{maxmem} - $node_stats->{mem};
> +
> +
>          # NOTE: $sd *is normally read-only*, fencing is the exception
>          $cd->{node} = $sd->{node} = $recovery_node;
>          my $new_state = ($cd->{state} eq 'started') ?
>              'started' : 'request_stop';
> @@ -839,4 +856,219 @@ sub next_state_recovery {
>      }
>  }
>
> +
> +sub dotprod {
> +    my($vec_a, $vec_b, $mode) = @_;
> +    die "they must have the same size\n" unless @$vec_a == @$vec_b;
> +    $mode = "" if !$mode;
> +    my $sum = 0;
> +    my $norm_a = 0;
> +    my $norm_b = 0;
> +
> +    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
> +        my $a = @{$vec_a}[$i];
> +        my $b = @{$vec_b}[$i];
> +
> +        $sum += $a * $b;
> +        $norm_a += $a * $a;
> +        $norm_b += $b * $b;
> +    }
> +
> +    if($mode eq 'normR') {
> +        return $sum / (sqrt($norm_a) * sqrt($norm_b))
> +    } elsif ($mode eq 'normC') {
> +        return $sum / $norm_b;
> +    }
> +    return $sum;
> +}
> +
> +sub euclidean_distance {
> +    my($vec_a, $vec_b) = @_;
> +
> +    my $sum = 0;
> +
> +    for(my $i=0; $i < scalar @{$vec_a}; $i++) {
> +        my $a = @{$vec_a}[$i];
> +        my $b = @{$vec_b}[$i];
> +        $sum += ($b - $a)**2;
> +    }
> +
> +    return sqrt($sum);
> +}

We could add a few helper methods for the above to a new module in
proxmox-perl-rs, which could make this quite a bit faster. Also, the above
(and actually also the code below) should live in a separate module in the
future, but I have nothing against keeping it here during development for now.

> +
> +sub find_bestfit_node_target {
> +    my($haenv, $sid, $cd, $nodename, $vm_stats, $online_node_usage, $online_nodes, $groups, $storecfg) = @_;
> +
> +    my (undef, $vmid) = split(/:/, $sid);

We have a helper for that, IIRC.

> +
> +    my $hagroup = get_service_group($groups, $online_nodes, $cd);
> +    my ($pri_groups, $group_members_prio) = get_node_priority_groups($hagroup, $online_nodes);
> +
> +    my $target_nodes = {};
> +    foreach my $nodename (keys %$online_nodes) {
> +        my $node_stats = $online_nodes->{$nodename}->{stats};
> +
> +        #### FILTERING NODES WITH HARD CONSTRAINTS (vm can't be started)
> +        next if !check_hard_constraints($haenv, $vmid, $cd, $nodename, $node_stats, $vm_stats, $storecfg, $group_members_prio);

I mean, most of those should be covered by the grouping already?
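Unrelated to the constraint question just above, and going back to the two vector helpers: they behave as expected on trivial inputs; here's a quick standalone sanity check (subs copied from the patch, lightly reformatted):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dotprod/euclidean_distance as in the patch, in a self-contained snippet.
sub dotprod {
    my ($vec_a, $vec_b, $mode) = @_;
    die "they must have the same size\n" unless @$vec_a == @$vec_b;
    $mode = "" if !$mode;
    my ($sum, $norm_a, $norm_b) = (0, 0, 0);
    for (my $i = 0; $i < scalar @{$vec_a}; $i++) {
        my $x = @{$vec_a}[$i];
        my $y = @{$vec_b}[$i];
        $sum += $x * $y;
        $norm_a += $x * $x;
        $norm_b += $y * $y;
    }
    return $sum / (sqrt($norm_a) * sqrt($norm_b)) if $mode eq 'normR';
    return $sum / $norm_b if $mode eq 'normC';
    return $sum;
}

sub euclidean_distance {
    my ($vec_a, $vec_b) = @_;
    my $sum = 0;
    for (my $i = 0; $i < scalar @{$vec_a}; $i++) {
        $sum += (@{$vec_b}[$i] - @{$vec_a}[$i])**2;
    }
    return sqrt($sum);
}

print dotprod([1, 2], [3, 4]), "\n";             # 11
print euclidean_distance([0, 0], [3, 4]), "\n";  # 5
```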
>
> +        #### ADD prio and euclidean_distance weight
> +        $target_nodes->{$nodename} = add_node_prio($nodename, 'distance', $node_stats, $vm_stats, $group_members_prio, $online_node_usage);
> +    }
> +
> +    #order by soft_constraint_prio, hagroup prio, weight (Best fit algorithm, lower distance first), number of services, and nodename
> +    my @target_array = sort {
> +        $target_nodes->{$b}->{prio} <=> $target_nodes->{$a}->{prio} ||
> +        $target_nodes->{$a}->{soft_constraint_prio} <=> $target_nodes->{$b}->{soft_constraint_prio} ||
> +        $target_nodes->{$a}->{weight} <=> $target_nodes->{$b}->{weight} ||
> +        $target_nodes->{$a}->{online_node_usage} <=> $target_nodes->{$b}->{online_node_usage} ||
> +        $target_nodes->{$a}->{name} cmp $target_nodes->{$b}->{name}
> +    } keys %$target_nodes;
> +
> +    my $target = $target_array[0];
> +
> +    return $target;
> +}
> +
> +
> +sub check_hard_constraints {
> +    my ($haenv, $vmid, $cd, $node, $node_stats, $vm_stats, $storecfg, $group_members_prio) = @_;
> +
> +    #node need to have a prio(restricted group)
> +    return if !defined($group_members_prio->{$node});
> +
> +    #vm can't start if host have less core
> +    return if $node_stats->{maxcpu} < $vm_stats->{maxcpu};
> +    #vm can't start if node don't have enough mem to handle vm max mem
> +    return if $node_stats->{freemem} < $vm_stats->{maxmem};
> +
> +    #max 95% cpu/ram
> +    my $mem_threshold = 0.95;
> +    my $cpu_threshold = 0.95;
> +
> +    #check if target node have enough mem ressources under threshold
> +    return if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
> +
> +    #check if target node have enough cpu ressources under threshold
> +    return if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
> +
> +    #check storage availability
> +    if ($cd->{type} eq 'vm') {
> +        my $conf = undef;
> +        eval { $conf = $haenv->read_vm_config($vmid); };
> +        if (!$@) {
> +            eval { PVE::QemuServer::check_storage_availability($storecfg, $conf, $node) };

This non-guarded pve-env call cannot stay
here in the manager.

> +            return if $@;
> +        }
> +
> +    } elsif ($cd->{type} eq 'ct') {
> +        my $conf = undef;
> +        eval { $conf = $haenv->read_ct_config($vmid); };
> +        #fixme : check storage for lxc too
> +    }
> +
> +    # fixme: check bridge availability
> +    # fixme: vm: add a check for cpumodel compatibility ?
> +    return 1;
> +}
> +
> +sub compute_soft_constraints {
> +    my ($node_stats, $vm_stats) = @_;
> +
> +    #try to reach 80% max cpu/ram
> +    my $mem_threshold = 0.8;
> +    my $cpu_threshold = 0.8;
> +
> +    my $count = 0;
> +    #check if target node have enough mem ressources under threshold
> +    $count++ if $node_stats->{freemem} * $mem_threshold < $vm_stats->{mem};
> +
> +    #check if target node have enough cpu ressources under threshold
> +    $count++ if $node_stats->{totalfreecpu} * $cpu_threshold < $vm_stats->{totalcpu};
> +
> +    #fixme : add antiaffinity
> +
> +    return $count;
> +}
> +
> +sub add_node_prio {
> +    my ($nodename, $method, $node_stats, $vm_stats, $group_members_prio, $online_node_usage) = @_;
> +
> +    #rounded values to compute vectors (cpu 0-100 , mem 0G-->XG)
> +    my $vm_totalcpu = ceil($vm_stats->{totalcpu});
> +    my $vm_mem = ceil($vm_stats->{mem}/1024/1024/1024);
> +    my $node_freecpu = ceil($node_stats->{totalfreecpu});
> +    my $node_freemem = ceil($node_stats->{freemem}/1024/1024/1024);
> +
> +    my @vec_vm = ($vm_totalcpu, $vm_mem);           #? add network usage dimension ?
> +    my @vec_node = ($node_freecpu, $node_freemem);  #? add network usage dimension ?
>
> +    my $weight = 0;
> +    if ($method eq 'distance') {
> +        $weight = euclidean_distance(\@vec_vm,\@vec_node);
> +    } elsif ($method eq 'dotprod') {
> +        $weight = dotprod(\@vec_vm,\@vec_node);
> +    }
> +
> +    my $node = {};
> +    $node->{weight} = $weight;
> +    $node->{soft_constraint_prio} = compute_soft_constraints($node_stats, $vm_stats);
> +    $node->{prio} = $group_members_prio->{$nodename};
> +    $node->{online_node_usage} = $online_node_usage->{$nodename};
> +    $node->{name} = $nodename;
> +
> +    return $node;
> +}
> +
> +sub get_service_stats {
> +    my ($self, $ss) = @_;
> +
> +    foreach my $sid (sort keys %$ss) {
> +
> +        if ($sid =~ m/^(vm|ct|fa):(\d+)$/) {

Could use PVE::HA::Config::parse_sid here.

> +            $ss->{$sid}->{type} = $1;
> +            $ss->{$sid}->{name} = $2;
> +        }
> +
> +        my $stats = {};
> +        $stats->{cpu} = 0;
> +        $stats->{maxcpu} = 0;
> +        $stats->{mem} = 0;
> +        $stats->{maxmem} = 0;

Same here: maybe nest it, and (nit) I find it easier to read to directly
declare the hash in such cases:

    my $stats = {
        cpu => ...
    };

> +
> +        #avoid to compute all stats, as currently we only support recovery

Ah okay, you already do it only on recovery - I overlooked that when reading
the cover letter.

> +        if ($ss->{$sid}->{state} eq 'recovery') {
> +
> +            #get vm/ct stats 5min before on last 20min
> +            $stats = $self->{haenv}->get_vm_rrd_stats($ss->{$sid}->{name}, 95);
> +        }
> +        #fixme: windows vm fill memory with zero at boot, so mem = maxmem
> +
> +        #rounded values for ordering

nit: ceil != round

> +        $stats->{totalcpuround} = ceil($stats->{cpu} * 100 * $stats->{maxcpu});
> +        $stats->{memg} = ceil( $stats->{mem} /1024 /1024 /1024);
> +
> +        $ss->{$sid}->{stats} = $stats;
> +    }
> +}
> +
> +sub recompute_online_node_stats {
> +    my ($self) = @_;
> +
> +    my $online_node_stats = {};
> +    my $online_nodes = $self->{ns}->list_online_nodes();
> +
> +    foreach my $node (@$online_nodes) {
> +        my $stats = $self->{haenv}->get_node_rrd_stats($node);
> +        $stats->{cpu} = 0 if !defined($stats->{cpu});
> +        $stats->{maxcpu} = 0 if !defined($stats->{maxcpu});
> +        $stats->{mem} = 0 if !defined($stats->{mem});
> +        $stats->{maxmem} = 0 if !defined($stats->{maxmem});
> +        $stats->{totalcpu} = $stats->{cpu} * 100 * $stats->{maxcpu};  #how to handle different cpu model power ? bogomips ?

BogoMIPS is for a single-core loop IIRC, so it'd be `cores x bogo-mips`, I
guess.

> +        $stats->{totalfreecpu} = (100 * $stats->{maxcpu}) - $stats->{totalcpu};
> +        $stats->{freemem} = $stats->{maxmem} - $stats->{mem};
> +        $online_node_stats->{$node}->{stats} = $stats;
> +    }
> +
> +    $self->{online_node_stats} = $online_node_stats;
> +}
> +
>  1;
> diff --git a/src/PVE/HA/Sim/TestEnv.pm b/src/PVE/HA/Sim/TestEnv.pm
> index 6718d8c..08f27c7 100644
> --- a/src/PVE/HA/Sim/TestEnv.pm
> +++ b/src/PVE/HA/Sim/TestEnv.pm
> @@ -118,4 +118,31 @@ sub get_max_workers {
>      return 0;
>  }
>
> +sub get_node_rrd_stats {
> +    my ($self, $node) = @_;
> +

It gets added in the next patch, but (else) it'd def.
need a #FIXME comment ^^

That makes me think: from a patch/commit separation perspective it could be
nice to split this one into two:

1. add usage accounting for the specific recovery case
2. use that to actually find a low(er)-loaded node on recovery

> +    my $stats = {};
> +    $stats->{cpu} = 0;
> +    $stats->{maxcpu} = 0;
> +    $stats->{mem} = 0;
> +    $stats->{maxmem} = 0;
> +
> +    return $stats;
> +}
> +
> +sub get_vm_rrd_stats {
> +    my ($self, $vmid, $percentile) = @_;
> +
> +    my $stats = {};
> +    $stats->{cpu} = 0;
> +    $stats->{mem} = 0;
> +    $stats->{maxmem} = 0;
> +    $stats->{maxcpu} = 0;
> +    $stats->{cpu} = $stats->{cpu} * 100;
> +    $stats->{totalcpu} = $stats->{cpu} * $stats->{maxcpu};
> +
> +    return $stats;
> +}
> +
>  1;