public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager
@ 2021-12-13  7:43 Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

Hi,

this is a proof of concept for a resource-aware HA manager.

The current implementation is very basic:
it simply balances the number of services across the nodes.

I have had real production cases where a node failed and the restarted VMs
impacted other nodes because of excessive CPU/RAM usage.


This new implementation uses a best-fit vector packing heuristic with support for constraints.


- We compute the average memory/CPU stats of the nodes and of the VMs over the last 20 minutes.

For each resource:
- First, we order the services pending recovery by memory, then by CPU usage.
  Memory is more important here, because a VM can't start if the target node doesn't have enough memory.

- Then, we check the constraints on the possible target nodes (storage available, node has enough CPU/RAM, node has enough cores, ...).
  (This could be extended with other constraints such as VM affinity/anti-affinity, CPU compatibility, ...)

- Then we compute a weight for each node: the Euclidean distance between the VM's CPU/RAM usage vector and the node's available CPU/RAM vector.
  We then choose the first node with the lowest Euclidean distance.
  (Ex: if the VM uses 1 GB RAM/1% CPU, node1 has 2 GB RAM/2% CPU, and node2 has 4 GB RAM/4% CPU, node1 will be chosen because it is nearest to the VM usage.)

- We add the recovered VM's CPU/RAM to the target node's stats. (This is only a best-effort estimate, as the VM start is asynchronous on the target LRM and could fail, ...)


I have kept the HA group node priority and the other ordering criteria,
so this doesn't break the current tests, and we can easily add a datacenter option to enable/disable it.

It should be easy to later implement some kind of automatic VM migration when a node uses too much CPU/RAM,
reusing the same node selection algorithm.

I have added a basic test; I'll add more tests later if this patch series is OK for you.



Some good literature about heuristics:

Microsoft Hyper-V implementation:
 - http://kunaltalwar.org/papers/VBPacking.pdf
 - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
Variable size vector bin packing heuristics:
 - https://hal.archives-ouvertes.fr/hal-00868016v2/document


Alexandre Derumier (3):
  add ressource awareness manager
  tests: add support for ressources
  add test-basic0

 src/PVE/HA/Env.pm                    |  24 +++
 src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
 src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
 src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
 src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
 src/test/test-basic0/README          |   1 +
 src/test/test-basic0/cmdlist         |   4 +
 src/test/test-basic0/hardware_status |   5 +
 src/test/test-basic0/log.expect      |  52 ++++++
 src/test/test-basic0/manager_status  |   1 +
 src/test/test-basic0/node_stats      |   5 +
 src/test/test-basic0/service_config  |   5 +
 src/test/test-basic0/service_stats   |   5 +
 13 files changed, 528 insertions(+), 7 deletions(-)
 create mode 100644 src/test/test-basic0/README
 create mode 100644 src/test/test-basic0/cmdlist
 create mode 100644 src/test/test-basic0/hardware_status
 create mode 100644 src/test/test-basic0/log.expect
 create mode 100644 src/test/test-basic0/manager_status
 create mode 100644 src/test/test-basic0/node_stats
 create mode 100644 src/test/test-basic0/service_config
 create mode 100644 src/test/test-basic0/service_stats

-- 
2.30.2





Thread overview: 8+ messages
2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
2021-12-13 10:04   ` Thomas Lamprecht
2021-12-13 10:58     ` DERUMIER, Alexandre
2021-12-13 11:29       ` Thomas Lamprecht
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0 Alexandre Derumier
2021-12-13  9:02 ` [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Thomas Lamprecht
