all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager
@ 2021-12-13  7:43 Alexandre Derumier
  2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexandre Derumier @ 2021-12-13  7:43 UTC (permalink / raw)
  To: pve-devel

Hi,

this is a proof of concept to implement ressource aware HA.

The current implementation is really basic,
simply balancing the number of services on each node.

I had some real production cases, where a node is failing, and restarted vm
impact others nodes because of too much cpu/ram usage.


This new implementation use best-fit heuristic vector packing with constraints support.


- We compute nodes memory/cpu, and vm memory/cpu average stats  on last 20min

For each ressource :
- First, we ordering pending recovery state services by memory, then cpu usage.
  Memory is more important here, because vm can't start if target node don't have enough memory

- Then, we check possible target nodes contraints. (storage available, node have enough cpu/ram, node have enough cores,...)
  (could be extended with other constraint like vm affinity/anti-affinity, cpu compatibilty, ...)

- Then we compute a node weight with euclidean distance of both cpu/ram vectors between vm usage and node available ressources.
  Then we choose the first node with the lower eucliean distance weight.
  (Ex: if vm use 1go ram/1% cpu, node1 have 2go ram/2% cpu , and node2 have 4go ram/4% cpu,  node1 will be choose because it's the nearest of vm usage)

- We add recovered vm cpu/ram to target node stats. (This is only an best effort estimation, as the vm start is async on target lrm, and could failed,...)


I have keeped HA group node prio, and other other ordering,
so this don't break current tests, and we can add easily a option at datacenter to enable/disable

It could be easy to implement later some kind of vm auto migration when a node use too much cpu/ram,
reusing same node selection algorithm

I have added a basic test, I'll add more tests later if this patch serie is ok for you.



Some good litterature about heuristics:

microsoft hyper-v implementation: 
 - http://kunaltalwar.org/papers/VBPacking.pdf
 - https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/virtualization.pdf
Variable size vector bin packing heuristics:
 - https://hal.archives-ouvertes.fr/hal-00868016v2/document


Alexandre Derumier (3):
  add ressource awareness manager
  tests: add support for ressources
  add test-basic0

 src/PVE/HA/Env.pm                    |  24 +++
 src/PVE/HA/Env/PVE2.pm               |  90 ++++++++++
 src/PVE/HA/Manager.pm                | 246 ++++++++++++++++++++++++++-
 src/PVE/HA/Sim/Hardware.pm           |  61 +++++++
 src/PVE/HA/Sim/TestEnv.pm            |  36 ++++
 src/test/test-basic0/README          |   1 +
 src/test/test-basic0/cmdlist         |   4 +
 src/test/test-basic0/hardware_status |   5 +
 src/test/test-basic0/log.expect      |  52 ++++++
 src/test/test-basic0/manager_status  |   1 +
 src/test/test-basic0/node_stats      |   5 +
 src/test/test-basic0/service_config  |   5 +
 src/test/test-basic0/service_stats   |   5 +
 13 files changed, 528 insertions(+), 7 deletions(-)
 create mode 100644 src/test/test-basic0/README
 create mode 100644 src/test/test-basic0/cmdlist
 create mode 100644 src/test/test-basic0/hardware_status
 create mode 100644 src/test/test-basic0/log.expect
 create mode 100644 src/test/test-basic0/manager_status
 create mode 100644 src/test/test-basic0/node_stats
 create mode 100644 src/test/test-basic0/service_config
 create mode 100644 src/test/test-basic0/service_stats

-- 
2.30.2




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-13 11:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-13  7:43 [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 1/3] add ressource awareness manager Alexandre Derumier
2021-12-13 10:04   ` Thomas Lamprecht
2021-12-13 10:58     ` DERUMIER, Alexandre
2021-12-13 11:29       ` Thomas Lamprecht
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 2/3] tests: add support for ressources Alexandre Derumier
2021-12-13  7:43 ` [pve-devel] [PATCH pve-ha-manager 3/3] add test-basic0 Alexandre Derumier
2021-12-13  9:02 ` [pve-devel] [PATCH pve-ha-manager 0/3] POC/RFC: ressource aware HA manager Thomas Lamprecht

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal