Re: [PVE-User] Locking HA during UPS shutdown

public inbox for pve-user@lists.proxmox.com

* Re: [PVE-User] Locking HA during UPS shutdown
       [not found] <mailman.34.1646911206.440.pve-user@lists.proxmox.com>
@ 2022-03-10 11:50 ` admins
       [not found]   ` <AC07A433-7E37-420B-97E1-2314F97C022A@me.com>
  0 siblings, 1 reply; 5+ messages in thread
From: admins @ 2022-03-10 11:50 UTC (permalink / raw)
  To: Proxmox VE user list; +Cc: PVE User List, Stefan Radman

Hi, 

Here are two ideas: a staggered shutdown sequence, or a command sequence.
1: For a staggered shutdown, configure NUT on each node to only monitor the UPS power, then have each node shut itself down at a different battery level, e.g. node1 at 15% battery, node2 at 10% battery, and so on.
2: Set up a command sequence that first puts the PVE node into maintenance mode and then runs the shutdown. That way HA will not try to migrate VMs to a node in maintenance, and the chance of all nodes entering maintenance in exactly the same second seems negligible.

Hope that's helpful.

Regards,
Sto.

> On Mar 10, 2022, at 1:10 PM, Stefan Radman via pve-user <pve-user@lists.proxmox.com> wrote:
> 
> 
> From: Stefan Radman <stefan.radman@me.com>
> Subject: Locking HA during UPS shutdown
> Date: March 10, 2022 at 1:10:09 PM GMT+2
> To: PVE User List <pve-user@pve.proxmox.com>
> 
> 
> Hi 
> 
> I am configuring a 3 node PVE cluster with integrated Ceph storage.
> 
> It is powered by 2 UPS that are monitored by NUT (Network UPS Tools).
> 
> HA is configured with 3 groups:
> group pve1 nodes pve1:1,pve2,pve3
> group pve2 nodes pve1,pve2:1,pve3
> group pve3 nodes pve1,pve2,pve3:1
> 
> That will normally place the VMs in each group on the corresponding node, unless that node fails.
> 
> The cluster is configured to migrate VMs away from a node before shutting it down (Cluster=>Options=>HA Settings: shutdown_policy=migrate).
> 
> NUT is configured to shut down the servers once the last of the two UPS is running low on battery.
> 
> My problem:
> When NUT starts shutting down the 3 nodes, HA will first try to live-migrate the VMs to another node.
> That live migration process gets stuck because all the nodes are shutting down simultaneously.
> It seems that the whole process runs into a timeout, finally “powers off” all the VMs and shuts down the nodes.
> 
> My question:
> Is there a way to “lock” or temporarily de-activate HA before shutting down a node to avoid that deadlock?
> 
> Thank you
> 
> Stefan
> 
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Best Regards,

Stoyan Stoyanov Sto | Solutions Manager
| Telehouse.Solutions | ICT Department
| phone/viber:  +359 894774934 <tel:+359 894774934>
| telegram:  @prostoSto <https://mysignature.io/redirect/skype:prosto.sto?chat>
| skype:  prosto.sto <https://mysignature.io/redirect/skype:prosto.sto?chat>
| email:  sto@telehouse.solutions <mailto:sto@telehouse.solutions>
| website: www.telehouse.solutions <https://mysig.io/MTRmMTg>
| address: Telepoint #2, Sofia, Bulgaria








^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PVE-User] Locking HA during UPS shutdown
       [not found]   ` <AC07A433-7E37-420B-97E1-2314F97C022A@me.com>
@ 2022-03-10 12:48     ` admins
  2022-03-10 13:48       ` admins
  0 siblings, 1 reply; 5+ messages in thread
From: admins @ 2022-03-10 12:48 UTC (permalink / raw)
  To: Stefan Radman; +Cc: Proxmox VE user list, PVE User List

I don’t remember offhand. Search the man pages of pvecm and the other pve* commands (type pve and press Tab twice to list them).

> On Mar 10, 2022, at 2:19 PM, Stefan Radman <stefan.radman@me.com> wrote:
> 
> Hi Sto
> 
> Thanks for the suggestions.
> 
> The second option is what I was looking for.
> 
> How do I initiate “pve node maintenance mode”?
> 
> The “Node Maintenance” paragraph in the HA documentation is quite brief and does not refer to any command or GUI component.
> 
> Thank you
> 
> Stefan
> 
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PVE-User] Locking HA during UPS shutdown
  2022-03-10 12:48     ` admins
@ 2022-03-10 13:48       ` admins
  2022-03-10 14:24         ` Fabian Grünbichler
  0 siblings, 1 reply; 5+ messages in thread
From: admins @ 2022-03-10 13:48 UTC (permalink / raw)
  To: Proxmox VE user list; +Cc: Stefan Radman, PVE User List

That was actually really BAD ADVICE: when a node enters maintenance mode it will try to migrate its hosted VMs, and it eventually ends up in the same lock loop.
What you really need is to remove the started VMs from ha-manager, so that when the node initiates shutdown it first does a regular shutdown, VM by VM.

So, run something like the following as the first command in your NUT command sequence:

for a in $(ha-manager status | grep started | awk '{print $2}' | sed 's/vm://g'); do ha-manager remove "$a"; done



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PVE-User] Locking HA during UPS shutdown
  2022-03-10 13:48       ` admins
@ 2022-03-10 14:24         ` Fabian Grünbichler
  2022-03-10 16:07           ` M. Lyakhovsky
  0 siblings, 1 reply; 5+ messages in thread
From: Fabian Grünbichler @ 2022-03-10 14:24 UTC (permalink / raw)
  To: Proxmox VE user list

On March 10, 2022 2:48 pm, admins@telehouse.solutions wrote:
> That was actually a really BAD ADVICE…. as when node initiate maintenance mode it will try to migrate hosted vms … and eventually ends up in the same Lock loop..
> what you really need is to remove started vms from ha-manager, so when the node initiate shutdown it will do firstly do regular shutdown vm per vm.
> 
> So, do something like below as first command in your NUT command sequence:
> 
> for a in `ha-manager status | grep started|awk '{print $2}'|sed 's/vm://g'`; do ha-manager remove $a;done

what you should do is just change the policy to freeze or failover 
before triggering the shutdown. and once power comes back up and your 
cluster has booted, switch it back to migrate.

that way, the shutdown will just stop and freeze the resources, similar 
to what happens when rebooting using the default conditional policy.

note that editing datacenter.cfg (where the shutdown_policy is 
configured) is currently not exposed in any dedicated CLI tool, but you 
can update it using pvesh or the API.
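
a minimal sketch of that policy switch via pvesh, untested on a real cluster. it assumes the `ha` option of `/cluster/options` accepts a `shutdown_policy=<value>` property string, as the datacenter.cfg documentation describes; the function name `set_shutdown_policy` and the `PVESH` override variable are made up for illustration:

```shell
#!/bin/sh
# Sketch only: switch the cluster-wide HA shutdown policy through pvesh.
# set_shutdown_policy and PVESH are illustrative names, not part of PVE.
# PVESH can be overridden (e.g. PVESH=echo) to print the call instead of running it.

set_shutdown_policy() {
    policy="$1"
    # Reject anything that is not one of the documented policy values.
    case "$policy" in
        freeze|failover|migrate|conditional) ;;
        *)
            echo "unknown shutdown policy: $policy" >&2
            return 1
            ;;
    esac
    # Assumption: the "ha" property string of datacenter.cfg is writable
    # through the /cluster/options API path.
    "${PVESH:-pvesh}" set /cluster/options --ha "shutdown_policy=$policy"
}
```

the NUT shutdown script would call `set_shutdown_policy freeze` before triggering the shutdown, and something run after boot would call `set_shutdown_policy migrate` once the cluster is quorate again.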

there is still one issue though - if the whole cluster is shut down at 
the same time, at some point during the shutdown a non-quorate partition 
will be all that's left, and at that point certain actions won't work 
anymore and the node probably will get fenced. fixing this effectively 
would require some sort of conditional delay at the right point in the 
shutdown sequence that waits for all guests on all nodes(!) to stop 
before proceeding with stopping the PVE services and corosync (nodes 
still might get fenced if they take too long shutting down after the 
last guest has exited, but that shouldn't cause many issues other than 
noise). one way to do this would be for your NUT script to set a flag 
file in /etc/pve, and some systemd service with the right Wants/After 
settings that blocks the shutdown if the flag file exists and any guests 
are still running. probably requires some tinkering, but can be safely 
tested in a virtual cluster before moving to production ;)
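
a rough sketch of the waiting part, meant to be called from the stop action of such a service. everything named here is an assumption: the flag path `/etc/pve/ups-shutdown.flag`, the function names, the timeout values, and the idea that `pvesh get /cluster/resources --type vm` still gives usable answers while other nodes are going down. the systemd unit and its ordering are not shown:

```shell
#!/bin/sh
# Sketch only: block until no guest is running cluster-wide, or a timeout hits.
# FLAG_FILE is a hypothetical path that the NUT script would create.
FLAG_FILE="${FLAG_FILE:-/etc/pve/ups-shutdown.flag}"

# Print the guest list as JSON. Kept in its own function so it can be
# replaced when testing without a cluster.
fetch_guest_json() {
    pvesh get /cluster/resources --type vm --output-format json
}

# Count guests reported as running in the JSON read from stdin.
count_running_guests() {
    grep -o '"status" *: *"running"' | wc -l | tr -d ' '
}

# $1 = poll interval in seconds, $2 = maximum total wait in seconds.
# Returns 0 if the flag is absent or all guests stopped, 1 on timeout.
wait_for_guests() {
    interval="${1:-5}"
    max_wait="${2:-300}"
    waited=0
    while [ -e "$FLAG_FILE" ] && [ "$waited" -lt "$max_wait" ]; do
        running=$(fetch_guest_json | count_running_guests)
        if [ "$running" -eq 0 ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # Loop left without seeing zero guests: fine only if no flag was set.
    [ ! -e "$FLAG_FILE" ]
}
```

the timeout is there so a guest that refuses to stop cannot hold up the shutdown until the UPS runs empty; pick it to fit the remaining battery runtime.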

this last problem is not related to HA though (other than HA introducing 
another source of trouble courtesy of fencing being active) - you will 
also potentially hit it with your approach. the 'stop all guests on 
node' logic that PVE has on shutdown is for shutting down one node 
without affecting quorum, it doesn't work reliably for full-cluster 
shutdowns (you might not see problems if timing works out, but it's 
based on chance).

an alternative approach would be to request all HA resources to be stopped 
or disabled (`ha-manager set .. --state ..`), wait for that to be done 
cluster-wide (e.g. by polling /cluster/resources API path), and then 
trigger the shutdown. disadvantage of that is you have to remember the 
pre-shutdown state and restore that afterwards for each resource.
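
a sketch of the remember/stop/restore bookkeeping, under a few assumptions: the state is taken from `ha-manager status` lines of the form `service vm:100 (pve1, started)`, only resources seen as started are touched, and the names `STATE_FILE`, `HA_MANAGER`, `parse_ha_status`, `save_and_stop` and `restore_states` are invented for this example. polling until everything is actually stopped is left out:

```shell
#!/bin/sh
# Sketch only: remember which HA resources were started, request "stopped"
# for them, and request "started" again later. Names are illustrative.
STATE_FILE="${STATE_FILE:-/var/tmp/ha-states.saved}"   # hypothetical location
HA_MANAGER="${HA_MANAGER:-ha-manager}"                 # overridable for testing

# Turn "service vm:100 (pve1, started)" into "vm:100 started".
# Lines not starting with "service" (quorum, master, lrm) are skipped.
parse_ha_status() {
    awk '$1 == "service" { gsub(/[(),]/, "", $0); print $2, $4 }'
}

# Save the current states, then ask HA to stop every started resource.
save_and_stop() {
    "$HA_MANAGER" status | parse_ha_status > "$STATE_FILE"
    while read -r sid state; do
        if [ "$state" = "started" ]; then
            "$HA_MANAGER" set "$sid" --state stopped
        fi
    done < "$STATE_FILE"
}

# After power is back: request "started" for everything saved as started.
restore_states() {
    while read -r sid state; do
        if [ "$state" = "started" ]; then
            "$HA_MANAGER" set "$sid" --state started
        fi
    done < "$STATE_FILE"
}
```

the state file has to live somewhere that survives the reboot and is readable on the node doing the restore, which is why it is not placed under a tmpfs here.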

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_node_maintenance

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PVE-User] Locking HA during UPS shutdown
  2022-03-10 14:24         ` Fabian Grünbichler
@ 2022-03-10 16:07           ` M. Lyakhovsky
  0 siblings, 0 replies; 5+ messages in thread
From: M. Lyakhovsky @ 2022-03-10 16:07 UTC (permalink / raw)
  To: Proxmox VE user list

Hi, I asked a question earlier and no one answered: I am not able to
load the lv2 library, which I need for lvremove.
Also, can someone please tell me how to enable WiFi by putting a string in
/etc/network/devices?

On Thu, Mar 10, 2022 at 9:30 AM Fabian Grünbichler <
f.gruenbichler@proxmox.com> wrote:

> On March 10, 2022 2:48 pm, admins@telehouse.solutions wrote:
> > That was actually a really BAD ADVICE…. as when node initiate
> maintenance mode it will try to migrate hosted vms … and eventually ends up
> in the same Lock loop..
> > what you really need is to remove started vms from ha-manager, so when
> the node initiate shutdown it will do firstly do regular shutdown vm per vm.
> >
> > So, do something like below as first command in your NUT command
> sequence:
> >
> > for a in `ha-manager status | grep started|awk '{print $2}'|sed
> 's/vm://g'`; do ha-manager remove $a;done
>
> what you should do is just change the policy to freeze or fail-over
> before triggering the shutdown. and once power comes back up and your
> cluster has booted, switch it back to migrate.
>
> that way, the shutdown will just stop and freeze the resources, similar
> to what happens when rebooting using the default conditional policy.
>
> note that editing datacenter.cfg (where the shutdown_policy is
> configured) is currently not exposed in any CLI tool, but you can update
> it using pvesh or the API.
>
> there is still one issue though - if the whole cluster is shutdown at
> the same time, at some point during the shutdown a non-quorate partition
> will be all that's left, and at that point certain actions won't work
> anymore and the node probably will get fenced. fixing this effectively
> would require some sort of conditional delay at the right point in the
> shutdown sequence that waits for all guests on all nodes(!) to stop
> before proceeding with stopping the PVE services and corosync (nodes
> still might get fenced if they take too long shutting down after the
> last guest has exited, but that shouldn't cause much issues other than
> noise). one way to do this would be for your NUT script to set a flag
> file in /etc/pve, and some systemd service with the right Wants/After
> settings that blocks the shutdown if the flag file exists and any guests
> are still running. probably requires some tinkering, but can be safely
> tested in a virtual cluster before moving to production ;)
>
> this last problem is not related to HA though (other than HA introducing
> another source of trouble courtesy of fencing being active) - you will
> also potentially hit it with your approach. the 'stop all guests on
> node' logic that PVE has on shutdown is for shutting down one node
> without affecting quorum, it doesn't work reliably for full-cluster
> shutdowns (you might not see problems if timing works out, but it's
> based on chance).
>
> an alternative approach would be to request all HA resources to be stopped
> or disabled (`ha-manager set .. --state ..`), wait for that to be done
> cluster-wide (e.g. by polling /cluster/resources API path), and then
> trigger the shutdown. disadvantage of that is you have to remember the
> pre-shutdown state and restore that afterwards for each resource..
>
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_node_maintenance
>
> >> On Mar 10, 2022, at 2:48 PM, admins@telehouse.solutions wrote:
> >>
> >> I don’t remember, search into pvecm and pve[tab][tab] related commands
> man pages
> >>
> >>> On Mar 10, 2022, at 2:19 PM, Stefan Radman <stefan.radman@me.com>
> wrote:
> >>>
> >>> Hi Sto
> >>>
> >>> Thanks for the suggestions.
> >>>
> >>> The second option is what I was looking for.
> >>>
> >>> How do I initiate “pve node maintenance mode”?
> >>>
> >>> The “Node Maintenance” paragraph in the HA documentation is quite
> brief and does not refer to any command or GUI component.
> >>>
> >>> Thank you
> >>>
> >>> Stefan
> >>>
> >>>
> >>>> On Mar 10, 2022, at 14:50, admins@telehouse.solutions <mailto:
> admins@telehouse.solutions> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> here are two ideas: shutdown sequence -and- command sequence
> >>>> 1: shutdown sequence you may achieve when you set NUT’s on each node
> to only monitor the UPS power, then configure each node to shutdown itself
> on a different ups power levels, ex: node1 on 15% battery, node2 on 10%
> battery and so on
> >>>> 2: you can set a cmd sequence to firstly execute  pve node
> maintenance mode , and then execute shutdown -> this way HA will not try to
> migrate vm to node in maintenance, and the chance all nodes to goes into
> maintenance in exactly same second seems to be not a risk at all.
> >>>>
> >>>> hope thats helpful.
> >>>>
> >>>> Regards,
> >>>> Sto.
> >>>>
> >>>>> On Mar 10, 2022, at 1:10 PM, Stefan Radman via pve-user <
> pve-user@lists.proxmox.com <mailto:pve-user@lists.proxmox.com>> wrote:
> >>>>>
> >>>>>
> >>>>> From: Stefan Radman <stefan.radman@me.com <mailto:
> stefan.radman@me.com>>
> >>>>> Subject: Locking HA during UPS shutdown
> >>>>> Date: March 10, 2022 at 1:10:09 PM GMT+2
> >>>>> To: PVE User List <pve-user@pve.proxmox.com <mailto:
> pve-user@pve.proxmox.com>>
> >>>>>
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> I am configuring a 3 node PVE cluster with integrated Ceph storage.
> >>>>>
> >>>>> It is powered by 2 UPS that are monitored by NUT (Network UPS Tools).
> >>>>>
> >>>>> HA is configured with 3 groups:
> >>>>> group pve1 nodes pve1:1,pve2,pve3
> >>>>> group pve2 nodes pve1,pve2:1,pve3
> >>>>> group pve3 nodes pve1,pve2,pve3:1
> >>>>>
> >>>>> That will normally place the VMs in each group on the corresponding
> node, unless that node fails.
> >>>>>
> >>>>> The cluster is configured to migrate VMs away from a node before
> shutting it down (Cluster=>Options=>HA Settings: shutdown_policy=migrate).
> >>>>>
> >>>>> NUT is configured to shut down the servers once the last of the two
> UPS is running low on battery.
> >>>>>
> >>>>> My problem:
> >>>>> When NUT starts shutting down the 3 nodes, HA will first try to
> live-migrate the VMs to another node.
> >>>>> That live migration process gets stuck because all the nodes are
> shutting down simultaneously.
> >>>>> It seems that the whole process runs into a timeout, then “powers
> off” all the VMs and shuts down the nodes.
> >>>>>
> >>>>> My question:
> >>>>> Is there a way to “lock” or temporarily de-activate HA before
> shutting down a node to avoid that deadlock?
> >>>>>
> >>>>> Thank you
> >>>>>
> >>>>> Stefan
> >>>>>
> >>>>> _______________________________________________
> >>>>> pve-user mailing list
> >>>>> pve-user@lists.proxmox.com
> >>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
> >>>>
> >>>>
> >>>> Best Regards,
> >>>>
> >>>> Stoyan Stoyanov Sto | Solutions Manager
> >>>> | Telehouse.Solutions | ICT Department
> >>>> | phone/viber:  +359 894774934 <tel:+359 894774934>
> >>>> | telegram:  @prostoSto <
> https://mysignature.io/redirect/skype:prosto.sto?chat>
> >>>> | skype:  prosto.sto <
> https://mysignature.io/redirect/skype:prosto.sto?chat>
> >>>> | email:  sto@telehouse.solutions <mailto:sto@telehouse.solutions>
> >>>> | website: www.telehouse.solutions <https://mysig.io/MTRmMTg>
> >>>> | address: Telepoint #2, Sofia, Bulgaria
> >>>> <https://mysignature.io/editor/?utm_source=freepixel><356841.png>
> >>>>
> >>>> <https://mysig.io/ZDNkNWY>
> >>>> Save paper. Don’t print
> >>>>
> >>>
>
-- 
Do have a Blessed Day


end of thread, other threads:[~2022-03-10 16:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <mailman.34.1646911206.440.pve-user@lists.proxmox.com>
2022-03-10 11:50 ` [PVE-User] Locking HA during UPS shutdown admins
     [not found]   ` <AC07A433-7E37-420B-97E1-2314F97C022A@me.com>
2022-03-10 12:48     ` admins
2022-03-10 13:48       ` admins
2022-03-10 14:24         ` Fabian Grünbichler
2022-03-10 16:07           ` M. Lyakhovsky
