From: Alexandre DERUMIER <aderumier@odiso.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Date: Thu, 10 Sep 2020 13:34:44 +0200 (CEST)
Message-ID: <1245358354.508169.1599737684557.JavaMail.zimbra@odiso.com>
In-Reply-To: <3ee5d9cf-19be-1067-3931-1c54f1c6043a@proxmox.com>
>>as said, if the other nodes were not using HA, the watchdog-mux had no
>>client which could expire.
Sorry, maybe I explained it badly,
but all my nodes had HA enabled.
I have double-checked the lrm_status JSON files from my morning backup, taken 2h before the problem,
and they were all in the "active" state ("state":"active","mode":"active").
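
(For reference, this is roughly how I checked them; just a quick sketch, the glob path below is only an example of my backup layout, adjust as needed:)

# print HA state/mode from the backed-up lrm_status files
# (the path is just an example of my backup layout)
import glob
import json

for path in sorted(glob.glob("/backup/2020-09-03/*/lrm_status")):
    with open(path) as f:
        status = json.load(f)
    print(path, status.get("state"), status.get("mode"))
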
I don't know why node7 didn't reboot; the only difference is that it was the CRM master.
(I think the CRM also resets the watchdog counter? Maybe its behaviour is different from the LRM's?)
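
(To check my own understanding of the watchdog-mux side, here is how I picture the client expiry; just a Python sketch, not the real C code, and the 60s client timeout / 10s update interval are assumptions on my side:)

# sketch of how I picture the watchdog-mux client expiry (illustrative, not the real code)
import time

CLIENT_WATCHDOG_TIMEOUT = 60   # assumed value
UPDATE_INTERVAL = 10           # assumed value

clients = {}   # connection -> time of the last keep-alive from lrm/crm

def on_client_update(conn):
    # called whenever a client (pve-ha-lrm / pve-ha-crm) sends its keep-alive
    clients[conn] = time.time()

def pet_hardware_watchdog():
    print("updating /dev/watchdog")   # stands in for the real write/ioctl

def mux_loop():
    while True:
        now = time.time()
        expired = [c for c, last in clients.items()
                   if now - last > CLIENT_WATCHDOG_TIMEOUT]
        if expired:
            # a registered client stopped updating: stop petting the
            # hardware watchdog, so the node gets reset shortly after
            print("client expired, stopping watchdog updates")
        else:
            # with no clients registered at all, nothing can ever expire
            # and the hardware watchdog keeps being updated
            pet_hardware_watchdog()
        time.sleep(UPDATE_INTERVAL)
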
>>above lines also indicate very high load.
>>Do you have some monitoring which shows the CPU/IO load before/during this event?
load (1, 5, 15) was 6 (for 48 cores), CPU usage 23%,
and no iowait on disk (VMs are on a remote Ceph cluster; only the Proxmox services run on the local SSD).
So nothing strange here :/
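
So my guess stays on the lrm/crm loop itself; to illustrate what I mean by the loop being stuck (see the "loop take too long" messages quoted below), a rough sketch of how I picture the daemon side, with assumed intervals:

# sketch of the lrm/crm-side loop as I picture it (illustrative only)
import time

LOOP_INTERVAL = 10        # assumed
WARN_THRESHOLD = 30       # assumed, just to mimic the "loop take too long" warning

def do_work():
    # manage services, take the cfs lock, etc.
    # if this blocks for 80-90 seconds, the keep-alive below is
    # simply not sent in time and the watchdog-mux client expires
    pass

def send_watchdog_keepalive():
    # keep-alive to watchdog-mux over its socket
    pass

def main_loop():
    while True:
        start = time.time()
        do_work()
        send_watchdog_keepalive()
        elapsed = time.time() - start
        if elapsed > WARN_THRESHOLD:
            print("loop take too long (%d seconds)" % int(elapsed))
        time.sleep(LOOP_INTERVAL)
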
----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "Alexandre Derumier" <aderumier@odiso.com>
Sent: Thursday, September 10, 2020 10:21:48
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On 10.09.20 06:58, Alexandre DERUMIER wrote:
> Thanks Thomas for the investigations.
>
> I'm still trying to reproduce...
> I think I have some special case here, because the forum user with 30 nodes also had a corosync cluster split. (Note that I hit this bug 6 months ago, also when shutting down a node, and the only way out was to fully stop corosync on all nodes and then start it again on all nodes.)
>
>
> But this time, the corosync logs look fine. (Every node correctly sees node2 go down, and sees the remaining nodes.)
>
> The surviving node7 was the only node with HA, and its LRM didn't have the watchdog enabled (I haven't found any log like "pve-ha-lrm: watchdog active" for the last 6 months on this node).
>
>
> So, the timing was:
>
> 10:39:05: "halt" command is sent to node2
> 10:39:16: node2 leaves corosync / halts -> every node sees it and correctly forms a new membership with the 13 remaining nodes
>
> ...I don't see any special logs (corosync, pmxcfs, pve-ha-crm, pve-ha-lrm) after node2 leaves.
> But there is still activity on the servers; pve-firewall is still logging, VMs are running fine.
>
>
> Between 10:40:25 and 10:40:34: the watchdog resets the nodes, but not node7.
>
> -> so roughly 70-80s after node2 went down; I think watchdog-mux was still running fine until then.
> (That sounds like the LRM was stuck and client_watchdog_timeout expired in watchdog-mux.)
as said, if the other nodes were not using HA, the watchdog-mux had no
client which could expire.
>
> 10:40:41: node7 loses quorum (as all the other nodes have reset),
> 10:40:50: node7's crm/lrm finally log.
>
> Sep 3 10:40:50 m6kvm7 pve-ha-crm[16196]: got unexpected error - error during cfs-locked 'domain-ha' operation: no quorum!
> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
above lines also indicate very high load.
Do you have some monitoring which shows the CPU/IO load before/during this event?
> Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
> Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied
>
>
>
> So, I really think that something stalled the lrm/crm loop, and the watchdog was not reset because of that.
>