From: Alexandre DERUMIER <aderumier@odiso.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
Date: Tue, 15 Sep 2020 16:57:46 +0200 (CEST)
Message-ID: <132388307.839866.1600181866529.JavaMail.zimbra@odiso.com>
In-Reply-To: <6b680921-12d0-006b-6d04-bbe1c4bb04f8@proxmox.com>
>>I mean this is bad, but also great!
>>Can you do a coredump of the whole thing and upload it somewhere with the version info
>>used (for dbgsym package)? That could help a lot.
I'll try to reproduce it again (with the full lock everywhere), and do the coredump.
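When I manage to catch it again, I plan to grab something like this (a sketch; assuming pmxcfs is the process to dump and pveversion gives enough for the dbgsym lookup):

    # full backtrace of all threads, then a core of the running pmxcfs
    gdb -p $(pidof pmxcfs) -batch -ex 'thread apply all bt full' > /tmp/pmxcfs-bt.txt
    gcore -o /tmp/pmxcfs-core $(pidof pmxcfs)
    # record the exact package versions for the dbgsym packages
    pveversion -v > /tmp/pve-versions.txt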
I have tried the real-time scheduling,
but I was still able to reproduce the "lrm too long" warning for 60s (though as I'm restarting corosync each minute, I think it's unlocking
something at the next corosync restart).
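For reference, the reproducer boils down to this loop on one node, while watching the HA daemons on the others (the exact log wording in the grep is an assumption):

    # on one node: restart corosync every minute
    while true; do systemctl restart corosync; sleep 60; done

    # on the other nodes: watch the HA daemons for the warning
    journalctl -f -u pve-ha-lrm -u pve-ha-crm | grep -i 'too long'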
This time it was blocked at the same time on one node in:
    sub work {
        ...
        } elsif ($state eq 'active') {
            ....
            $self->update_lrm_status();
and on another node in:
    if ($fence_request) {
        $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
        $self->set_local_status({ state => 'lost_agent_lock'});
    } elsif (!$self->get_protected_ha_agent_lock()) {
        $self->set_local_status({ state => 'lost_agent_lock'});
    } elsif ($self->{mode} eq 'maintenance') {
        $self->set_local_status({ state => 'maintenance'});
    }
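Both branches end up taking the agent lock through the cluster filesystem (the HA locks live under /etc/pve/priv/lock), so if pmxcfs is wedged the LRM blocks right there. A quick way to confirm that from a stuck node (assuming the pve-ha-lrm process name):

    # show the file-related syscalls the blocked lrm (and its workers) hang in
    strace -f -p $(pidof -s pve-ha-lrm) -e trace=file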
----- Original Message -----
From: "Thomas Lamprecht" <t.lamprecht@proxmox.com>
To: "aderumier" <aderumier@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, 15 September 2020 16:32:52
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown
On 9/15/20 4:09 PM, Alexandre DERUMIER wrote:
>>> Can you try to give pmxcfs real time scheduling, e.g., by doing:
>>>
>>> # systemctl edit pve-cluster
>>>
>>> And then add snippet:
>>>
>>>
>>> [Service]
>>> CPUSchedulingPolicy=rr
>>> CPUSchedulingPriority=99
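(one way to verify the override actually took effect after restarting the unit; chrt is from util-linux:

    systemctl restart pve-cluster
    chrt -p $(pidof pmxcfs)    # should now report SCHED_RR, priority 99
)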
> yes, sure, I'll do it now
>
>
>> I'm currently digging the logs
>>> Is your simplest / most stable reproducer still a periodic restart of corosync on one node?
> yes, a simple "systemctl restart corosync" on 1 node each minute
>
>
>
> After 1 hour, it's still locked.
>
> on the other nodes, I still have pmxcfs logs like:
>
I mean this is bad, but also great!
Can you do a coredump of the whole thing and upload it somewhere with the version info
used (for dbgsym package)? That could help a lot.
> manual "pmxcfs -d"
> https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e
>
Hmm, the fuse connection of the previous one got into a weird state (or something is still
running), but I'd rather say this is a side effect, not directly connected to the real bug.
>
> some interesting dmesg about "pvesr"
>
> [Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 seconds.
> [Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1
> [Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080
> [Tue Sep 15 14:45:34 2020] Call Trace:
> [Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0
> [Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180
> [Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0
> [Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0
> [Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40
> [Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180
> [Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110
> [Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20
> [Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190
> [Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
hmm, it hangs in mkdir (cluster-wide locking)
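If it triggers again, dumping the blocked tasks should show who holds the rwsem (assumes root and sysrq enabled):

    echo w > /proc/sysrq-trigger    # dump all D-state (blocked) tasks to dmesg
    cat /proc/19038/stack           # kernel stack of the hung pvesr task above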