public inbox for pve-user@lists.proxmox.com
From: JR Richardson <jmr.richardson@gmail.com>
To: PVE User List <pve-user@pve.proxmox.com>
Subject: [PVE-User] 5.4 Seeing Corosync TOTEM Retransmits
Date: Tue, 18 Aug 2020 08:48:19 -0500
Message-ID: <CA+U74VPSXwbU721E7Tpr46t7sR5uEr_zF2-ax1tZh-EUc61ZFQ@mail.gmail.com>

Hi Folks,

I have a 13-node v5.4.13 cluster that has been running fine in
production, with separate networks for storage, data, heartbeat, and
management. All nodes use Linux bonded interfaces into Cisco 3750G
LACP port channels, all Gigabit. Last evening I started maintenance
to update to the latest 5.4 release in preparation for upgrading to
6.2.
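
For reference, the bonds are plain 802.3ad (LACP) Linux bonds; a
minimal sketch of one such stanza from /etc/network/interfaces
(interface names and the address below are placeholders, not the
actual config):

  auto bond1
  iface bond1 inet static
      address 10.10.10.11
      netmask 255.255.255.0
      bond-slaves eno3 eno4
      bond-mode 802.3ad
      bond-miimon 100
      bond-xmit-hash-policy layer2+3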

The first 3 nodes went OK as expected, no issues. When I started to
migrate some VMs back to a node I had just upgraded, the whole
cluster crashed and all 13 nodes rebooted. After recovery, two nodes'
network bonds were blocking and several VMs were locked (migration).
I was able to recover the cluster and all the VMs; overall the system
was pretty resilient and it didn't take too long to get everything
restored.

This morning I started diagnosing what had happened and found, at
the time of the all-node reboot, a common log message on every node
at roughly the same time:
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit
List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]:  [TOTEM ] Retransmit List:
2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit
List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]:  [TOTEM ] Retransmit List:
307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice  [TOTEM ] Retransmit
List: 34a 353 35b 35c 35d 35e 35f 360 364 365 366 368 369 36a 36b 36c
.........
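
For anyone wanting to reproduce the check, correlating those
retransmit lists with ring and multicast health seems to come down
to the standard tools, e.g. (node names below are placeholders):

  # ring status as corosync sees it, on each node
  corosync-cfgtool -s

  # cluster membership and quorum as Proxmox sees it
  pvecm status

  # multicast latency/loss test on the heartbeat network,
  # run simultaneously on all nodes (from the Proxmox docs)
  omping -c 10000 -i 0.001 -F -q node1 node2 node3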

I did some reading and I understand some heartbeat network latency
was introduced during the live migration event. But since my
networks are separate, is the VM memory transfer between nodes
performed over the heartbeat network? Can I specify which network to
use for migration, such as the storage network (jumbo frames
enabled) or the management network, to relieve any congestion on the
heartbeat segment?
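
What I have in mind is the migration setting in
/etc/pve/datacenter.cfg, assuming it works the way I think it does
on 5.4 (the subnet below is just a placeholder for my storage
network):

  # /etc/pve/datacenter.cfg
  # send live-migration traffic over the storage network
  migration: secure,network=10.10.20.0/24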

Another question is tuning: should I try tuning the corosync totem
settings, e.g. '<totem netmtu="1480"/>' or '<totem
window_size="170"/>', or just push through the upgrade to 6.2?
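
For clarity, since PVE keeps the corosync config in
/etc/pve/corosync.conf rather than XML, what I'd actually be editing
is the totem section, roughly like this (cluster name, bindnetaddr
and config_version below are placeholders; config_version has to be
bumped on every edit):

  totem {
    cluster_name: mycluster
    config_version: 15
    version: 2
    interface {
      ringnumber: 0
      bindnetaddr: 10.10.10.0
    }
    netmtu: 1480
    window_size: 170
  }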

Any suggestions are welcome.

Thanks.

JR
-- 
JR Richardson
Engineering for the Masses
Chasing the Azeotrope


