From: JR Richardson
Date: Tue, 18 Aug 2020 08:48:19 -0500
To: PVE User List
Subject: [PVE-User] 5.4 Seeing Corosync TOTEM Retransmits

Hi Folks,

I have a 13-node cluster on v5.4.13 that has been running fine in production, with separate networks for storage, data, heartbeat, and management. All interfaces are Linux bonds to Cisco 3750G LACP port channels, all Gigabit.

Last evening I started maintenance to update to the latest 5.4 release in preparation for upgrading to 6.2. The first 3 nodes went OK, as expected, no issues.
When I started to migrate some VMs back to a node I had just upgraded, the whole cluster crashed and all 13 nodes rebooted. After recovery, the network bonds on two nodes were blocking and several VMs were locked (migration). I was able to recover the cluster and all VMs OK. Overall the system was pretty resilient, and it didn't take too long to get everything restored.

This morning I started diagnosing what had happened and found a common log message on all nodes at roughly the same time as the all-node reboot:

Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: [TOTEM ] Retransmit List: 2d5 2d6 2d7 2d8 2e3 2e4 2e5 2e7 2e8 2ea 2eb 2ec 2ed 2ee 2f1 2f2
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: [TOTEM ] Retransmit List: 307 308 309 30a 30d 30e 30f 313 314 315 316 317 31b 31c 31d 31e
Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 34a 353 35b 35c 35d 35e 35f 360 364 365 366 368 369 36a 36b 36c
.........

From some reading, I understand that the live migration introduced some sort of latency on the heartbeat network. But since my networks are separate, is the VM memory transfer between nodes performed on the heartbeat network? Can I specify which network to use for migration, such as storage (jumbo frames enabled) or management, to relieve any congestion on the heartbeat network segment?

Another question is tuning: should I try to tune the corosync '' or '' settings, or just push through the upgrade to 6.2?

Any suggestions are welcome.

Thanks.

JR
-- 
JR Richardson
Engineering for the Masses
Chasing the Azeotrope
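
For anyone wanting to quantify retransmit bursts like the ones quoted above, here is a minimal sketch in Python. It assumes syslog-style lines in exactly the format shown in this message and assumes the tokens in each Retransmit List are hexadecimal sequence numbers. The names parse_retransmit_line and summarize are hypothetical helpers made up for illustration; they are not part of corosync or Proxmox.

```python
import re

# Matches lines such as:
#   Aug 17 20:36:07 pbxpve01 corosync[2184]: notice [TOTEM ] Retransmit List: 2d5 2d6
# The "notice" tag is optional because both forms appear in the quoted log.
RETRANSMIT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+corosync\[\d+\]:"
    r"\s+(?:notice\s+)?\[TOTEM\s*\]\s+Retransmit List:\s*(?P<seqs>[0-9a-f ]+?)\s*$"
)


def parse_retransmit_line(line):
    """Return (timestamp, host, [sequence numbers]) or None if the line does not match."""
    match = RETRANSMIT_RE.match(line.strip())
    if match is None:
        return None
    # Assumption: list entries are hexadecimal message sequence numbers.
    seqs = [int(token, 16) for token in match.group("seqs").split()]
    return match.group("ts"), match.group("host"), seqs


def summarize(lines):
    """Count distinct retransmitted sequence numbers per (timestamp, host).

    Using a set per key means the duplicated "notice"/plain log pair
    is not counted twice.
    """
    seen = {}
    for line in lines:
        parsed = parse_retransmit_line(line)
        if parsed is None:
            continue  # skip unrelated log lines
        ts, host, seqs = parsed
        seen.setdefault((ts, host), set()).update(seqs)
    return {key: len(values) for key, values in seen.items()}
```

Feeding summarize() the lines of a syslog file from each node gives a per-second count per host, which can then be lined up against the time a migration started. The one-second bucket is only a starting point, since it is simply the resolution of the syslog timestamp.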