Date: Mon, 28 Jun 2021 11:32:58 +0200
From: Thomas Lamprecht
To: Eneko Lacunza, Proxmox VE user list
Subject: Re: [PVE-User] BIG cluster questions

Hi,

On 24.06.21 16:30, Eneko Lacunza wrote:
> We're currently helping a customer to configure a virtualization cluster
> with 88 servers for VDI.
>
> Right now we're testing the feasibility of building just one Proxmox
> cluster of 88 nodes. A 4-node cluster has been configured too, for
> comparing the two (same servers and networking/racks).
>
> Nodes have 2 NICs with 2x25 Gbps ports each. Currently there are two LACP
> bonds configured (one per NIC): one for storage (NFS v4.2) and the other
> for the rest (VMs, cluster).
>
> The cluster has two rings, one on each bond.
>
> - With the clusters at rest (no significant number of VMs running), we see
>   quite a different corosync/knet latency average on our 88-node cluster
>   (~300-400) and our 4-node cluster (<100).
>
> For the 88-node cluster:
>
> - Creating some VMs (let's say 16), one every 30s, works well.
> - Destroying some VMs (let's say 16), one every 30s, outputs error
>   messages (storage cfs lock related) and fails to remove some of the VMs.

Some storage operations are still cluster-locked on a not-so-fine-grained
basis; that could probably be improved. It mostly affects parallel
creations/removals though - once the guests are set up it should not be
such a big issue, as the removal and creation frequency will normally go
down then.

> - Rebooting 32 nodes, one every 30 seconds (boot for a node is about 120s)
>   so that no quorum is lost, creates a cluster traffic "flood".
>   Some of the rebooted nodes don't rejoin the cluster, and the web UI
>   shows all nodes in cluster quorum with a grey "?" instead of a green OK.
>   In this situation corosync latency on some nodes can skyrocket to 10 or
>   100 times the values before the reboots. Access to pmxcfs is very slow,
>   and we have only been able to fix the issue by rebooting all nodes.
>
> - We have tried changing the transport of knet in a ring from UDP to SCTP,
>   as reported here:
>   https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
>   That gives better latencies for corosync, but the reboot issue continues.
>
> We don't know whether both issues are related or not.
>
> Could LACP bonds be the issue?
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
> "
> If your switch supports the LACP (IEEE 802.3ad) protocol then we recommend
> using the corresponding bonding mode (802.3ad). Otherwise you should
> generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces, then
> you have to use active-passive mode on the bonding interfaces, other modes
> are unsupported.
> "
> As per the second line, we understand that running cluster networking over
> a LACP bond is not supported (just to confirm our interpretation)? We're
> in the process of reconfiguring nodes/switches to test without a bond, to
> see if that gives us a stable cluster (will report on this). Do you think
> this could be the issue?

In general, we think there's still something that can be improved on cluster
leave/join; those floods are not ideal, and we have already tackled a few
issues there (so ensure the latest Proxmox VE 6.4 is in use), but there are
still some issues left, which sadly are really hard to debug and reproduce
nicely.

> Now for more general questions: do you think an 88-node Proxmox VE cluster
> is feasible?

To be honest, currently I rather do not think so. The biggest cluster we
know of is 51 nodes, and they had quite some fine-tuning going on and really
high-end "golden" HW to get there.

> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able to
> manage them, or are they too many? (HA for those VMs doesn't seem to be a
> requirement right now.)

The HA stack has some limitations due to a physical maximum file size, so
currently it can track about 3500 - 4000 services; 14k would be too much in
that regard.

I'd advise splitting that cluster into 4 separate clusters, for example
2 x 21-node + 2 x 23-node clusters (odd-sized node counts are slightly
better regarding quorum); then you'd have roughly 3500 HA services per
cluster, which can be feasible.

As there's semi-active work going on for cross-cluster migrations, the
biggest drawback one gets when splitting up into separate clusters would be
gone sooner or later.
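
Just to spell out the numbers behind the split suggestion:

    2 x 21 + 2 x 23 = 88 nodes in total
    14000 / 4       = 3500 VMs (and thus at most 3500 HA services) per cluster

so each cluster would sit right at the lower end of the ~3500 - 4000 service
limit mentioned above.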
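
Regarding the SCTP test: for reference, the knet transport is configured per
link in the totem section of corosync.conf. A minimal sketch of what that
looks like (edit /etc/pve/corosync.conf and bump config_version; the link
numbers naturally depend on your setup):

    totem {
      ...
      interface {
        linknumber: 0
        knet_transport: sctp
      }
      interface {
        linknumber: 1
        knet_transport: sctp
      }
      ...
    }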
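
And regarding testing without LACP for corosync: if you keep a bond at all
for the cluster links, the active-backup mode recommended in the admin guide
would look roughly like the following in /etc/network/interfaces (interface
names and the address are placeholders for this sketch):

    # dedicated corosync/cluster network on an active-backup bond
    auto bond1
    iface bond1 inet static
        bond-slaves eno1 eno2
        address 10.10.10.2/24
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

The corosync link would then use this bond's address instead of one on the
LACP bond.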