From: Laurent Dumont <laurentfdumont@gmail.com>
Date: Fri, 25 Jun 2021 13:33:37 -0400
Message-ID: <CAOAKi8wnK6E97E5uDX7CH2zr1hTnonszmiTMMTkqK9-S36xcyg@mail.gmail.com>
In-Reply-To: <mailman.16.1624545042.464.pve-user@lists.proxmox.com>
References: <mailman.16.1624545042.464.pve-user@lists.proxmox.com>
To: Proxmox VE user list <pve-user@lists.proxmox.com>
Cc: "pve-user@pve.proxmox.com" <pve-user@pve.proxmox.com>, Eneko Lacunza <elacunza@binovo.es>
Subject: Re: [PVE-User] BIG cluster questions

This is anecdotal, but I have never seen a cluster that big. You might want to
inquire about professional support, which would give you a better perspective
at that kind of scale.

On Thu, Jun 24, 2021 at 10:30 AM Eneko Lacunza via pve-user
<pve-user@lists.proxmox.com> wrote:

> ---------- Forwarded message ----------
> From: Eneko Lacunza <elacunza@binovo.es>
> To: "pve-user@pve.proxmox.com" <pve-user@pve.proxmox.com>
> Date: Thu, 24 Jun 2021 16:30:31 +0200
> Subject: BIG cluster questions
>
> Hi all,
>
> We're currently helping a customer to configure a virtualization cluster
> with 88 servers for VDI.
>
> Right now we're testing the feasibility of building just one Proxmox
> cluster of 88 nodes. A 4-node cluster has also been configured for
> comparison (same servers and networking/racks).
>
> Nodes have 2 NICs 2x25Gbps each.
> Currently there are two LACP bonds configured (one for each NIC): one for
> storage (NFS v4.2) and the other for the rest (VMs, cluster).
>
> Cluster has two rings, one on each bond.
>
> - With clusters at rest (no significant number of VMs running), we see
>   quite a different corosync/knet latency average on our 88-node cluster
>   (~300-400) and our 4-node cluster (<100).
>
> For the 88-node cluster:
>
> - Creating some VMs (let's say 16), one every 30s, works well.
> - Destroying some VMs (let's say 16), one every 30s, outputs error
>   messages (storage cfs lock related) and fails to remove some of the VMs.
> - Rebooting 32 nodes, one every 30 seconds (boot for a node takes about
>   120s) so that quorum is not lost, creates a cluster traffic "flood".
>   Some of the rebooted nodes don't rejoin the cluster, and the web UI
>   shows all nodes in cluster quorum with a grey "?" instead of a green OK.
>   In this situation corosync latency on some nodes can skyrocket to tens
>   or hundreds of times the values before the reboots. Access to pmxcfs is
>   very slow, and we have been able to fix the issue only by rebooting all
>   nodes.
> - We have tried changing the knet transport in a ring from UDP to SCTP,
>   as reported here:
>   https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
>   That gives better latencies for corosync, but the reboot issue continues.
>
> We don't know whether both issues are related or not.
>
> Could LACP bonds be the issue?
>
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
> "
> If your switch support the LACP (IEEE 802.3ad) protocol then we
> recommend using the corresponding bonding mode (802.3ad). Otherwise you
> should generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces,
> then you have to use active-passive mode on the bonding interfaces,
> other modes are unsupported.
> "
> As per the second line, we understand that running cluster networking
> over an LACP bond is not supported (just to confirm our interpretation)?
> We're in the process of reconfiguring nodes/switches to test without a
> bond, to see if that gives us a stable cluster (will report on this). Do
> you think this could be the issue?
>
> Now for more general questions: do you think an 88-node Proxmox VE
> cluster is feasible?
>
> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able to
> manage them, or are they too many? (HA for those VMs doesn't seem to be
> a requirement right now.)
>
> Thanks a lot
> Eneko
>
> Eneko Lacunza
> CTO | Zuzendari teknikoa
> Binovo IT Human Project
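
For reference, here is a rough sketch of the arithmetic behind the figures
quoted above: how many votes an 88-node cluster needs for quorum, how many
nodes are down at once during the rolling reboot, and how many VMs each node
would carry. It assumes one vote per node with a strict-majority quorum, and
that a node is unavailable for roughly its 120s boot time (shutdown time is
ignored). The function names are only illustrative and are not part of any
Proxmox or corosync tooling.

```python
import math


def quorum_votes(nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    return nodes // 2 + 1


def max_nodes_down(reboot_interval_s: int, boot_time_s: int) -> int:
    """Approximate number of nodes down at once in a rolling reboot.

    Assumes a node is unavailable for boot_time_s after its reboot starts.
    """
    return math.ceil(boot_time_s / reboot_interval_s)


def vms_per_node(vms: int, nodes: int) -> float:
    """Average VM count per node if VMs are spread evenly."""
    return vms / nodes


NODES = 88
VMS = 14_000

needed = quorum_votes(NODES)
down = max_nodes_down(reboot_interval_s=30, boot_time_s=120)
online = NODES - down

print(f"votes needed for quorum: {needed} of {NODES}")
print(f"nodes down at once during rolling reboot: ~{down}")
print(f"nodes still online: {online} (quorum kept: {online >= needed})")
print(f"average VMs per node: {vms_per_node(VMS, NODES):.1f}")
print(f"average VMs per node with {down} nodes down: {vms_per_node(VMS, online):.1f}")
```

Under these assumptions quorum needs 45 of 88 votes and only about 4 nodes
are down at any moment, so the reboot procedure described above stays well
clear of losing quorum by vote count alone. That is consistent with the
problem being cluster traffic and latency rather than missing votes.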