public inbox for pve-user@lists.proxmox.com
* Re: [PVE-User] BIG cluster questions
       [not found] <mailman.16.1624545042.464.pve-user@lists.proxmox.com>
@ 2021-06-25 17:33 ` Laurent Dumont
  2021-06-26 11:16 ` aderumier
  1 sibling, 0 replies; 4+ messages in thread
From: Laurent Dumont @ 2021-06-25 17:33 UTC (permalink / raw)
  To: Proxmox VE user list; +Cc: pve-user, Eneko Lacunza

This is anecdotal, but I have never seen a cluster that big. You might
want to inquire about professional support, which would give you a better
perspective for that kind of scale.

On Thu, Jun 24, 2021 at 10:30 AM Eneko Lacunza via pve-user <
pve-user@lists.proxmox.com> wrote:

>
>
>
> ---------- Forwarded message ----------
> From: Eneko Lacunza <elacunza@binovo.es>
> To: "pve-user@pve.proxmox.com" <pve-user@pve.proxmox.com>
> Cc:
> Bcc:
> Date: Thu, 24 Jun 2021 16:30:31 +0200
> Subject: BIG cluster questions
> Hi all,
>
> We're currently helping a customer configure a virtualization cluster
> with 88 servers for VDI.
>
> Right now we're testing the feasibility of building just one Proxmox
> cluster of 88 nodes. A 4-node cluster has also been configured for
> comparison (same servers and networking/racks).
>
> Nodes have two NICs with 2x25Gbps ports each. Currently there are two
> LACP bonds configured (one per NIC): one for storage (NFS v4.2) and the
> other for everything else (VMs, cluster).
>
> Cluster has two rings, one on each bond.
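>
> Each node was joined with both links specified, something along these
> lines (addresses are illustrative, not our actual ones):
>
>    # run on the joining node; the first IP is an existing cluster member
>    pvecm add 10.10.10.1 --link0 10.10.10.42 --link1 10.20.20.42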
>
> - With the clusters at rest (no significant number of VMs running), we
> see quite different average corosync/knet latencies on the 88-node
> cluster (~300-400) and the 4-node cluster (<100).
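>
> One way to sample these numbers on a node is the corosync stats map
> (exact key names may vary between versions):
>
>    corosync-cmapctl -m stats | grep latency_ave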
>
>
> For the 88-node cluster:
>
> - Creating some VMs (let's say 16), one every 30s, works well.
> - Destroying some VMs (let's say 16), one every 30s, outputs error
> messages (storage cfs lock related) and fails to remove some of the VMs.
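>
> The test is just a simple loop along these lines (VMIDs and options are
> illustrative, not our exact script):
>
>    for i in $(seq 9000 9015); do
>        qm create $i --name test-$i --memory 1024 --net0 virtio,bridge=vmbr0
>        sleep 30
>    done
>    # the destroy test is the same loop with "qm destroy $i" instead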
>
> - Rebooting 32 nodes, one every 30 seconds (boot time for a node is
> about 120s) so that quorum is never lost, creates a cluster traffic
> "flood". Some of the rebooted nodes don't rejoin the cluster, and the
> web UI shows all nodes in the cluster quorum with a grey "?" instead of
> a green OK. In this situation corosync latency on some nodes can
> skyrocket to tens or hundreds of times the values seen before the
> reboots. Access to pmxcfs is very slow and we have only been able to
> fix the issue by rebooting all nodes.
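>
> The rolling reboot itself is roughly this (hostnames are illustrative):
>
>    for i in $(seq -w 1 32); do
>        ssh pve$i reboot
>        sleep 30
>    done
>    # membership/quorum checked afterwards with: pvecm status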
>
> - We have tried changing the knet transport of one ring from UDP to
> SCTP, as reported here:
>
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
> This gives better latencies for corosync, but the reboot issue continues.
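>
> The change amounts to setting the transport per link in
> /etc/pve/corosync.conf (plus bumping config_version), something like:
>
>    totem {
>        # only the relevant part of the totem section is shown
>        interface {
>            linknumber: 1
>            knet_transport: sctp
>        }
>    }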
>
> We don't know whether both issues are related or not.
>
> Could LACP bonds be the issue?
>
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
> "
> If your switch support the LACP (IEEE 802.3ad) protocol then we
> recommend using the corresponding bonding mode (802.3ad). Otherwise you
> should generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces,
> then you have to use active-passive mode on the bonding interfaces,
> other modes are unsupported.
> "
> As per the second sentence, we understand that running the cluster
> network over an LACP bond is not supported (just to confirm our
> interpretation)? We're in the process of reconfiguring nodes/switches
> to test without a bond, to see whether that gives us a stable cluster
> (we will report on this). Do you think this could be the issue?
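>
> For reference, the docs-recommended active-backup variant would look
> roughly like this in /etc/network/interfaces (interface names and
> address are illustrative):
>
>    iface enp65s0f0 inet manual
>    iface enp65s0f1 inet manual
>
>    auto bond1
>    iface bond1 inet static
>        address 10.10.10.42/24
>        bond-slaves enp65s0f0 enp65s0f1
>        bond-mode active-backup
>        bond-primary enp65s0f0
>        bond-miimon 100
>        # corosync ring(s) here; the LACP bond stays for storage/VM traffic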
>
>
> Now for more general questions: do you think an 88-node Proxmox VE
> cluster is feasible?
>
> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able
> to manage them, or are they too many? (HA for those VMs doesn't seem to
> be a requirement right now.)
>
>
> Thanks a lot
> Eneko
>
>
> Eneko Lacunza
> CTO | Zuzendari teknikoa
> Binovo IT Human Project
>
> 943 569 206
> elacunza@binovo.es
> binovo.es
> Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun
>
> YouTube: https://www.youtube.com/user/CANALBINOVO/
> LinkedIn: https://www.linkedin.com/company/37269706/
>



* Re: [PVE-User] BIG cluster questions
       [not found] <mailman.16.1624545042.464.pve-user@lists.proxmox.com>
  2021-06-25 17:33 ` [PVE-User] BIG cluster questions Laurent Dumont
@ 2021-06-26 11:16 ` aderumier
  1 sibling, 0 replies; 4+ messages in thread
From: aderumier @ 2021-06-26 11:16 UTC (permalink / raw)
  To: Proxmox VE user list, pve-user

On Thursday, 24 June 2021 at 16:30 +0200, Eneko Lacunza via pve-user
wrote:
> Now for more general questions: do you think an 88-node Proxmox VE
> cluster is feasible?

Well, corosync is not really designed for that number of nodes (I think
the hardcoded limit is around 100), but in practice I have seen a lot of
users having problems with corosync communication starting at around 30
nodes.

(Maybe with low-latency switches + high-frequency CPUs it's possible to
keep latency low enough to get it working.)


> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able
> to manage them, or are they too many? (HA for those VMs doesn't seem to
> be a requirement right now.)

I have a cluster with 2500 VMs and it's working fine (a 15-node cluster
with around 200 VMs on each node).
I don't know about 15,000 VMs; maybe the main pve-crm loop will take more
time, and I'm not sure about the timeouts.


At work, I'm building 20-node clusters (one per rack) to avoid having
one big cluster.
Multi-cluster central management is not yet on the roadmap, but there are
some tools like
https://maas.io/
which allow you to manage multiple clusters in a central way.




* Re: [PVE-User] BIG cluster questions
       [not found] <a4c39bce-b416-1286-3374-fc73afa41125@binovo.es>
@ 2021-06-28  9:32 ` Thomas Lamprecht
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2021-06-28  9:32 UTC (permalink / raw)
  To: Eneko Lacunza, Proxmox VE user list

Hi,

On 24.06.21 16:30, Eneko Lacunza wrote:
> We're currently helping a customer configure a virtualization cluster with 88 servers for VDI.
> 
> Right now we're testing the feasibility of building just one Proxmox cluster of 88 nodes. A 4-node cluster has also been configured for comparison (same servers and networking/racks).
> 
> Nodes have two NICs with 2x25Gbps ports each. Currently there are two LACP bonds configured (one per NIC): one for storage (NFS v4.2) and the other for everything else (VMs, cluster).
> 
> Cluster has two rings, one on each bond.
> 
> - With the clusters at rest (no significant number of VMs running), we see quite different average corosync/knet latencies on the 88-node cluster (~300-400) and the 4-node cluster (<100).
> 
> 
> For the 88-node cluster:
> 
> - Creating some VMs (let's say 16), one every 30s, works well.
> - Destroying some VMs (let's say 16), one every 30s, outputs error messages (storage cfs lock related) and fails to remove some of the VMs.

Some storage operations are still cluster-locked on a not-so-fine-grained
basis; that could probably be improved. It mostly affects parallel
creations/removals though - once the guests are set up it should not be
such a big issue, as the frequency of removals and creations will
normally go down.

> 
> - Rebooting 32 nodes, one every 30 seconds (boot time for a node is about 120s) so that quorum is never lost, creates a cluster traffic "flood". Some of the rebooted nodes don't rejoin the cluster, and the web UI shows all nodes in the cluster quorum with a grey "?" instead of a green OK. In this situation corosync latency on some nodes can skyrocket to tens or hundreds of times the values seen before the reboots. Access to pmxcfs is very slow and we have only been able to fix the issue by rebooting all nodes.
> 
> - We have tried changing the knet transport of one ring from UDP to SCTP, as reported here:
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
> This gives better latencies for corosync, but the reboot issue continues.
> 
> We don't know whether both issues are related or not.
> 
> Could LACP bonds be the issue?
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
> "
> If your switch support the LACP (IEEE 802.3ad) protocol then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.
> "
> As per the second sentence, we understand that running the cluster network over an LACP bond is not supported (just to confirm our interpretation)? We're in the process of reconfiguring nodes/switches to test without a bond, to see whether that gives us a stable cluster (we will report on this). Do you think this could be the issue?
> 

We generally think that there's still something to improve on cluster
leave/join; those floods are not ideal and we have tackled a few issues
there already (so make sure the latest Proxmox VE 6.4 is in use), but
there are still some issues left, which sadly are really hard to debug
and reproduce nicely.

> Now for more general questions: do you think an 88-node Proxmox VE cluster is feasible?
> 

To be honest, currently I rather do not think so. The biggest cluster we
know of is 51 nodes, and they had quite some fine-tuning going on and
really high-end "golden" hardware to get there.


> Those 88 nodes will host about 14,000 VMs. Will the HA manager be able to manage them, or are they too many? (HA for those VMs doesn't seem to be a requirement right now.)


The HA stack has some limitations due to a physical maximum file size,
so currently it can track about 3500-4000 services; 14k would be too much
in that regard.

I'd advise splitting that cluster into 4 separate clusters, for example
2 x 21-node + 2 x 23-node clusters (odd-sized node counts are slightly
better regarding quorum); then you'd have roughly 3500 HA services per
cluster, which can be feasible.
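(2 x 21 + 2 x 23 = 88 nodes in total, and 14,000 / 4 = 3,500 HA services
per cluster, within the 3500-4000 the HA stack can currently track.)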

As there's semi-active work going on for cross-cluster migration, the
biggest drawback of splitting up into separate clusters will be gone
sooner or later.








* Re: [PVE-User] BIG cluster questions
@ 2021-06-26 12:59 JR Richardson
  0 siblings, 0 replies; 4+ messages in thread
From: JR Richardson @ 2021-06-26 12:59 UTC (permalink / raw)
  To: pve-user

That is a big cluster, I like it; hope it works out. You should separate
the corosync/heartbeat network onto its own physical Ethernet link. This
is probably where your latency is coming from. Even though you are using
25Gig NICs, if you push all your data/migration/heartbeat traffic across
one physical link, bonded or not, you can hit situations on a busy link
where your corosync traffic gets queued, even if only for a few
milliseconds, and this adds up across many nodes. Think about jumbo
frames as well: slam a NIC with 9000-byte packets for storage, and the
poor little heartbeat packets start queueing up in the waiting pool.

In the Proxmox design notes, it's highly recommended to separate all the
required networks onto their own physical NICs and switches as well.
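
As a rough sketch of what that looks like in /etc/network/interfaces
(interface names and addresses are just examples):

   auto eno1
   iface eno1 inet static
       address 10.20.0.11/24
       # dedicated corosync/heartbeat link, default 1500 MTU

   auto eno2
   iface eno2 inet static
       address 10.30.0.11/24
       mtu 9000
       # storage (NFS) link with jumbo frames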

Good luck.

JR Richardson
Engineering for the Masses
Chasing the Azeotrope
JRx DistillCo
1'st Place Brisket
1'st Place Chili

