From: Jan Vlach
Date: Tue, 8 Nov 2022 21:57:05 +0100
To: Proxmox VE user list
Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
Message-Id: <67ED424D-3B01-400A-B71E-40A19B9EE160@volny.cz>
References: <1378480941-319@kerio.tuxis.nl> <7827641E-40A1-4E5D-8EDF-4E37BA2BD5AB@volny.cz>

Hi Eneko,

Thank you a million for taking the time to re-test this! It really helps me understand what works and what doesn’t.

I had a glimpse of an idea to create a cluster with mixed CPUs of EPYC gen1 and EPYC gen3, but this really seems like a road to hell(tm). So I’ll keep the clusters homogeneous with the same gen of CPU. I have two sites, but fortunately, I can keep the clusters homogeneous (with one having “more power”).

Honestly, up until now, I thought I could abstract away from the version of the Linux kernel I’m running. Because, hey, it’s all KVM. I’m setting my VMs with CPU type host to have the benefit of accelerated AES and other instructions, but I have yet to see if EPYCv1 is compatible with EPYCv3. (v being gen) Thanks for teaching me a new trick or a thing to be aware of at least! (I remember this being an issue with VMware heterogeneous clusters (with CPUs of different generations), but I really thought KVM64 would help you abstract away from all this, KVM64 being a Pentium 4-era CPU.)

Do you use virtio drivers for storage and network card at all? Can you see a pattern there where the 3 Debian/Windows machines were not affected? Did they use virtio or not?

I really don’t see a reason why the migration back from 5.13 -> 5.19 should bring that 50/100% CPU load and hanging.
I’ve had some phantom load before with “Use tablet for pointer: Yes” enabled, but that was in the 5% ballpark per VM.

I’m just a fellow Proxmox admin/user. Hope this rings a bell or sparks interest in the core Proxmox team. I’ve had struggles with 5.15 before: GPU passthrough (wasn’t able to do this) and OpenBSD VMs taking minutes rather than tens of seconds to boot.

All in all, thanks for all the hints I could test before production, so it won’t hurt “down the road” …

JV

P.S. I’m trying to push my boss towards a commercial subscription for our clusters, but at this point I’m really not sure it would help ...

> On 8. 11. 2022, at 18:18, Eneko Lacunza via pve-user wrote:
>
>
> From: Eneko Lacunza
> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
> Date: 8 November 2022 18:18:44 CET
> To: pve-user@lists.proxmox.com
>
>
> Hi Jan,
>
> I had some time to re-test this.
>
> I tried live migration with KVM64 CPU between 2 nodes:
>
> node-ryzen1700 - kernel 5.19.7-1-pve
> node-ryzen5900x - kernel 5.19.7-1-pve
>
> I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
> This works OK in both directions.
>
> Then I downgraded a node to 5.13:
> node-ryzen1700 - kernel 5.19.7-1-pve
> node-ryzen5900x - kernel 5.13.19-6-pve
>
> Migration of those 9 VMs worked well from node-ryzen1700 -> node-ryzen5900x
>
> But migration of those 9 VMs back node-ryzen5900x -> node-ryzen1700 was a disaster: all 8 Debian VMs hung with 50/100% CPU use. Windows 2008r2 seems not affected by the issue at all.
>
> 3 other Debian/Windows VMs on node-ryzen1700 were not affected.
>
> After migrating both nodes to kernel 5.13:
>
> node-ryzen1700 - kernel 5.13.19-6-pve
> node-ryzen5900x - kernel 5.13.19-6-pve
>
> Migration of those 9 VMs node-ryzen5900x -> node-ryzen1700 works as intended :)
>
> Cheers
>
>
>
> On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
>> Hi Jan,
>>
>> Yes, there's no issue if CPUs are the same.
>>
>> VMs hang when CPUs are of different enough generations, even when they are of the same brand and use the KVM64 vCPU.
>>
>> On 7/11/22 at 22:59, Jan Vlach wrote:
>>> Hi,
>>>
>>> For what it’s worth, live VM migration with Linux VMs running various Debian versions works here just fine. I’m using virtio for networking and virtio scsi for disks. (The only version where I had problems was Debian 6, where the kernel does not support virtio scsi and megaraid sas 8708EM2 needs to be used. I get a kernel panic in mpt_sas on thaw after migration.)
>>>
>>> We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-Core Processors. These are Supermicros with the latest BIOS (latest microcode?) and BMC.
>>>
>>> Storage is a local ZFS pool, backed by SSDs in striped mirrors (4 devices on each node). Migration has dedicated 2x 10GigE LACP and a dedicated VLAN on the switch stack.
>>>
>>> I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.
>>>
>>> What does your cluster look like hardware-wise? What are the problems you experienced with VM migration on 5.13->5.19?
>>>
>>> Thanks,
>>> JV
>
> Eneko Lacunza
> Technical Director
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user