* Re: [PVE-User] VMs hung after live migration - Intel CPU
[not found] ` <mailman.219.1667828447.489.pve-user@lists.proxmox.com>
@ 2022-11-07 21:59 ` Jan Vlach
[not found] ` <mailman.242.1667896822.489.pve-user@lists.proxmox.com>
0 siblings, 1 reply; 3+ messages in thread
From: Jan Vlach @ 2022-11-07 21:59 UTC (permalink / raw)
To: Proxmox VE user list
Hi,
For what it’s worth, live VM migration of Linux VMs with various Debian versions works just fine here. I’m using virtio for networking and VirtIO SCSI for disks. (The only version where I had problems was Debian 6, where the kernel does not support VirtIO SCSI and a MegaRAID SAS 8708EM2 controller needs to be used instead; I get a kernel panic in mpt_sas on thaw after migration.)
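(If it helps with cross-checking: a quick way to see which disk controller and NIC model a given VM actually uses is to dump its config on the node; the vmid 100 below is just a placeholder.)

  # show the SCSI controller type, disks and NIC model for VM 100 (placeholder vmid)
  qm config 100 | grep -E 'scsihw|scsi[0-9]|virtio[0-9]|net[0-9]'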
We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-core processors. These are Supermicro boxes with the latest BIOS (latest microcode?) and BMC firmware.
Storage is a local ZFS pool on each node, backed by SSDs in striped mirrors (4 devices per node). Migration has a dedicated 2x 10GbE LACP bond and a dedicated VLAN on the switch stack.
I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.
What does your cluster look like hardware-wise? What problems did you experience with VM migration on 5.13 -> 5.19?
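For comparing setups, something like this from each node should be enough (standard tools plus pveversion, nothing exotic):

  uname -r                                # running kernel, e.g. 5.15.60-1-pve
  pveversion -v | head -n 5               # PVE and kernel package versions
  lscpu | grep -E 'Model name|Stepping'   # CPU model and stepping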
Thanks,
JV
> On 7. 11. 2022, at 14:40, Eneko Lacunza via pve-user <pve-user@lists.proxmox.com> wrote:
>
>
> From: Eneko Lacunza <elacunza@binovo.es>
> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
> Date: 7 November 2022 14:40:07 CET
> To: Mark Schouten <mark@tuxis.nl>, Proxmox VE user list <pve-user@lists.proxmox.com>
>
>
> Hi,
>
> Sadly I'm not sure what is best. For most of the clusters we admin, I have decided to stay on 5.13 (pinning that version with proxmox-boot-tool), because 5.19 seems likely to receive many more changes and so be less stable...
>
> Cheers
>
> On 7/11/22 at 13:56, Mark Schouten wrote:
>> Hi,
>>
>>
>> Thanks. What would you suggest? Downgrading to 5.13?
>>
>> --
>> Mark Schouten
>> CTO, Tuxis B.V. | https://www.tuxis.nl/
>> <mark@tuxis.nl> | +31 318 200208
>>
>>
>> From: Eneko Lacunza <elacunza@binovo.es>
>> To: Mark Schouten <mark@tuxis.nl>, Proxmox VE user list <pve-user@lists.proxmox.com>
>> Sent: 2022-11-07 9:23
>> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
>>
>> Hi,
>>
>> 5.15 has been a disaster for us; the issues seem to have no end.
>> Frankly, I don't understand how it can be the officially supported
>> kernel in PVE 7.2 right now.
>>
>> Our tests with 5.19 on a pair of nodes (in another cluster) seem
>> good, but I don't think 5.13 -> 5.19 migration works well either.
>> With neither kernel being the "official" one, I'm unable to decide
>> what to do with our clusters...
>>
>> This has been ongoing for some months... :-(
>>
>> I see 5.15.64 was promoted to the enterprise repo this weekend;
>> no idea whether any attempt to fix the live-migration issues is included...
>>
>> Thanks
>>
>> On 6/11/22 at 9:04, Mark Schouten wrote:
>>> Hi,
>>>
>>> I’ve seen the same behavior between two AMD CPUs with the -60 kernel. One of the VMs that ‘crashed’ even started working again after migrating it back.
>>>
>>> I’m probably going to 5.19; I’ve heard of other issues with 5.15 as well (CephFS client issues).
>>>
>>> Mark Schouten
>>>
>>>> On 3 Nov 2022, at 17:55, Eneko Lacunza via pve-user <pve-user@lists.proxmox.com> wrote:
>>>>
>>>>
>>>>
>>>>> _______________________________________________
>>>>> pve-user mailing list
>>>>> pve-user@lists.proxmox.com
>>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>>
>> Tel. +34 943 569 206 |https://www.binovo.es
>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
>>
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
* Re: [PVE-User] VMs hung after live migration - Intel CPU
[not found] ` <mailman.254.1667927935.489.pve-user@lists.proxmox.com>
@ 2022-11-08 20:57 ` Jan Vlach
2022-11-08 22:51 ` Kyle Schmitt
0 siblings, 1 reply; 3+ messages in thread
From: Jan Vlach @ 2022-11-08 20:57 UTC (permalink / raw)
To: Proxmox VE user list
Hi Eneko,
thank you a million for taking the time to re-test this! It really helps me understand what to expect to work and what not to. I had toyed with the idea of creating a cluster with mixed EPYC gen1 and EPYC gen3 CPUs, but that really seems like a road to hell(tm). So I’ll keep each cluster homogeneous, with the same generation of CPU. I have two sites, but fortunately I can keep both clusters homogeneous (with one having “more power”).
Honestly, up until now I thought I could abstract away the version of the Linux kernel I’m running, because, hey, it’s all KVM. I’m setting my VMs to CPU type host to get the benefit of accelerated AES and other instructions, but I have yet to see whether EPYC gen1 is compatible with EPYC gen3 that way. Thanks for teaching me a new trick, or at least a thing to be aware of! (I remember this being an issue with heterogeneous VMware clusters with CPUs of different generations, but I really thought KVM64 would abstract all this away, KVM64 being a Pentium 4-era CPU.)
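(For anyone following along, switching a VM between the two approaches is just the cpu property; a rough sketch with 100 as a placeholder vmid, and the change only takes effect after a full stop/start of the VM:)

  qm set 100 --cpu host    # expose the host CPU model, incl. AES-NI; ties the VM to similar hosts
  qm set 100 --cpu kvm64   # lowest-common-denominator vCPU, meant for mixed clusters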
Do you use virtio drivers for storage and the network card at all? Can you see a pattern in why the 3 Debian/Windows machines were not affected? Did they use virtio or not?
I really don’t see a reason why migrating back from 5.13 -> 5.19 should cause that 50/100% CPU load and hanging. I’ve seen some phantom load before with “Use tablet for pointer: Yes”, but that was in the 5% ballpark per VM.
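(In case anyone wants to rule the tablet device out: it can be disabled per VM, again with 100 as a placeholder vmid:)

  qm set 100 --tablet 0   # turn off the emulated USB tablet pointer for VM 100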
I’m just a fellow Proxmox admin/user; I hope this rings a bell or sparks interest in the core Proxmox team. I’ve had struggles with 5.15 before, with GPU passthrough (wasn’t able to get it working) and with OpenBSD VMs taking minutes to boot on 5.15 instead of tens of seconds.
All in all, thanks for all the hints I can test before production, so it won’t hurt “down the road” …
JV
P.S. I’m trying to push my boss towards a commercial subscription for our clusters, but at this point I’m really not sure it would help ...
> On 8. 11. 2022, at 18:18, Eneko Lacunza via pve-user <pve-user@lists.proxmox.com> wrote:
>
>
> From: Eneko Lacunza <elacunza@binovo.es>
> Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
> Date: 8 November 2022 18:18:44 CET
> To: pve-user@lists.proxmox.com
>
>
> Hi Jan,
>
> I had some time to re-test this.
>
> I tried live migration with KVM64 CPU between 2 nodes:
>
> node-ryzen1700 - kernel 5.19.7-1-pve
> node-ryzen5900x - kernel 5.19.7-1-pve
>
> I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
> This works OK in both directions.
>
> Then I downgraded a node to 5.13:
> node-ryzen1700 - kernel 5.19.7-1-pve
> node-ryzen5900x - kernel 5.13.19-6-pve
>
> Migration of those 9 VMs worked well from node-ryzen1700 -> node-ryzen5900x
>
> But migration of those 9 VMs back, node-ryzen5900x -> node-ryzen1700, was a disaster: all 8 Debian VMs hung with 50/100% CPU use. Windows 2008r2 seems not affected by the issue at all.
>
> 3 other Debian/Windows VMs on node-ryzen1700 were not affected.
>
> After migrating both nodes to kernel 5.13:
>
> node-ryzen1700 - kernel 5.13.19-6-pve
> node-ryzen5900x - kernel 5.13.19-6-pve
>
> Migration of those 9 VMs node-ryzen5900x -> node-ryzen1700 works as intended :)
>
> Cheers
>
>
>
> On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
>> Hi Jan,
>>
>> Yes, there's no issue if CPUs are the same.
>>
>> VMs hang when the CPUs are of different enough generations, even when they are the same brand and the KVM64 vCPU is used.
>>
>> On 7/11/22 at 22:59, Jan Vlach wrote:
>>> Hi,
>>>
>>> For what it’s worth, live VM migration of Linux VMs with various Debian versions works just fine here. I’m using virtio for networking and VirtIO SCSI for disks. (The only version where I had problems was Debian 6, where the kernel does not support VirtIO SCSI and a MegaRAID SAS 8708EM2 controller needs to be used instead; I get a kernel panic in mpt_sas on thaw after migration.)
>>>
>>> We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-core processors. These are Supermicro boxes with the latest BIOS (latest microcode?) and BMC firmware.
>>>
>>> Storage is a local ZFS pool on each node, backed by SSDs in striped mirrors (4 devices per node). Migration has a dedicated 2x 10GbE LACP bond and a dedicated VLAN on the switch stack.
>>>
>>> I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.
>>>
>>> What does your cluster look like hardware-wise? What problems did you experience with VM migration on 5.13 -> 5.19?
>>>
>>> Thanks,
>>> JV
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
* Re: [PVE-User] VMs hung after live migration - Intel CPU
2022-11-08 20:57 ` Jan Vlach
@ 2022-11-08 22:51 ` Kyle Schmitt
0 siblings, 0 replies; 3+ messages in thread
From: Kyle Schmitt @ 2022-11-08 22:51 UTC (permalink / raw)
To: Proxmox VE user list
It's been quite a long time since I've done it, but for what it's
worth, I never had problems live migrating KVM machines to hosts with
other processors, **as long as the VM wasn't launched using a
processor-specific extension**.
Get the exact options kvm is running with on both hosts, and compare.
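On Proxmox something like this should do it (120 is a placeholder vmid, and I'm assuming the usual qemu-server pidfile location):

  # dump the exact command line of the running kvm process for VM 120
  tr '\0' ' ' < /proc/$(cat /var/run/qemu-server/120.pid)/cmdline > /tmp/kvm-args-$(hostname).txt
  # or ask Proxmox what it would start the VM with
  qm showcmd 120 --pretty
  # then diff the output from both hosts, paying attention to the -cpu argument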
In OpenStack there's a tendency to auto-detect processor features and
launch with all of them available, so when I had a cluster of mixed
EPYC generations, I had to declare the features explicitly instead of
letting it autodetect (previous job, over a year ago, so the details
are sketchy). My guess is some auto-detection gone wrong here.
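The closest Proxmox analogue I can think of would be pinning the vCPU model and whitelisting individual flags instead of using type host; treat this as a sketch, since qm only accepts a limited set of flags (120 is again a placeholder vmid):

  # kvm64 baseline plus explicitly enabled AES-NI, instead of exposing the whole host CPU
  qm set 120 --cpu 'kvm64,flags=+aes'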
My home cluster is homogeneous cast-off R610s, otherwise I'd test this
myself. Sorry.
--Kyle
On Tue, Nov 8, 2022 at 2:57 PM Jan Vlach <janus@volny.cz> wrote:
>
> Hi Eneko,
>
> thank you a million for taking the time to re-test this! It really helps me understand what to expect to work and what not to. I had toyed with the idea of creating a cluster with mixed EPYC gen1 and EPYC gen3 CPUs, but that really seems like a road to hell(tm). So I’ll keep each cluster homogeneous, with the same generation of CPU. I have two sites, but fortunately I can keep both clusters homogeneous (with one having “more power”).
>
> Honestly, up until now I thought I could abstract away the version of the Linux kernel I’m running, because, hey, it’s all KVM. I’m setting my VMs to CPU type host to get the benefit of accelerated AES and other instructions, but I have yet to see whether EPYC gen1 is compatible with EPYC gen3 that way. Thanks for teaching me a new trick, or at least a thing to be aware of! (I remember this being an issue with heterogeneous VMware clusters with CPUs of different generations, but I really thought KVM64 would abstract all this away, KVM64 being a Pentium 4-era CPU.)
>
> Do you use virtio drivers for storage and the network card at all? Can you see a pattern in why the 3 Debian/Windows machines were not affected? Did they use virtio or not?
>
> I really don’t see a reason why migrating back from 5.13 -> 5.19 should cause that 50/100% CPU load and hanging. I’ve seen some phantom load before with “Use tablet for pointer: Yes”, but that was in the 5% ballpark per VM.
>
> I’m just a fellow Proxmox admin/user; I hope this rings a bell or sparks interest in the core Proxmox team. I’ve had struggles with 5.15 before, with GPU passthrough (wasn’t able to get it working) and with OpenBSD VMs taking minutes to boot on 5.15 instead of tens of seconds.
>
> All in all, thanks for all the hints I can test before production, so it won’t hurt “down the road” …
>
> JV
> P.S. I’m trying to push my boss towards a commercial subscription for our clusters, but at this point I’m really not sure it would help ...
>
>
> > On 8. 11. 2022, at 18:18, Eneko Lacunza via pve-user <pve-user@lists.proxmox.com> wrote:
> >
> >
> > From: Eneko Lacunza <elacunza@binovo.es>
> > Subject: Re: [PVE-User] VMs hung after live migration - Intel CPU
> > Date: 8 November 2022 18:18:44 CET
> > To: pve-user@lists.proxmox.com
> >
> >
> > Hi Jan,
> >
> > I had some time to re-test this.
> >
> > I tried live migration with KVM64 CPU between 2 nodes:
> >
> > node-ryzen1700 - kernel 5.19.7-1-pve
> > node-ryzen5900x - kernel 5.19.7-1-pve
> >
> > I bulk-migrated 9 VMs (8 Debian 9/10/11 and 1 Windows 2008r2).
> > This works OK in both directions.
> >
> > Then I downgraded a node to 5.13:
> > node-ryzen1700 - kernel 5.19.7-1-pve
> > node-ryzen5900x - kernel 5.13.19-6-pve
> >
> > Migration of those 9 VMs worked well from node-ryzen1700 -> node-ryzen5900x
> >
> > But migration of those 9 VMs back, node-ryzen5900x -> node-ryzen1700, was a disaster: all 8 Debian VMs hung with 50/100% CPU use. Windows 2008r2 seems not affected by the issue at all.
> >
> > 3 other Debian/Windows VMs on node-ryzen1700 were not affected.
> >
> > After migrating both nodes to kernel 5.13:
> >
> > node-ryzen1700 - kernel 5.13.19-6-pve
> > node-ryzen5900x - kernel 5.13.19-6-pve
> >
> > Migration of those 9 VMs node-ryzen5900x -> node-ryzen1700 works as intended :)
> >
> > Cheers
> >
> >
> >
> >> On 8/11/22 at 9:40, Eneko Lacunza via pve-user wrote:
> >> Hi Jan,
> >>
> >> Yes, there's no issue if CPUs are the same.
> >>
> >> VMs hang when the CPUs are of different enough generations, even when they are the same brand and the KVM64 vCPU is used.
> >>
> >> On 7/11/22 at 22:59, Jan Vlach wrote:
> >>> Hi,
> >>>
> >>> For what it’s worth, live VM migration of Linux VMs with various Debian versions works just fine here. I’m using virtio for networking and VirtIO SCSI for disks. (The only version where I had problems was Debian 6, where the kernel does not support VirtIO SCSI and a MegaRAID SAS 8708EM2 controller needs to be used instead; I get a kernel panic in mpt_sas on thaw after migration.)
> >>>
> >>> We're running 5.15.60-1-pve on a three-node cluster with AMD EPYC 7551P 32-core processors. These are Supermicro boxes with the latest BIOS (latest microcode?) and BMC firmware.
> >>>
> >>> Storage is a local ZFS pool on each node, backed by SSDs in striped mirrors (4 devices per node). Migration has a dedicated 2x 10GbE LACP bond and a dedicated VLAN on the switch stack.
> >>>
> >>> I have more nodes with EPYC3/Milan on the way, so I’ll test those later as well.
> >>>
> >>> What does your cluster look like hardware-wise? What problems did you experience with VM migration on 5.13 -> 5.19?
> >>>
> >>> Thanks,
> >>> JV
> >
> > Eneko Lacunza
> > Zuzendari teknikoa | Director técnico
> > Binovo IT Human Project
> >
> > Tel. +34 943 569 206 |https://www.binovo.es
> > Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
> >
> > https://www.youtube.com/user/CANALBINOVO
> > https://www.linkedin.com/company/37269706/
> >
> >
> > _______________________________________________
> > pve-user mailing list
> > pve-user@lists.proxmox.com
> > https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>
> _______________________________________________
> pve-user mailing list
> pve-user@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
end of thread, other threads:[~2022-11-08 22:51 UTC | newest]
Thread overview: 3+ messages
[not found] <1378480941-319@kerio.tuxis.nl>
[not found] ` <mailman.219.1667828447.489.pve-user@lists.proxmox.com>
2022-11-07 21:59 ` [PVE-User] VMs hung after live migration - Intel CPU Jan Vlach
[not found] ` <mailman.242.1667896822.489.pve-user@lists.proxmox.com>
[not found] ` <mailman.254.1667927935.489.pve-user@lists.proxmox.com>
2022-11-08 20:57 ` Jan Vlach
2022-11-08 22:51 ` Kyle Schmitt