public inbox for pve-user@lists.proxmox.com
* [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
From: Jan Vlach @ 2023-10-20  9:52 UTC (permalink / raw)
  To: Proxmox VE user list

Hello proxmox-users,

I have a Proxmox cluster running PVE 7 with local ZFS pools (striped mirrors), fully patched from the no-subscription repo, and I’m now rebooting the nodes into a new kernel (5.15.60 -> 5.15.126).

Migration network is dedicated 2x 10 GigE LACP interface on every node.

These are dual-socket Supermicro boxes, each with 2x AMD EPYC 7281 16-Core Processor. Microcode is already 0x800126e everywhere.

The VMs are Cisco Ironport appliances running FreeBSD (no qemu-guest-agent; it is disabled in the VM settings). For some of them, live migration fails while transferring the contents of RAM. The job cleans up the remote zvol, but it also kills the source VM.

A couple of weeks ago, I migrated at least 24 Ironport VMs without a hiccup.

What’s going on here? Where else can I look? The log is below; I snipped most of the 500G disk transfer, which showed no errors, just time and percent going up.

On a tangent: on bulk migrate, the first VM in the batch complains that port 60001 is already in use and the job can’t bind, so the first VM gets skipped. Probably unrelated, since it is a different error.

Thank you for the cluestick,
JV

2023-10-20 11:10:47 use dedicated network address for sending migration traffic (10.30.24.20)
2023-10-20 11:10:47 starting migration of VM 148 to node 'prox-node7' (10.30.24.20)
2023-10-20 11:10:47 found local disk 'local-zfs:vm-148-disk-0' (in current VM config)
2023-10-20 11:10:47 starting VM 148 on remote node 'prox-node7'
2023-10-20 11:10:52 volume 'local-zfs:vm-148-disk-0' is 'local-zfs:vm-148-disk-0' on the target
2023-10-20 11:10:52 start remote tunnel
2023-10-20 11:10:53 ssh tunnel ver 1
2023-10-20 11:10:53 starting storage migration
2023-10-20 11:10:53 scsi0: start migration to nbd:10.30.24.20:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 477.0 MiB of 500.0 GiB (0.09%) in 14m 17s
drive-scsi0: transferred 1.1 GiB of 500.0 GiB (0.22%) in 14m 18s
drive-scsi0: transferred 1.7 GiB of 500.0 GiB (0.34%) in 14m 19s
drive-scsi0: transferred 2.3 GiB of 500.0 GiB (0.46%) in 14m 20s
drive-scsi0: transferred 2.9 GiB of 500.0 GiB (0.57%) in 14m 21s
drive-scsi0: transferred 3.5 GiB of 500.0 GiB (0.69%) in 14m 22s
drive-scsi0: transferred 4.0 GiB of 500.0 GiB (0.81%) in 14m 23s
drive-scsi0: transferred 4.6 GiB of 500.0 GiB (0.92%) in 14m 24s
drive-scsi0: transferred 5.0 GiB of 500.0 GiB (1.01%) in 14m 25s
drive-scsi0: transferred 5.5 GiB of 500.0 GiB (1.10%) in 14m 26s
drive-scsi0: transferred 6.0 GiB of 500.0 GiB (1.20%) in 14m 27s
drive-scsi0: transferred 6.5 GiB of 500.0 GiB (1.30%) in 14m 28s
drive-scsi0: transferred 6.9 GiB of 500.0 GiB (1.38%) in 14m 29s
drive-scsi0: transferred 7.4 GiB of 500.0 GiB (1.48%) in 14m 30s
drive-scsi0: transferred 7.8 GiB of 500.0 GiB (1.56%) in 14m 31s
drive-scsi0: transferred 8.2 GiB of 500.0 GiB (1.65%) in 14m 32s
… snipped to keep it sane, no errors here …
drive-scsi0: transferred 500.2 GiB of 500.8 GiB (99.87%) in 28m 32s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.94%) in 28m 33s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.95%) in 28m 34s
drive-scsi0: transferred 500.6 GiB of 500.8 GiB (99.96%) in 28m 35s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 36s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 37s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.98%) in 28m 38s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (99.99%) in 28m 39s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 40s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 41s, ready
all 'mirror' jobs are ready
2023-10-20 11:39:34 starting online/live migration on tcp:10.30.24.20:60000
2023-10-20 11:39:34 set migration capabilities
2023-10-20 11:39:34 migration downtime limit: 100 ms
2023-10-20 11:39:34 migration cachesize: 1.0 GiB
2023-10-20 11:39:34 set migration parameters
2023-10-20 11:39:34 start migrate command to tcp:10.30.24.20:60000
2023-10-20 11:39:35 migration active, transferred 615.9 MiB of 8.0 GiB VM-state, 537.5 MiB/s
2023-10-20 11:39:36 migration active, transferred 1.1 GiB of 8.0 GiB VM-state, 812.6 MiB/s
2023-10-20 11:39:37 migration active, transferred 1.6 GiB of 8.0 GiB VM-state, 440.5 MiB/s
2023-10-20 11:39:38 migration active, transferred 2.1 GiB of 8.0 GiB VM-state, 495.3 MiB/s
2023-10-20 11:39:39 migration active, transferred 2.5 GiB of 8.0 GiB VM-state, 250.1 MiB/s
2023-10-20 11:39:40 migration active, transferred 2.9 GiB of 8.0 GiB VM-state, 490.4 MiB/s
2023-10-20 11:39:41 migration active, transferred 3.4 GiB of 8.0 GiB VM-state, 514.4 MiB/s
2023-10-20 11:39:42 migration active, transferred 3.9 GiB of 8.0 GiB VM-state, 485.9 MiB/s
2023-10-20 11:39:43 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 488.2 MiB/s
2023-10-20 11:39:44 migration active, transferred 4.8 GiB of 8.0 GiB VM-state, 738.3 MiB/s
2023-10-20 11:39:45 migration active, transferred 5.6 GiB of 8.0 GiB VM-state, 730.8 MiB/s
2023-10-20 11:39:46 migration active, transferred 6.2 GiB of 8.0 GiB VM-state, 492.9 MiB/s
2023-10-20 11:39:47 migration active, transferred 6.7 GiB of 8.0 GiB VM-state, 471.5 MiB/s
2023-10-20 11:39:48 migration active, transferred 7.1 GiB of 8.0 GiB VM-state, 469.4 MiB/s
2023-10-20 11:39:49 migration active, transferred 7.9 GiB of 8.0 GiB VM-state, 666.7 MiB/s
2023-10-20 11:39:50 migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s
2023-10-20 11:39:51 migration active, transferred 9.4 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2023-10-20 11:39:51 xbzrle: send updates to 33286 pages in 23.2 MiB encoded memory, cache-miss 96.68%, overflow 5045
2023-10-20 11:39:52 auto-increased downtime to continue migration: 200 ms
2023-10-20 11:39:53 migration active, transferred 9.9 GiB of 8.0 GiB VM-state, 1.1 GiB/s
2023-10-20 11:39:53 xbzrle: send updates to 177238 pages in 60.2 MiB encoded memory, cache-miss 73.74%, overflow 9766
query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection

2023-10-20 11:39:54 query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 148 not running

2023-10-20 11:39:55 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:56 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:57 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:58 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running

2023-10-20 11:39:59 query migrate failed: VM 148 not running
2023-10-20 11:39:59 ERROR: online migrate failure - too many query migrate failures - aborting
2023-10-20 11:39:59 aborting phase 2 - cleanup resources
2023-10-20 11:39:59 migrate_cancel
2023-10-20 11:39:59 migrate_cancel error: VM 148 not running
2023-10-20 11:39:59 ERROR: query-status error: VM 148 not running
drive-scsi0: Cancelling block job
2023-10-20 11:39:59 ERROR: VM 148 not running
2023-10-20 11:40:04 ERROR: migration finished with problems (duration 00:29:17)
TASK ERROR: migration problems
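
One detail worth pulling out of the log above: the RAM phase reports more data sent than the 8.0 GiB VM-state (9.9 GiB by 11:39:53), which is what pre-copy migration looks like when the guest keeps dirtying pages that then have to be sent again. The following is only a rough sketch for spotting that point in a saved task log; the function names and the two-line sample are illustrative, and the regular expression assumes the unwrapped "migration active, transferred ..." line format shown above.

```python
import re

# Matches progress lines such as:
# "... migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s"
PROGRESS = re.compile(
    r"migration active, transferred ([\d.]+) (MiB|GiB) of ([\d.]+) (MiB|GiB) VM-state"
)
UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def ram_progress(log_text):
    """Yield (transferred_mib, total_mib) for every RAM progress line."""
    for m in PROGRESS.finditer(log_text):
        sent = float(m.group(1)) * UNIT_TO_MIB[m.group(2)]
        total = float(m.group(3)) * UNIT_TO_MIB[m.group(4)]
        yield sent, total


def first_overshoot(log_text):
    """Return the first (sent, total) pair where more than the VM's RAM
    size had already been transferred, or None if that never happens."""
    for sent, total in ram_progress(log_text):
        if sent > total:
            return sent, total
    return None


# Two lines copied from the log above, used as sample input.
sample = """\
2023-10-20 11:39:49 migration active, transferred 7.9 GiB of 8.0 GiB VM-state, 666.7 MiB/s
2023-10-20 11:39:50 migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s
"""
```

An overshoot by itself is normal for a busy guest and does not explain why QEMU exited on the source; it only marks where the migration switched from the first full pass to re-sending dirty pages.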




* Re: [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
From: DERUMIER, Alexandre @ 2023-10-20 13:56 UTC (permalink / raw)
  To: pve-user

Hi,

What is the CPU model of the VMs?



* Re: [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
From: Jan Vlach @ 2023-10-22 13:00 UTC (permalink / raw)
  To: Proxmox VE user list

Hi,

I’m using “host” there, since both the CPU and the hardware platform (Supermicro H11DSU-iN) are the same on every node.

JV




* Re: [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
From: DERUMIER, Alexandre @ 2023-10-23  7:54 UTC (permalink / raw)
  To: pve-user

You should really avoid using "host" for live migration and use the
matching QEMU model instead (EPYC-v1, for example).
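
To check which VMs are still set to "host", something like the sketch below could be run over the VM config files (on Proxmox these live under /etc/pve/qemu-server/<vmid>.conf). It is an illustration only: the function name and the sample config text are made up, and it assumes the usual `cpu: [cputype=]<model>[,flags=...]` line format.

```python
def cpu_type(config_text):
    """Return the CPU model named in a VM config, or None if the live
    (non-snapshot) part of the config has no 'cpu:' line."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Snapshot sections follow the live config; stop before them.
            break
        if line.startswith("cpu:"):
            value = line.split(":", 1)[1].strip()
            # The value may be "host", "EPYC" or "cputype=host,flags=+aes".
            first = value.split(",", 1)[0]
            if first.startswith("cputype="):
                return first.split("=", 1)[1]
            return first
    return None


# Hypothetical config contents, for illustration only.
example = "cores: 16\ncpu: host\nmemory: 8192\n"
needs_change = cpu_type(example) == "host"
```

Changing the model would then be along the lines of `qm set 148 --cpu EPYC`; as far as I know the new CPU type only applies after the VM has been fully stopped and started again, so check that against your setup first.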


Are you sure that both CPUs have exactly the same microcode version?
(Same Supermicro BIOS version, or the same amd-microcode package if you
have installed it.)




* Re: [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
From: Jan Vlach @ 2023-11-20 15:21 UTC (permalink / raw)
  To: Proxmox VE user list

Hello,

for what it's worth,

I've upgraded all the nodes to PVE 8 and no longer have problems migrating the Ironport VMs.

JV




