Subject: [PVE-User] Live migration fail from old nodes to freshly installed one
From: Geoffray Levasseur
Date: 2022-06-18 11:29 UTC
To: pve-user


Hi everyone,

For a couple of foundations, I run and manage 4 Proxmox VE nodes with Ceph as
the main storage for VMs. After changing the hardware of one of the nodes, I
had to reinstall it, since the new motherboard doesn't support legacy (BIOS)
boot and I had to switch to UEFI. I did that as cleanly as possible (a rough
sketch of the commands involved follows the list below):

* Successful restoration of the SSH keys in /root and of the node
configuration in /var/lib/pve-cluster
* Successful cleanup of the reinstalled node's former SSH host keys from the
configuration of the older nodes
* Successful restoration of Ceph (not simple at all; IMHO the documentation
lacks a clean procedure for this, unless I just couldn't find it)
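
For reference, the restore went roughly along these lines. This is a sketch
from memory, not a verbatim transcript: 'backuphost' stands for wherever the
pre-reinstall copy of taal lives, and the exact paths may differ slightly.

  # on taal, after the fresh install: restore root's SSH keys and the
  # pmxcfs backend saved from the old installation
  rsync -a backuphost:/root/.ssh/ /root/.ssh/
  systemctl stop pve-cluster corosync
  rsync -a backuphost:/var/lib/pve-cluster/ /var/lib/pve-cluster/
  systemctl start pve-cluster corosync

  # on the other nodes: drop taal's stale host keys from the cluster-wide
  # known_hosts and refresh the certificates/keys
  ssh-keygen -f /etc/pve/priv/known_hosts -R taal
  ssh-keygen -f /etc/pve/priv/known_hosts -R 192.168.1.251
  pvecm updatecerts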

=> As far as I can tell, everything works as expected, except online
migration (with or without HA) to the reinstalled machine. Online migration
from the reinstalled node to the older nodes works.
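
To be explicit, the direction that fails is the one triggered (by HA, or
manually) with something like:

  qm migrate 119 taal --online

while the same command run on taal towards any of the old nodes succeeds.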

------- The failing migration task produces the following output:
task started by HA resource agent
2022-06-18 13:13:20 starting migration of VM 119 to node 'taal' (192.168.1.251)
2022-06-18 13:13:20 starting VM 119 on remote node 'taal'
2022-06-18 13:13:21 start remote tunnel
2022-06-18 13:13:22 ssh tunnel ver 1
2022-06-18 13:13:22 starting online/live migration on unix:/run/qemu-server/119.migrate
2022-06-18 13:13:22 set migration capabilities
2022-06-18 13:13:22 migration downtime limit: 100 ms
2022-06-18 13:13:22 migration cachesize: 64.0 MiB
2022-06-18 13:13:22 set migration parameters
2022-06-18 13:13:22 start migrate command to unix:/run/qemu-server/119.migrate
channel 2: open failed: connect failed: open failed

2022-06-18 13:13:23 migration status error: failed
2022-06-18 13:13:23 ERROR: online migrate failure - aborting
2022-06-18 13:13:23 aborting phase 2 - cleanup resources
2022-06-18 13:13:23 migrate_cancel
2022-06-18 13:13:25 ERROR: migration finished with problems (duration 00:00:06)
TASK ERROR: migration problems
------
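
As far as I understand it, the 'channel 2: open failed' line comes from the
SSH tunnel that forwards the migration UNIX socket to the target node, i.e.
something roughly equivalent to this (simplified, not the exact command line
PVE builds):

  ssh -o BatchMode=yes \
      -L /run/qemu-server/119.migrate:/run/qemu-server/119.migrate \
      root@192.168.1.251

So it looks like the forward to /run/qemu-server/119.migrate on taal cannot
be opened, even though the VM itself was started there without error.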

------ On the syslog server I see this:
Jun 18 13:13:20 taal qm[4036852]: <root@pam> starting task UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam:
Jun 18 13:13:20 taal qm[4036853]: start VM 119: UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam:
Jun 18 13:13:20 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:20 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:20 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:20 taal systemd[1]: Started 119.scope.
Jun 18 13:13:21 taal systemd-udevd[4036885]: Using default interface naming scheme 'v247'.
Jun 18 13:13:21 taal systemd-udevd[4036885]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 18 13:13:21 taal kernel: [163848.311597] device tap119i0 entered promiscuous mode
Jun 18 13:13:21 taal kernel: [163848.318750] vmbr0: port 6(tap119i0) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.319250] vmbr0: port 6(tap119i0) entered disabled state
Jun 18 13:13:21 taal kernel: [163848.319797] vmbr0: port 6(tap119i0) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.320320] vmbr0: port 6(tap119i0) entered forwarding state
Jun 18 13:13:21 taal systemd-udevd[4036884]: Using default interface naming scheme 'v247'.
Jun 18 13:13:21 taal systemd-udevd[4036884]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 18 13:13:21 taal kernel: [163848.711643] device tap119i1 entered promiscuous mode
Jun 18 13:13:21 taal kernel: [163848.718476] vmbr0: port 7(tap119i1) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.718951] vmbr0: port 7(tap119i1) entered disabled state
Jun 18 13:13:21 taal kernel: [163848.719477] vmbr0: port 7(tap119i1) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.719982] vmbr0: port 7(tap119i1) entered forwarding state
Jun 18 13:13:21 taal qm[4036852]: <root@pam> end task UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam: OK
Jun 18 13:13:21 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:21 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:21 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:21 taal systemd[1]: session-343.scope: Succeeded.
Jun 18 13:13:22 taal systemd[1]: Started Session 344 of user root.
Jun 18 13:13:22 ragang QEMU[4364]: kvm: Unable to write to socket: Broken pipe
Jun 18 13:13:23 taal systemd[1]: Started Session 345 of user root.
Jun 18 13:13:24 taal qm[4036985]: <root@pam> starting task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:
Jun 18 13:13:24 taal qm[4036986]: stop VM 119: UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:
Jun 18 13:13:24 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:24 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:24 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:24 taal QEMU[4036862]: kvm: terminating on signal 15 from pid 4036986 (task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:)
Jun 18 13:13:24 taal qm[4036985]: <root@pam> end task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam: OK
Jun 18 13:13:24 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:24 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:24 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:24 taal systemd[1]: session-345.scope: Succeeded.
Jun 18 13:13:24 taal systemd[1]: session-344.scope: Succeeded.
Jun 18 13:13:24 taal kernel: [163851.231170] vmbr0: port 6(tap119i0) entered disabled state
Jun 18 13:13:24 taal kernel: [163851.459489] vmbr0: port 7(tap119i1) entered disabled state
Jun 18 13:13:24 taal qmeventd[1993]: read: Connection reset by peer
Jun 18 13:13:24 taal systemd[1]: 119.scope: Succeeded.
Jun 18 13:13:24 taal systemd[1]: 119.scope: Consumed 1.099s CPU time.
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: Task 'UPID:ragang:002BBBE4:2519FE90:62ADB34F:qmigrate:119:root@pam:' still active, waiting
Jun 18 13:13:25 taal systemd[1]: Started Session 346 of user root.
Jun 18 13:13:25 taal systemd[1]: session-346.scope: Succeeded.
Jun 18 13:13:25 ragang pve-ha-lrm[2866148]: migration problems
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: <root@pam> end task UPID:ragang:002BBBE4:2519FE90:62ADB34F:qmigrate:119:root@pam: migration problems
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: service vm:119 not moved (migration error)
------

mayon, pinatubo and ragang are the old nodes that didn't change; the
reinstalled node is named taal. In these logs, ragang is the origin node
initiating the migration.

I suspect the LRM is in trouble, since its status is shown as active on all
nodes except taal, where it is idle. Restarting the pve-ha-lrm service does
not fix the issue.
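
For reference, the status check and the restart amount to nothing more than
the standard commands:

  # LRM status per node: taal shows up as idle, the three others as active
  ha-manager status

  # restarting the local resource manager on taal does not change anything
  systemctl restart pve-ha-lrm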

Thank you for any help.
-- 
Geoffray Levasseur
Ingénieur système E-3S
        <geoffray.levasseur@e-3s.com>
        <fatalerrors@geoffray-levasseur.org>
        http://www.geoffray-levasseur.org
GNU/PG public key : 2B3E 4116 769C 609F 0D17  07FD 5BA9 4CC9 E9D5 AC1B
Tu patere legem quam ipse fecisti.

