From: Geoffray Levasseur <fatalerrors@geoffray-levasseur.org>
To: pve-user@lists.proxmox.com
Date: Sat, 18 Jun 2022 13:29:05 +0200
Message-ID: <3555362.dWV9SEqChM@isarog>
Subject: [PVE-User] Live migration fails from old nodes to a freshly installed one

Hi everyone,

For a couple of foundations, I run and manage four Proxmox VE nodes with
Ceph as the main storage for VMs. After changing the hardware of one of the
nodes, I had to reinstall it, since the new motherboard doesn't support
classic BIOS boot and I had to switch to UEFI. I was able to do that as
cleanly as possible (a rough sketch of the steps follows the list):

* Successful restoration of the SSH keys in /root and the node
configuration in /var/lib/pve-cluster
* Successful cleanup of the reinstalled node's former host keys in the
older nodes' configuration
* Successful restoration of Ceph (yet not simple at all; IMHO the
documentation lacks a clean procedure, unless I simply couldn't find it)
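
For reference, a minimal sketch of the restore I did. Treat it as an
outline rather than an exact transcript: "backup" is a placeholder host
holding copies of the old node's files, and the exact file set may differ
on your setup.

-------
# Sketch only: stop the cluster stack before touching the pmxcfs database,
# copy back root's SSH keys and the node configuration, then restart.
systemctl stop pve-cluster corosync
rsync -a backup:/root/.ssh/ /root/.ssh/
rsync -a backup:/var/lib/pve-cluster/ /var/lib/pve-cluster/
systemctl start corosync pve-cluster
pvecm updatecerts    # refresh node certs and the cluster-wide known_hosts
-------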

=> As far as I know, everything works as expected, except online migration
(with or without HA) to the reinstalled machine. Online migration from the
reinstalled node to the older nodes works.
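
Since taal is a fresh install, it may also be running newer packages than
the old nodes; a quick way to compare the two sides (a sketch; the hostname
is mine, and the process substitution assumes bash):

-------
# compare package versions between the origin node and taal
diff <(pveversion -v) <(ssh root@taal pveversion -v)
-------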

------- The failure happens with the following output:
task started by HA resource agent
2022-06-18 13:13:20 starting migration of VM 119 to node 'taal' (192.168.1.251)
2022-06-18 13:13:20 starting VM 119 on remote node 'taal'
2022-06-18 13:13:21 start remote tunnel
2022-06-18 13:13:22 ssh tunnel ver 1
2022-06-18 13:13:22 starting online/live migration on unix:/run/qemu-server/119.migrate
2022-06-18 13:13:22 set migration capabilities
2022-06-18 13:13:22 migration downtime limit: 100 ms
2022-06-18 13:13:22 migration cachesize: 64.0 MiB
2022-06-18 13:13:22 set migration parameters
2022-06-18 13:13:22 start migrate command to unix:/run/qemu-server/119.migrate
channel 2: open failed: connect failed: open failed

2022-06-18 13:13:23 migration status error: failed
2022-06-18 13:13:23 ERROR: online migrate failure - aborting
2022-06-18 13:13:23 aborting phase 2 - cleanup resources
2022-06-18 13:13:23 migrate_cancel
2022-06-18 13:13:25 ERROR: migration finished with problems (duration 00:00:06)
TASK ERROR: migration problems
-------
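
The "channel 2: open failed: connect failed: open failed" line is OpenSSH
on the source reporting that a forwarded channel could not be opened; in
this flow it should be the forward to the UNIX socket
/run/qemu-server/119.migrate on taal. A hand-made reproduction of that
forward might look like this (a sketch only, assuming OpenSSH's UNIX-socket
forwarding; the remote socket only exists while an incoming migration is
running on taal, and /tmp/mig-test.sock is an arbitrary local path):

-------
# Sketch: forward a local socket to the migration socket on taal while a
# migration attempt is in flight, then poke it with socat.
ssh -N -o HostKeyAlias=taal \
    -L /tmp/mig-test.sock:/run/qemu-server/119.migrate \
    root@192.168.1.251 &
socat - UNIX-CONNECT:/tmp/mig-test.sock   # fails if the forward is broken
-------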

------- On the syslog server I see this:
Jun 18 13:13:20 taal qm[4036852]: <root@pam> starting task UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam:
Jun 18 13:13:20 taal qm[4036853]: start VM 119: UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam:
Jun 18 13:13:20 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:20 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:20 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:20 taal systemd[1]: Started 119.scope.
Jun 18 13:13:21 taal systemd-udevd[4036885]: Using default interface naming scheme 'v247'.
Jun 18 13:13:21 taal systemd-udevd[4036885]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 18 13:13:21 taal kernel: [163848.311597] device tap119i0 entered promiscuous mode
Jun 18 13:13:21 taal kernel: [163848.318750] vmbr0: port 6(tap119i0) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.319250] vmbr0: port 6(tap119i0) entered disabled state
Jun 18 13:13:21 taal kernel: [163848.319797] vmbr0: port 6(tap119i0) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.320320] vmbr0: port 6(tap119i0) entered forwarding state
Jun 18 13:13:21 taal systemd-udevd[4036884]: Using default interface naming scheme 'v247'.
Jun 18 13:13:21 taal systemd-udevd[4036884]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun 18 13:13:21 taal kernel: [163848.711643] device tap119i1 entered promiscuous mode
Jun 18 13:13:21 taal kernel: [163848.718476] vmbr0: port 7(tap119i1) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.718951] vmbr0: port 7(tap119i1) entered disabled state
Jun 18 13:13:21 taal kernel: [163848.719477] vmbr0: port 7(tap119i1) entered blocking state
Jun 18 13:13:21 taal kernel: [163848.719982] vmbr0: port 7(tap119i1) entered forwarding state
Jun 18 13:13:21 taal qm[4036852]: <root@pam> end task UPID:taal:003D98F5:00FA02E3:62ADB350:qmstart:119:root@pam: OK
Jun 18 13:13:21 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:21 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:21 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:21 taal systemd[1]: session-343.scope: Succeeded.
Jun 18 13:13:22 taal systemd[1]: Started Session 344 of user root.
Jun 18 13:13:22 ragang QEMU[4364]: kvm: Unable to write to socket: Broken pipe
Jun 18 13:13:23 taal systemd[1]: Started Session 345 of user root.
Jun 18 13:13:24 taal qm[4036985]: <root@pam> starting task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:
Jun 18 13:13:24 taal qm[4036986]: stop VM 119: UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:
Jun 18 13:13:24 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:24 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:24 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:24 taal QEMU[4036862]: kvm: terminating on signal 15 from pid 4036986 (task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam:)
Jun 18 13:13:24 taal qm[4036985]: <root@pam> end task UPID:taal:003D997A:00FA042C:62ADB354:qmstop:119:root@pam: OK
Jun 18 13:13:24 pinatubo pmxcfs[1551671]: [status] notice: received log
Jun 18 13:13:24 mayon pmxcfs[1039713]: [status] notice: received log
Jun 18 13:13:24 ragang pmxcfs[1659329]: [status] notice: received log
Jun 18 13:13:24 taal systemd[1]: session-345.scope: Succeeded.
Jun 18 13:13:24 taal systemd[1]: session-344.scope: Succeeded.
Jun 18 13:13:24 taal kernel: [163851.231170] vmbr0: port 6(tap119i0) entered disabled state
Jun 18 13:13:24 taal kernel: [163851.459489] vmbr0: port 7(tap119i1) entered disabled state
Jun 18 13:13:24 taal qmeventd[1993]: read: Connection reset by peer
Jun 18 13:13:24 taal systemd[1]: 119.scope: Succeeded.
Jun 18 13:13:24 taal systemd[1]: 119.scope: Consumed 1.099s CPU time.
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: Task 'UPID:ragang:002BBBE4:2519FE90:62ADB34F:qmigrate:119:root@pam:' still active, waiting
Jun 18 13:13:25 taal systemd[1]: Started Session 346 of user root.
Jun 18 13:13:25 taal systemd[1]: session-346.scope: Succeeded.
Jun 18 13:13:25 ragang pve-ha-lrm[2866148]: migration problems
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: <root@pam> end task UPID:ragang:002BBBE4:2519FE90:62ADB34F:qmigrate:119:root@pam: migration problems
Jun 18 13:13:25 ragang pve-ha-lrm[2866146]: service vm:119 not moved (migration error)
-------

mayon, pinatubo and ragang are the old nodes that didn't change. The
reinstalled node is named taal. In these logs, ragang is the origin node
initiating the migration.

I suspect the LRM is in trouble, since its status is shown as active on all
nodes except taal, where it is idle. Restarting the pve-ha-lrm service does
not fix the issue.
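
For completeness, this is roughly what I tried (a sketch; both commands are
standard on PVE):

-------
# on taal: restart the local resource manager, then check HA state
systemctl restart pve-ha-lrm
systemctl status pve-ha-lrm
ha-manager status
-------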

Thank you for any help.
-- 
Geoffray Levasseur
System engineer, E-3S
        <geoffray.levasseur@e-3s.com>
        <fatalerrors@geoffray-levasseur.org>
        http://www.geoffray-levasseur.org
GnuPG public key: 2B3E 4116 769C 609F 0D17  07FD 5BA9 4CC9 E9D5 AC1B
Tu patere legem quam ipse fecisti.
