From: Jan Vlach
Date: Fri, 20 Oct 2023 11:52:45 +0200
To: Proxmox VE user list
Subject: [PVE-User] Failed live migration on Supermicro with EPYCv1 - what's going on here?
Message-Id: <82D01A63-40EC-4203-8313-0CF9BF58492D@volny.cz>

Hello proxmox-users,

I have a Proxmox cluster running PVE 7 with a local ZFS pool (striped mirrors), fully patched (no-subscription repo), and I'm now rebooting the nodes into new kernels (5.15.60 -> 5.15.126).

The migration network is a dedicated 2x 10 GigE LACP interface on every node.

These are dual-socket Supermicro boxes with 2x AMD EPYC 7281 16-Core Processor. Microcode is already 0x800126e everywhere.

The VMs are Cisco Ironport appliances running FreeBSD (no qemu-guest-agent, disabled in settings). For some of them, the live migration fails while transferring the contents of RAM. The job cleans up the remote zvol, but kills the source VM.

A couple of weeks ago, I migrated at least 24 Ironport VMs without a hiccup.

What's going on here? Where else can I look? The log is below; I snipped most of the 500G disk transfer. There were no errors in that part, just time and percent going up.

On a tangent - on bulk migrate, the first VM in the batch complains that port 60001 is already in use and the job can't bind, so the first VM gets skipped. Probably unrelated, different error.
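For the port 60001 tangent, a minimal sketch of checking whether a TCP port can still be bound on a node. The helper name `port_is_free` is hypothetical and this is not how PVE itself picks migration ports; it only tries a plain bind and reports whether the address is already in use.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration only. It must run on the node
    that owns `host`; any error other than "address in use" is re-raised.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # 60001 is the port named in the bulk-migrate complaint above.
    print("port 60001 free:", port_is_free("0.0.0.0", 60001))
```

Running this on the target node just before a bulk migrate would at least show whether something is still holding the port from an earlier job.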
Thank you for cluestick,
JV

2023-10-20 11:10:47 use dedicated network address for sending migration traffic (10.30.24.20)
2023-10-20 11:10:47 starting migration of VM 148 to node 'prox-node7' (10.30.24.20)
2023-10-20 11:10:47 found local disk 'local-zfs:vm-148-disk-0' (in current VM config)
2023-10-20 11:10:47 starting VM 148 on remote node 'prox-node7'
2023-10-20 11:10:52 volume 'local-zfs:vm-148-disk-0' is 'local-zfs:vm-148-disk-0' on the target
2023-10-20 11:10:52 start remote tunnel
2023-10-20 11:10:53 ssh tunnel ver 1
2023-10-20 11:10:53 starting storage migration
2023-10-20 11:10:53 scsi0: start migration to nbd:10.30.24.20:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 477.0 MiB of 500.0 GiB (0.09%) in 14m 17s
drive-scsi0: transferred 1.1 GiB of 500.0 GiB (0.22%) in 14m 18s
drive-scsi0: transferred 1.7 GiB of 500.0 GiB (0.34%) in 14m 19s
drive-scsi0: transferred 2.3 GiB of 500.0 GiB (0.46%) in 14m 20s
drive-scsi0: transferred 2.9 GiB of 500.0 GiB (0.57%) in 14m 21s
drive-scsi0: transferred 3.5 GiB of 500.0 GiB (0.69%) in 14m 22s
drive-scsi0: transferred 4.0 GiB of 500.0 GiB (0.81%) in 14m 23s
drive-scsi0: transferred 4.6 GiB of 500.0 GiB (0.92%) in 14m 24s
drive-scsi0: transferred 5.0 GiB of 500.0 GiB (1.01%) in 14m 25s
drive-scsi0: transferred 5.5 GiB of 500.0 GiB (1.10%) in 14m 26s
drive-scsi0: transferred 6.0 GiB of 500.0 GiB (1.20%) in 14m 27s
drive-scsi0: transferred 6.5 GiB of 500.0 GiB (1.30%) in 14m 28s
drive-scsi0: transferred 6.9 GiB of 500.0 GiB (1.38%) in 14m 29s
drive-scsi0: transferred 7.4 GiB of 500.0 GiB (1.48%) in 14m 30s
drive-scsi0: transferred 7.8 GiB of 500.0 GiB (1.56%) in 14m 31s
drive-scsi0: transferred 8.2 GiB of 500.0 GiB (1.65%) in 14m 32s
… snipped to keep it sane, no errors here …
drive-scsi0: transferred 500.2 GiB of 500.8 GiB (99.87%) in 28m 32s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.94%) in 28m 33s
drive-scsi0: transferred 500.5 GiB of 500.8 GiB (99.95%) in 28m 34s
drive-scsi0: transferred 500.6 GiB of 500.8 GiB (99.96%) in 28m 35s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 36s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.97%) in 28m 37s
drive-scsi0: transferred 500.7 GiB of 500.8 GiB (99.98%) in 28m 38s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (99.99%) in 28m 39s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 40s
drive-scsi0: transferred 500.8 GiB of 500.8 GiB (100.00%) in 28m 41s, ready
all 'mirror' jobs are ready
2023-10-20 11:39:34 starting online/live migration on tcp:10.30.24.20:60000
2023-10-20 11:39:34 set migration capabilities
2023-10-20 11:39:34 migration downtime limit: 100 ms
2023-10-20 11:39:34 migration cachesize: 1.0 GiB
2023-10-20 11:39:34 set migration parameters
2023-10-20 11:39:34 start migrate command to tcp:10.30.24.20:60000
2023-10-20 11:39:35 migration active, transferred 615.9 MiB of 8.0 GiB VM-state, 537.5 MiB/s
2023-10-20 11:39:36 migration active, transferred 1.1 GiB of 8.0 GiB VM-state, 812.6 MiB/s
2023-10-20 11:39:37 migration active, transferred 1.6 GiB of 8.0 GiB VM-state, 440.5 MiB/s
2023-10-20 11:39:38 migration active, transferred 2.1 GiB of 8.0 GiB VM-state, 495.3 MiB/s
2023-10-20 11:39:39 migration active, transferred 2.5 GiB of 8.0 GiB VM-state, 250.1 MiB/s
2023-10-20 11:39:40 migration active, transferred 2.9 GiB of 8.0 GiB VM-state, 490.4 MiB/s
2023-10-20 11:39:41 migration active, transferred 3.4 GiB of 8.0 GiB VM-state, 514.4 MiB/s
2023-10-20 11:39:42 migration active, transferred 3.9 GiB of 8.0 GiB VM-state, 485.9 MiB/s
2023-10-20 11:39:43 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 488.2 MiB/s
2023-10-20 11:39:44 migration active, transferred 4.8 GiB of 8.0 GiB VM-state, 738.3 MiB/s
2023-10-20 11:39:45 migration active, transferred 5.6 GiB of 8.0 GiB VM-state, 730.8 MiB/s
2023-10-20 11:39:46 migration active, transferred 6.2 GiB of 8.0 GiB VM-state, 492.9 MiB/s
2023-10-20 11:39:47 migration active, transferred 6.7 GiB of 8.0 GiB VM-state, 471.5 MiB/s
2023-10-20 11:39:48 migration active, transferred 7.1 GiB of 8.0 GiB VM-state, 469.4 MiB/s
2023-10-20 11:39:49 migration active, transferred 7.9 GiB of 8.0 GiB VM-state, 666.7 MiB/s
2023-10-20 11:39:50 migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s
2023-10-20 11:39:51 migration active, transferred 9.4 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2023-10-20 11:39:51 xbzrle: send updates to 33286 pages in 23.2 MiB encoded memory, cache-miss 96.68%, overflow 5045
2023-10-20 11:39:52 auto-increased downtime to continue migration: 200 ms
2023-10-20 11:39:53 migration active, transferred 9.9 GiB of 8.0 GiB VM-state, 1.1 GiB/s
2023-10-20 11:39:53 xbzrle: send updates to 177238 pages in 60.2 MiB encoded memory, cache-miss 73.74%, overflow 9766
query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection
2023-10-20 11:39:54 query migrate failed: VM 148 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 148 not running
2023-10-20 11:39:55 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running
2023-10-20 11:39:56 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running
2023-10-20 11:39:57 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running
2023-10-20 11:39:58 query migrate failed: VM 148 not running
query migrate failed: VM 148 not running
2023-10-20 11:39:59 query migrate failed: VM 148 not running
2023-10-20 11:39:59 ERROR: online migrate failure - too many query migrate failures - aborting
2023-10-20 11:39:59 aborting phase 2 - cleanup resources
2023-10-20 11:39:59 migrate_cancel
2023-10-20 11:39:59 migrate_cancel error: VM 148 not running
2023-10-20 11:39:59 ERROR: query-status error: VM 148 not running
drive-scsi0: Cancelling block job
2023-10-20 11:39:59 ERROR: VM 148 not running
2023-10-20 11:40:04 ERROR: migration finished with problems (duration 00:29:17)
TASK ERROR: migration problems
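
To compare failing and working runs, here is a rough sketch that pulls the RAM progress lines out of a task log like the one above and reports how far the transferred amount went past the VM-state size. The names `ram_passes` and `overshoot_ratio` are made up for illustration, and the regex assumes the "migration active, transferred X of Y VM-state" wording seen in this log. A ratio above 1.0 only says that more data than the guest's RAM was sent, which would be consistent with pages being dirtied and re-sent; it does not by itself explain why the source QEMU process went away.

```python
import re

# Matches progress lines such as:
#   2023-10-20 11:39:50 migration active, transferred 8.6 GiB of 8.0 GiB VM-state, 771.9 MiB/s
RAM_LINE = re.compile(
    r"migration active, transferred\s+(?P<done>[\d.]+)\s+(?P<done_unit>[KMG]iB)"
    r"\s+of\s+(?P<total>[\d.]+)\s+(?P<total_unit>[KMG]iB)\s+VM-state"
)

# Conversion factors to MiB.
TO_MIB = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}


def ram_passes(log_text):
    """Return a list of (transferred_mib, total_mib), one per RAM progress line."""
    passes = []
    for m in RAM_LINE.finditer(log_text):
        done = float(m["done"]) * TO_MIB[m["done_unit"]]
        total = float(m["total"]) * TO_MIB[m["total_unit"]]
        passes.append((done, total))
    return passes


def overshoot_ratio(log_text):
    """Largest transferred/total ratio in the log, or 0.0 if no progress lines."""
    return max((done / total for done, total in ram_passes(log_text)), default=0.0)


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = (
        "2023-10-20 11:39:35 migration active, transferred 615.9 MiB of 8.0 GiB VM-state, 537.5 MiB/s\n"
        "2023-10-20 11:39:53 migration active, transferred 9.9 GiB of 8.0 GiB VM-state, 1.1 GiB/s\n"
    )
    print("RAM progress lines:", len(ram_passes(sample)))
    print("max transferred/total:", round(overshoot_ratio(sample), 3))
```

Feeding it the saved task logs of a VM that migrated fine and one that died would show whether the failing ones are the guests that dirty memory fastest.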