Date: Thu, 02 Dec 2021 16:36:36 +0100
From: Fabian Grünbichler
To: Fabian Ebner, pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [PATCH v2 qemu-server++ 0/15] remote migration

On November 30, 2021 3:06 pm, Fabian Ebner wrote:
> On 11.11.21 at 15:07, Fabian Grünbichler wrote:
>> this series adds remote migration for VMs.
>>
>> both live and offline migration including NBD and storage-migrated disks
>> should work.
>>
>
> Played around with it for a while. Biggest issue is that migration fails
> if there is no 'meta' property in the config. Most other things I wish
> for are better error handling, but it seems to be in good shape otherwise!

thanks for testing and reviewing!

> Error "storage does not exist" if the real issue is missing access
> rights. But that error also appears if access to /cluster/resources is
> missing or if the target node does not exist.

hmm, yeah, those are from the "abort early" checks that I'd like to punt
to some pre-condition API endpoint anyway, or to the client itself.. but
"missing access" and "doesn't exist" can be rather similar from that
perspective, so probably a more generic error message is best :)
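e.g. something along these lines (untested sketch using the existing
storage/permission helpers; the privilege checked here is just a guess at
what the series actually requires):

```perl
use strict;
use warnings;

use PVE::RPCEnvironment;
use PVE::Storage;

# untested sketch: one generic error for both "doesn't exist / not enabled"
# and "no access", so the caller can't tell which of the two it was
my $check_storage_access = sub {
    my ($storeid) = @_;

    my $rpcenv = PVE::RPCEnvironment::get();
    my $authuser = $rpcenv->get_user();

    my $ok = eval {
	PVE::Storage::storage_check_enabled(PVE::Storage::config(), $storeid);
	$rpcenv->check($authuser, "/storage/$storeid", ['Datastore.AllocateSpace']);
	1;
    };

    die "storage '$storeid' is not available on this node\n" if !$ok;
};
```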
> For the 'config' command, 'Sys.Modify' seems to be required
> failed to handle 'config' command - 403 Permission check failed (/,
> Sys.Modify)

did you maybe have 'startup' set in the VM config? that requires
Sys.Modify on / to change, and we pass the full config (well, most of
it) to update_vm_api.. I don't think there is an intentional check
otherwise for that privilege, so it's likely from something in the
config..

> but it does create an empty configuration file, leading to
> target_vmid: Guest with ID '5678' already exists on remote cluster
> on the next attempt.
> It also already allocates the disks, but doesn't clean them up, because
> it gets the wrong lock (since the config is empty) and aborts the 'quit'
> command.

yeah, that is related:
- we remove the lock prior to calling update_vm_api
- if updating fails, the config is therefore unlocked (and empty, since
  it only contains the lock prior to this command)
- mtunnel still expects the create lock to be there and errors out

the fix is to not remove the lock, but to set skiplock for the
update_vm_api call (this is in a flocked context with the config lock
checked after entering the locked scope, so it is as safe as can be).
that way if update_vm_api fails, the config still has the 'create' lock,
and cleanup triggered by the client works.

> If the config is not recent enough to have a 'meta' property:
> failed to handle 'config' command - unable to parse value of 'meta'
> - got undefined value
> Same issue with disk+config cleanup as above.

yeah, that should be skipped if undef/false. the missing cleanup is
fixed by the changes above.

> The local VM stays locked with 'migrate'. Is that how it should be?

unless you pass `--delete`, yes. the idea was that this prevents
starting the VM on source and target cluster by accident, but using a
different lock value might be more appropriate to signify 'this guest
moved to another cluster'.

> Also the __migration__ snapshot will stay around, resulting in an error
> when trying to migrate again.

I'll take a look, that shouldn't happen and probably indicates some
missing cleanup somewhere ;)

> For live migration I always got a (cosmetic?) "WS closed
> unexpectedly" error:
> tunnel: -> sending command "quit" to remote
> tunnel: <- got reply
> tunnel: Tunnel to
> https://192.168.20.142:8006/api2/json/nodes/rob2/qemu/5678/mtunnelwebsocket?ticket=PVETUNNEL%3A&socket=%2Frun%2Fqemu-server%2F5678.mtunnel
> failed - WS closed unexpectedly
> 2021-11-30 13:49:39 migration finished successfully (duration 00:01:02)
> UPID:pve701:0000D8AD:000CB782:61A61DA5:qmigrate:111:root@pam:

yeah, this is cosmetic. we send the quit, the remote end closes the
socket, the tunnel doesn't know that the websocket closing is expected
and prints that error. I'll think about whether we can handle this
somehow with the existing commands, or whether we need a proper close
command from the client side as well (I dropped the start of something
like this in v2 ;))

-> quit
<- done (server expects WS to close)
-> close (ctrl cmd)
-> close WS

or

-> close (ctrl cmd, sets flag to expect WS to be closed)
-> quit
<- done
<- close WS

I'd like to avoid having the tunnel parse the quit command and having
that serve a double purpose, to keep the tunnel re-usable and agnostic.
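the first variant could look roughly like this from the client's side
(untested sketch; the helper closures and the 'control' flag are
assumptions for illustration, not the tunnel's actual interface):

```perl
# untested sketch of the first close variant, client side
sub finish_tunnel {
    my ($write_line, $read_line, $close_websocket) = @_;

    # forwarded command: the remote end runs its cleanup and replies 'done'
    $write_line->({ cmd => 'quit' });
    my $res = $read_line->();
    die "quit failed\n" if !$res->{success};

    # local-only control command: tell the tunnel binary that the websocket
    # is expected to go away now, so it doesn't report an error
    $write_line->({ control => 1, cmd => 'close' });

    # actually close the websocket / let the tunnel child exit
    $close_websocket->();
}
```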
> Fun fact: the identity storage mapping will be used for storages that
> don't appear in the explicit mapping. E.g. it's possible to migrate a VM
> that only has disks on storeA with --target-storage storeB:storeB (if
> storeA exists on the target of course). But the explicit identity
> mapping is prohibited.

thanks, that should be caught as well.

> When a target bridge is not present (should that be detected ahead of
> starting the migration?) and likely for any other startup failure, the
> only error in the log is:

yeah, the target bridge should probably be part of the initial checks
like the storages (and then be re-checked anyway for the actual start
command, but at least non-misbehaving clients get an early abort).

> 2021-11-30 14:43:10 ERROR: online migrate failure - error - tunnel
> command '{"cmd":"star
> failed to handle 'start' command - start failed: QEMU exited with code 1
> For non-remote migration we are more verbose in this case and log the
> QEMU output.

yeah, that should be easily improvable. the query-disk-import command
could similarly be extended with an optional 'msg' that we can use to
print progress if there is any..

> Can/should an interrupt be handled more gracefully, so that remote
> cleanup still happens?
> ^CCMD websocket tunnel died: command 'proxmox-websocket-tunnel' failed:
> interrupted by signal
>
> 2021-11-30 14:39:07 ERROR: interrupted by signal
> 2021-11-30 14:39:07 aborting phase 1 - cleanup resources
> 2021-11-30 14:39:08 ERROR: writing to tunnel failed: broken pipe
> 2021-11-30 14:39:08 ERROR: migration aborted (duration 00:00:10):
> interrupted by signal

yeah, we should probably leave it up to the regular error-handling and
cleanup logic to cleanly close the tunnel, instead of it being directly
terminated by the signal..
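i.e. roughly something like this (untested sketch; the three closures are
stand-ins for the existing phase/cleanup code, and the tunnel child would
additionally need to ignore or handle the signal itself so it stays around
long enough for the remote cleanup):

```perl
# untested sketch - $run_migration, $cleanup_remote and $finish_tunnel are
# made-up closures standing in for the existing code, not actual functions
sub run_migration_with_cleanup {
    my ($run_migration, $cleanup_remote, $finish_tunnel) = @_;

    # convert the signal into a regular error, so the normal abort path
    # below runs instead of everything being torn down mid-command
    local $SIG{INT} = sub { die "interrupted by signal\n" };
    local $SIG{TERM} = $SIG{INT};

    eval { $run_migration->() };
    if (my $err = $@) {
	warn "aborting phase 1 - cleanup resources\n";
	eval { $cleanup_remote->() };  # tunnel still needs to be usable here
	eval { $finish_tunnel->() };   # orderly 'quit' + websocket close
	die $err;
    }
}
```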