From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Fabian Ebner <f.ebner@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [PATCH v2 qemu-server++ 0/15] remote migration
Date: Thu, 02 Dec 2021 16:36:36 +0100
Message-ID: <1638457827.k2t0muctq1.astroid@nora.none>
In-Reply-To: <7a0968df-559b-81de-1df1-f912866b39d5@proxmox.com>
On November 30, 2021 3:06 pm, Fabian Ebner wrote:
> Am 11.11.21 um 15:07 schrieb Fabian Grünbichler:
>> this series adds remote migration for VMs.
>>
>> both live and offline migration including NBD and storage-migrated disks
>> should work.
>>
>
> Played around with it for a while. Biggest issue is that migration fails
> if there is no 'meta' property in the config. Most other things I wish
> for are better error handling, but it seems to be in good shape otherwise!
thanks for testing and reviewing!
> Error "storage does not exist" if the real issue is missing access
> rights. But that error also appears if missing access for
> /cluster/resources or if the target node does not exist.
hmm, yeah, those are from the "abort early" checks that I'd like to punt
to some pre-condition API endpoint anyway, or to the client itself.. but
"missing access" and "doesn't exist" can be rather similar from that
perspective, so probably a more generic error message is best :)
> For the 'config' command, 'Sys.Modify' seems to be required
> failed to handle 'config' command - 403 Permission check failed (/,
> Sys.Modify)
did you maybe have 'startup' set in the VM config? that requires
Sys.Modify on / to change, and we pass the full config (well, most of
it) to update_vm_api.. I don't think there is an intentional check
otherwise for that privilege, so it's likely from something in the
config..
> but it does create an empty configuration file, leading to
> target_vmid: Guest with ID '5678' already exists on remote cluster
> on the next attempt.
> It also already allocates the disks, but doesn't clean them up, because
> it gets the wrong lock (since the config is empty) and aborts the 'quit'
> command.
yeah, that is related:
- we remove the lock prior to calling update_vm_api
- if updating fails, the config is therefore unlocked (and empty, since
it only contains the lock prior to this command)
- mtunnel still expects the create lock to be there and errors out
the fix is to not remove the lock, but to set skiplock for the
update_vm_api call (this is in a flocked context with the config lock
checked after entering the locked scope, so it is as safe as can be).
that way if update_vm_api fails, the config still has the 'create' lock,
and cleanup triggered by the client works.
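the fixed ordering can be sketched roughly like this (Python pseudo-flow
with made-up helper names; the real code is the Perl 'config' handler in
the mtunnel endpoint):

```python
# Illustrative sketch only: keep the 'create' lock in place and tell the
# config update to skip the lock check, so a failed update still leaves
# a locked config behind that the client-triggered cleanup can handle.

def handle_config_command(conf, new_values, update_vm_api):
    # re-check the lock after entering the flocked scope
    if conf.get("lock") != "create":
        raise RuntimeError("expected 'create' lock")
    # do NOT remove conf["lock"] first; ask the update to skip the check
    update_vm_api(conf, new_values, skiplock=True)
    # the lock is only dropped later, once everything succeeded

def failing_update(conf, new_values, skiplock=False):
    # stands in for update_vm_api hitting e.g. a permission error
    raise RuntimeError("403 Permission check failed")

conf = {"lock": "create"}
try:
    handle_config_command(conf, {"cores": 2}, failing_update)
except RuntimeError:
    pass
# config still carries the lock, so cleanup can proceed
assert conf["lock"] == "create"
```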
> If the config is not recent enough to have a 'meta' property:
> failed to handle 'config' command - unable to parse value of 'meta'
> - got undefined value
> Same issue with disk+config cleanup as above.
yeah, that should be skipped if undef/false. the missing cleanup is
fixed by the changes above.
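in effect something like this tiny (hypothetical, Python) filter before
forwarding the config - the actual fix is of course in the Perl code:

```python
# Illustrative sketch: drop undef/false values such as a missing 'meta'
# property instead of trying to parse them on the receiving end.
def filter_config(conf):
    return {k: v for k, v in conf.items() if v}

filtered = filter_config({"meta": None, "cores": 2})
assert "meta" not in filtered and filtered["cores"] == 2
```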
> The local VM stays locked with 'migrate'. Is that how it should be?
unless you pass `--delete`, yes. the idea was that this prevents
starting the VM on source and target cluster by accident, but using a
different lock value might be more appropriate to signify 'this guest
moved to another cluster'.
> Also the __migration__ snapshot will stay around, resulting in an error
> when trying to migrate again.
I'll take a look, that shouldn't happen and probably indicates some
missing cleanup somewhere ;)
> For live migration I always got a (cosmetic?) "WS closed
> unexpectedly"-error:
> tunnel: -> sending command "quit" to remote
> tunnel: <- got reply
> tunnel: Tunnel to
> https://192.168.20.142:8006/api2/json/nodes/rob2/qemu/5678/mtunnelwebsocket?
> ticket=PVETUNNEL%3A<SNIP>&socket=%2Frun%2Fqemu-server%2F5678.mtunnel
> failed - WS closed unexpectedly
> 2021-11-30 13:49:39 migration finished successfully (duration 00:01:02)
> UPID:pve701:0000D8AD:000CB782:61A61DA5:qmigrate:111:root@pam:
yeah, this is cosmetic. we send the quit, the remote end closes the
socket, the tunnel doesn't know that the websocket closing is expected
and prints that error. I'll think about whether we can handle this
somehow with the existing commands, or we need to do a proper close
command from the client side as well (I dropped the start of something
like this in v2 ;))
-> quit
<- done (server expects WS to close)
-> close (ctrl cmd)
-> close WS
or
-> close (ctrl cmd, sets flag to expect WS to be closed)
-> quit
<- done
<- close WS
I'd like to avoid having the tunnel parse the quit command and having
that serve a double-purpose, to keep the tunnel re-usable and agnostic.
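to illustrate the second variant, a minimal sketch (Python, with
invented names; the real tunnel is the Rust proxmox-websocket-tunnel):

```python
# Illustrative sketch: a 'close' control command sets a flag on the
# tunnel, so a later websocket close is treated as expected instead of
# being reported as "WS closed unexpectedly".

class Tunnel:
    def __init__(self):
        self.expect_close = False
        self.errors = []

    def control(self, cmd):
        if cmd == "close":
            self.expect_close = True

    def on_ws_closed(self):
        if not self.expect_close:
            self.errors.append("WS closed unexpectedly")

t = Tunnel()
t.control("close")   # client announces the upcoming shutdown
t.on_ws_closed()     # remote closes the WS after quit/done
assert t.errors == []

t2 = Tunnel()
t2.on_ws_closed()    # current behaviour: close without announcement
assert t2.errors == ["WS closed unexpectedly"]
```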
> Fun fact: the identity storage mapping will be used for storages that
> don't appear in the explicit mapping. E.g. it's possible to migrate a VM
> that only has disks on storeA with --target-storage storeB:storeB (if
> storeA exists on the target of course). But the explicit identity
> mapping is prohibited.
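the fallback being described boils down to something like this (map_id
is the real helper's name from patch 1/8, but this body is illustrative,
not the actual implementation):

```python
# Illustrative sketch: an unmapped storage silently falls back to the
# identity mapping, which is why a VM with disks only on storeA migrates
# fine with --target-storage storeB:storeB.

def map_id(mapping, source_id):
    # explicit entry wins; anything else falls through unchanged
    return mapping.get(source_id, source_id)

mapping = {"storeB": "storeB"}                 # --target-storage storeB:storeB
assert map_id(mapping, "storeA") == "storeA"   # implicit identity slips through
```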
thanks, that should be caught as well.
> When a target bridge is not present (should that be detected ahead of
> starting the migration?) and likely for any other startup failure the
> only error in the log is:
yeah, the target bridge should probably be part of the initial checks
like the storages (and then be re-checked anyway for the actual start
command, but at least non-misbehaving clients get an early abort).
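such a pre-condition check could look roughly like this (hypothetical
names, Python instead of the actual Perl; the real check would live next
to the existing storage checks):

```python
# Illustrative sketch: verify every mapped target bridge exists on the
# remote node before starting the migration, for an early readable abort.

def check_bridges(vm_nets, remote_bridges, bridgemap):
    for net, bridge in vm_nets.items():
        target = bridgemap.get(bridge, bridge)
        if target not in remote_bridges:
            raise ValueError(
                f"{net}: bridge '{target}' does not exist on remote node")

check_bridges({"net0": "vmbr0"}, {"vmbr0", "vmbr1"}, {})  # passes
```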
> 2021-11-30 14:43:10 ERROR: online migrate failure - error - tunnel
> command '{"cmd":"star<SNIP>
> failed to handle 'start' command - start failed: QEMU exited with code 1
> For non-remote migration we are more verbose in this case and log the
> QEMU output.
yeah, that should be easily improvable. the query-disk-import could
similarly be extended with an optional 'msg' that we can use to print
progress if there is any..
> Can/should an interrupt be handled more gracefully, so that remote
> cleanup still happens?
> ^CCMD websocket tunnel died: command 'proxmox-websocket-tunnel' failed:
> interrupted by signal
>
> 2021-11-30 14:39:07 ERROR: interrupted by signal
> 2021-11-30 14:39:07 aborting phase 1 - cleanup resources
> 2021-11-30 14:39:08 ERROR: writing to tunnel failed: broken pipe
> 2021-11-30 14:39:08 ERROR: migration aborted (duration 00:00:10):
> interrupted by signal
yeah, we should probably leave it up to the regular error-handling and
cleanup logic to cleanly close the tunnel, instead of it being directly
terminated by the signal..
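i.e. translate the signal into the normal error path instead of letting
it kill the worker outright - a rough (hypothetical, Python) sketch:

```python
# Illustrative sketch: an interrupt becomes a normal exception, so the
# existing error handling can still send cleanup commands and close the
# tunnel before the worker exits.

log = []

def close_tunnel():
    log.append("tunnel closed cleanly")

def migration_worker(deliver_interrupt):
    try:
        if deliver_interrupt:
            # stands in for a SIGINT converted into an exception
            raise KeyboardInterrupt("interrupted by signal")
        log.append("migration finished")
    except KeyboardInterrupt:
        log.append("aborting phase 1 - cleanup resources")
        raise
    finally:
        close_tunnel()  # runs on both success and interruption

try:
    migration_worker(deliver_interrupt=True)
except KeyboardInterrupt:
    pass
assert log == ["aborting phase 1 - cleanup resources", "tunnel closed cleanly"]
```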