From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Fabian Ebner <f.ebner@proxmox.com>, pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [PATCH v2 qemu-server++ 0/15] remote migration
Date: Thu, 02 Dec 2021 16:36:36 +0100	[thread overview]
Message-ID: <1638457827.k2t0muctq1.astroid@nora.none> (raw)
In-Reply-To: <7a0968df-559b-81de-1df1-f912866b39d5@proxmox.com>

On November 30, 2021 3:06 pm, Fabian Ebner wrote:
> Am 11.11.21 um 15:07 schrieb Fabian Grünbichler:
>> this series adds remote migration for VMs.
>> 
>> both live and offline migration including NBD and storage-migrated disks
>> should work.
>> 
> 
> Played around with it for a while. The biggest issue is that migration 
> fails if there is no 'meta' property in the config. Most of the other 
> things I'd wish for concern better error handling, but it seems to be 
> in good shape otherwise!

thanks for testing and reviewing!

> Error "storage does not exist" if the real issue is missing access 
> rights. But that error also appears if missing access for 
> /cluster/resources or if the target node does not exists.

hmm, yeah, those are from the "abort early" checks that I'd like to punt 
to some pre-condition API endpoint anyway, or to the client itself.. but 
"missing access" and "doesn't exist" can be rather similar from that 
perspective, so probably a more generic error message is best :)

> For the 'config' command, 'Sys.Modify' seems to be required
>      failed to handle 'config' command - 403 Permission check failed (/, 
> Sys.Modify)

did you maybe have 'startup' set in the VM config? that requires 
Sys.Modify on / to change, and we pass the full config (well, most of 
it) to update_vm_api.. I don't think there is an intentional check 
otherwise for that privilege, so it's likely from something in the 
config..

> but it does create an empty configuration file, leading to
>      target_vmid: Guest with ID '5678' already exists on remote cluster
> on the next attempt.
> It also already allocates the disks, but doesn't clean them up, because 
> it gets the wrong lock (since the config is empty) and aborts the 'quit' 
> command.

yeah, that is related:
- we remove the lock prior to calling update_vm_api
- if updating fails, the config is therefore unlocked (and empty, since 
  it only contains the lock prior to this command)
- mtunnel still expects the create lock to be there and errors out

the fix is to not remove the lock, but to set skiplock for the 
update_vm_api call (this is in a flocked context with the config lock 
checked after entering the locked scope, so it is as safe as can be).

that way if update_vm_api fails, the config still has the 'create' lock, 
and cleanup triggered by the client works.
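
roughly like this inside the existing locked section (just a sketch, 
the helper and variable names here are illustrative, not the actual 
mtunnel code):

    PVE::QemuConfig->lock_config($vmid, sub {
        my $conf = PVE::QemuConfig->load_config($vmid);
        die "unexpected lock '" . ($conf->{lock} // '') . "'\n"
            if !$conf->{lock} || $conf->{lock} ne 'create';

        # keep the 'create' lock in the config and tell update_vm_api to
        # ignore it - if the update fails, the lock is still in place and
        # the client-triggered cleanup can proceed as usual
        $params->{skiplock} = 1;
        $update_vm_api->($params);
    });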

> If the config is not recent enough to have a 'meta' property:
>      failed to handle 'config' command - unable to parse value of 'meta' 
> - got undefined value
> Same issue with disk+config cleanup as above.

yeah, that should be skipped if undef/false. the missing cleanup is 
fixed by the changes above.
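
i.e. when transferring the config, roughly (sketch, variable names 
illustrative):

    # only take over 'meta' if the source config actually contains it
    $new_conf->{meta} = $conf->{meta} if $conf->{meta};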

> The local VM stays locked with 'migrate'. Is that how it should be?

unless you pass `--delete`, yes. the idea was that this prevents 
starting the VM on source and target cluster by accident, but using a 
different lock value might be more appropriate to signify 'this guest 
moved to another cluster'.

> Also the __migration__ snapshot will stay around, resulting in an error 
> when trying to migrate again.

I'll take a look, that shouldn't happen and probably indicates some 
missing cleanup somewhere ;)

> For live migration I always got a (cosmetic?) "WS closed 
> unexpectedly" error:
> tunnel: -> sending command "quit" to remote
> tunnel: <- got reply
> tunnel: Tunnel to 
> https://192.168.20.142:8006/api2/json/nodes/rob2/qemu/5678/mtunnelwebsocket?
> ticket=PVETUNNEL%3A<SNIP>&socket=%2Frun%2Fqemu-server%2F5678.mtunnel 
> failed - WS closed unexpectedly
> 2021-11-30 13:49:39 migration finished successfully (duration 00:01:02)
> UPID:pve701:0000D8AD:000CB782:61A61DA5:qmigrate:111:root@pam:

yeah, this is cosmetic. we send the quit, the remote end closes the 
socket, the tunnel doesn't know that the websocket closing is expected 
and prints that error. I'll think about whether we can handle this 
somehow with the existing commands, or whether we need a proper close 
command from the client side as well (I dropped the start of something 
like this in v2 ;))

-> quit
<- done (server expects WS to close)
-> close (ctrl cmd)
-> close WS

or

-> close (ctrl cmd, sets flag to expect WS to be closed)
-> quit
<- done
<- close WS

I'd like to avoid having the tunnel parse the quit command and having 
that serve a double-purpose, to keep the tunnel re-usable and agnostic.
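
seen from the perl side driving the tunnel binary, variant two would 
look roughly like this (the control message format is only sketched 
here, it is not the final protocol):

    use JSON qw(encode_json decode_json);

    my ($writer, $reader) = ($tunnel->{writer}, $tunnel->{reader});

    # tell the local tunnel that the websocket is expected to close soon
    print $writer encode_json({ control => JSON::true, cmd => 'close' }), "\n";
    # tell the remote end to finish up (forwarded over the websocket)
    print $writer encode_json({ cmd => 'quit' }), "\n";
    my $reply = decode_json(scalar <$reader>);    # <- done
    # the websocket closing afterwards is now expected, not an error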

> Fun fact: the identity storage mapping will be used for storages that 
> don't appear in the explicit mapping. E.g. it's possible to migrate a VM 
> that only has disks on storeA with --target-storage storeB:storeB (if 
> storeA exists on the target of course). But the explicit identity 
> mapping is prohibited.

thanks, that should be caught as well.
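
e.g. by treating the implicit identity fallback like an explicit entry 
when validating the mapping - roughly (a sketch, assuming the 
entries/default structure used by map_id):

    # storages not covered by the explicit mapping currently fall back to
    # the identity mapping - reject that like an explicit identity entry
    my $targetsid = $storagemap->{entries}->{$sid}
        // $storagemap->{default}
        // $sid;
    die "storage '$sid' is not covered by the storage mapping\n"
        if $targetsid eq $sid;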

> When a target bridge is not present (should that be detected ahead of 
> starting the migration?) and likely for any other startup failure the 
> only error in the log is:

yeah, the target bridge should probably be part of the initial checks 
like the storages (and then be re-checked anyway for the actual start 
command, but at least well-behaved clients get an early abort).
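
something along these lines in the early checks, querying the remote 
API (a sketch only - endpoint and parameter details not double-checked, 
and $api_client stands in for the remote cluster's API client):

    # collect the (mapped) bridges referenced by the VM's network devices
    my $bridges = {};
    for my $opt (grep { /^net\d+$/ } keys %$conf) {
        my $net = PVE::QemuServer::parse_net($conf->{$opt});
        my $bridge = $net->{bridge} // next;
        # apply the bridge mapping, falling back to the identity mapping
        my $target = $bridgemap->{entries}->{$bridge}
            // $bridgemap->{default}
            // $bridge;
        $bridges->{$target} = 1;
    }

    # and verify they exist on the target node before doing anything else
    my $avail = $api_client->get("/nodes/$target_node/network", { type => 'any_bridge' });
    my %have = map { $_->{iface} => 1 } @$avail;
    for my $bridge (sort keys %$bridges) {
        die "target bridge '$bridge' does not exist on node '$target_node'\n"
            if !$have{$bridge};
    }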

> 2021-11-30 14:43:10 ERROR: online migrate failure - error - tunnel 
> command '{"cmd":"star<SNIP>
> failed to handle 'start' command - start failed: QEMU exited with code 1
> For non-remote migration we are more verbose in this case and log the 
> QEMU output.

yeah, that should be easily improvable. the query-disk-import could 
similarly be extended with an optional 'msg' that we can use to print 
progress if there is any..
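
on the source side that could then simply be logged - roughly (a 
sketch, $query_disk_import stands in for the actual tunnel command 
helper):

    my $res = $query_disk_import->($tunnel);
    # an optional free-form 'msg' in the reply could carry progress or
    # error details (e.g. QEMU output) from the remote side
    $self->log('info', "remote: $res->{msg}") if defined($res->{msg});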

> Can/should an interrupt be handled more gracefully, so that remote 
> cleanup still happens?
> ^CCMD websocket tunnel died: command 'proxmox-websocket-tunnel' failed: 
> interrupted by signal
> 
> 2021-11-30 14:39:07 ERROR: interrupted by signal
> 2021-11-30 14:39:07 aborting phase 1 - cleanup resources
> 2021-11-30 14:39:08 ERROR: writing to tunnel failed: broken pipe
> 2021-11-30 14:39:08 ERROR: migration aborted (duration 00:00:10): 
> interrupted by signal

yeah, we should probably leave it up to the regular error-handling and 
cleanup logic to cleanly close the tunnel, instead of it being directly 
terminated by the signal..
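
e.g. by keeping the tunnel helper out of the terminal's process group 
when forking it, so a ^C only reaches the migration task itself, which 
can turn it into a regular error (a sketch, not how the tunnel is 
currently spawned):

    my $pid = fork() // die "fork failed: $!\n";
    if (!$pid) {
        # child: detach from the terminal's process group so an interactive
        # ^C does not kill the tunnel before the cleanup code could use it
        setpgrp(0, 0);
        exec('proxmox-websocket-tunnel') or die "exec failed: $!\n";
    }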



