Date: Fri, 28 Apr 2023 11:12:54 +0200
From: Fabian Grünbichler
To: Proxmox VE development discussion
Subject: Re: [pve-devel] [PATCH v2 qemu-server 2/2] remote-migration: add target-cpu param
Message-Id: <1682671458.u9jiyz5qe3.astroid@yuna.none>
In-Reply-To: <6999b2267fb55cad2a60675c58051a8ef8258284.camel@groupe-cyllene.com>
References: <20230425165233.3745210-1-aderumier@odiso.com>
 <20230425165233.3745210-3-aderumier@odiso.com>
 <1682514292.71raew01tr.astroid@yuna.none>
 <1682580098.xwye6zkp88.astroid@yuna.none>
 <6999b2267fb55cad2a60675c58051a8ef8258284.camel@groupe-cyllene.com>

On April 28, 2023 8:43 am, DERUMIER, Alexandre wrote:
>>> And currently we don't support offline storage migration yet. (BTW,
>>> this is also breaking migration with unused disks.)
>>> I don't know if we can send send|receive transfers through the
>>> tunnel? (I never tested it.)
>
>> we do, but maybe you tested with RBD, which doesn't support storage
>> migration yet? within a cluster it doesn't need to, since it's a
>> shared storage, but between clusters we need to implement it (it's on
>> my TODO list and shouldn't be too hard since there is 'rbd
>> export/import').
>
> Yes, this was with an unused RBD device indeed.
> (Another way could be to implement qemu-storage-daemon (never tested
> it) for offline sync with any storage, like LVM.)
>
> Also cloud-init drives seem to be unmigratable currently.
> (I wonder if we couldn't simply regenerate it on the target - now that
> we have the cloud-init pending section, we can correctly generate the
> cloud-init drive from the current running config.)
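
just to make the 'rbd export/import' part a bit more concrete - I'd
expect it to boil down to a volume_export/volume_import pair in the RBD
plugin that streams over the tunnel's file handle, roughly like the
following (a rough sketch from memory, untested; names and details
might not match the final implementation):

use PVE::Tools qw(run_command);

# hypothetical RBDPlugin::volume_export - parameter list follows the
# existing PVE::Storage::Plugin export/import interface; building the rbd
# command with monitor/auth options and writing the 'raw+size' size header
# are left out to keep the sketch short
sub volume_export {
    my ($class, $scfg, $storeid, $fh, $volname, $format, $snapshot,
        $base_snapshot, $with_snapshots) = @_;

    die "unsupported export format '$format'\n" if $format ne 'raw+size';

    # stream the image to the file handle provided by the migration tunnel;
    # volume_import on the target cluster does the reverse with
    # 'rbd import - <image>'
    my $cmd = ['/usr/bin/rbd', '-p', $scfg->{pool}, 'export', $volname, '-'];
    run_command($cmd, output => '>&' . fileno($fh));

    return;
}

the import side would just be the mirror image of that, feeding the
tunnel into 'rbd import'.
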
>>>> given that it might make sense to safeguard this implementation
>>>> here, and maybe switch to a new "mode" parameter?
>>>>
>>>> online => switching CPU not allowed
>>>> offline or however-we-call-this-new-mode (or in the future,
>>>> two-phase-restart) => switching CPU allowed
>>>
>>> Yes, I was thinking about that too.
>>> Maybe not "offline", because maybe we want to implement a real
>>> offline mode later.
>>> But simply "restart"?
>>
>> no, I meant moving the existing --online switch to a new mode
>> parameter, then we'd have "online" and "offline", and then add your
>> new mode on top "however-we-call-this-new-mode", and then we could in
>> the future also add "two-phase-restart" for the sync-twice mode I
>> described :)
>>
>> target-cpu would of course also be supported for the (existing)
>> offline mode, since it just needs to adapt the target-cpu in the
>> config.
>>
>> the main thing I'd want to avoid is somebody accidentally setting
>> "target-cpu", not knowing/noticing that that entails what amounts to
>> a reset of the VM as part of the migration..
>
> Yes, that's what I had understood ;)
>
> It was more about the "offline" term, because we don't take the source
> VM offline until the disk migration is finished (to reduce downtime).
> More like "online-restart" instead of "offline".
>
> Offline for me is really: we shut the VM down, then do the disk
> migration.

hmm, I guess that depends on how you see it. for me, online means
without interruption, anything else is offline :) but yeah, naming is
hard, as always ;)

>> there were a few things down below that might also be worthy of
>> discussion. I also wonder whether the two variants of "freeze FS" and
>> "suspend without state" are enough - that only ensures that no more
>> I/O happens so the volumes are bitwise identical, but shouldn't we
>> also at least have the option of doing a clean shutdown at that point
>> so that applications can serialize/flush their state properly and
>> that gets synced across as well? else this is the equivalent of
>> cutting the power cord, which might not be a good fit for all use
>> cases ;)
>
> I had tried the clean shutdown in my v1 patch
> https://lists.proxmox.com/pipermail/pve-devel/2023-March/056291.html
> (without doing the block-job-complete) in phase3, and I got FS
> corruption sometimes.
> Not sure why exactly (maybe the OS didn't shut down correctly, or
> maybe some data was still in the buffers?).
> Maybe doing the block-job-complete before would make it safe
> (transfer access to the NBD target, then do the clean shutdown).

possibly we need a special "shutdown guest, but leave qemu running" way
of shutting down (so that the guest and any applications within can do
their thing, and the block job can transfer all the delta across).
completing or cancelling the block job before the guest has shut down
would mean the source and target are not consistent (since shutdown can
change the disk content, and that would then not be mirrored anymore?),
so I don't see any way that that could be an improvement. it would mean
that starting the shutdown is already the point of no return -
cancelling before would mean writes are not transferred to the target,
completing before would mean writes are not written to the source
anymore, so we can't fall back to the source node in error handling.
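
roughly, that variant could look like this on the source side (a rough,
untested sketch - it assumes the source VM runs with -no-shutdown so
qemu survives the guest shutdown, and uses mon_cmd as in
PVE::QemuServer::Monitor; timeouts and error handling omitted):

use PVE::QemuServer::Monitor qw(mon_cmd);

# sketch of a "shutdown guest, but leave qemu running" handover - assumes
# the source VM was started with -no-shutdown, so qemu keeps running (and
# the mirror jobs keep existing) after the guest has halted
sub shutdown_then_complete_mirror {
    my ($vmid, $jobs) = @_;

    # ask the guest OS to shut down cleanly, so applications can flush
    # their state - all resulting writes still go through the mirror jobs
    mon_cmd($vmid, 'system_powerdown');

    # wait until the guest has actually halted; with -no-shutdown,
    # query-status then reports 'shutdown' instead of qemu exiting
    while (mon_cmd($vmid, 'query-status')->{status} ne 'shutdown') {
        sleep(1);
    }

    # a real implementation would also wait here for the mirror jobs to
    # have converged again (BLOCK_JOB_READY), and enforce a timeout above

    # only now hand over the disks - source and target are identical, and
    # the guest cannot issue any further writes
    mon_cmd($vmid, 'block-job-complete', device => $_) for sort keys %$jobs;

    return;
}
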
I guess we could have two approaches:

A - freeze or suspend (depending on QGA availability), then complete
    the block job and (re)start the target VM
B - shut down the guest OS, complete, then exit the source VM and
    (re)start the target VM

as always, there's a tradeoff there - A is faster, but less consistent
from the guest's point of view (somewhat similar to pulling the power
cable). B can take a while (== service downtime!), but it has the same
semantics as a reboot.

there are also IMHO multiple ways to think about the target side:

A start the VM in migration mode, but kill it without ever doing any
  migration, then start it again with the modified config (current
  approach)
B start the VM paused (like when doing a backup of a stopped VM,
  without incoming migration), but with the overridden CPU parameter,
  then just 'resume' it when the block migration is finished
C don't start a VM at all, just the block devices via
  qemu-storage-daemon for the block migration, then do a regular start
  after the block migration and config update are done

B has the advantage over A that we don't risk the VM not being able to
restart (e.g., because of a race for memory or pass-through resources),
and also the resume should be (depending on the exact environment,
possibly quite a bit) faster than kill+start.

C has the advantage over A and B that the migration itself is cheaper
resource-wise, but the big downside that we don't even know if the VM
is startable on the target node, and of course, it's a lot more code to
write. possibly I just included it because I am looking for an excuse
to play around with qemu-storage-daemon - it's probably the least
relevant variant for now ;)

> I'll give it a try in the V3.
>
> I just wonder if we can add a new param, like:
>
> --online --fsfreeze
>
> --online --shutdown
>
> --online --2phase-restart

that would also be an option. not sure by heart if it's possible to
make --online into a property string that is backwards compatible with
the "plain boolean" option? if so, we could do

--online [mode=live,qga,suspend,shutdown,2phase,..]

with live being the default (not supporting target-cpu) and
qga,suspend,shutdown all handling target-cpu (2phase just included for
completeness' sake).

alternatively, if that doesn't work, having

--online [--online-mode live,qga,suspend,..]

would be my second choice I guess, if we are reasonably sure that all
the possible extensions would be for running VMs only. the only thing
counter to that that I can think of would be storage migration using
qemu-storage-daemon (e.g., like you said, to somehow bolt on
incremental support using persistent bitmaps for storages/image formats
that don't support that otherwise), and there I am not even sure
whether that couldn't be somehow handled in pve-storage anyway.
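
for reference, on the schema side I'd imagine something roughly like
this (an untested sketch in PVE::JSONSchema style - whether a plain
'--online' / '--online 1' can still be accepted and mapped to
'mode=live' is exactly the backwards-compatibility question above):

# the property-string format - 'default_key' should allow both
# '--online mode=shutdown' and the shorthand '--online shutdown'
my $online_format = {
    mode => {
        type => 'string',
        enum => ['live', 'qga', 'suspend', 'shutdown', '2phase'],
        default => 'live',
        default_key => 1,
        description => "how to hand over the running guest - anything but"
            ." 'live' implies a restart on the target and allows target-cpu",
    },
};

# in the migration API parameters, replacing the current boolean:
#     online => {
#         type => 'string',
#         format => $online_format,
#         optional => 1,
#     },
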
> (I'm currently migrating a lot of VMs from an old Intel cluster to a
> new AMD cluster, in a different datacenter, with a different Ceph
> cluster, so I can still do real production tests.)

technically, target-cpu might also be a worthwhile feature for
heterogeneous clusters where a proper/full live migration is not
possible for certain node/CPU combinations.. we do already update
volume IDs when using 'targetstorage', so also updating the CPU should
be doable there as well. using the still experimental remote migration
as a testing ground for evaluation is fine, just something to keep in
mind while thinking about options, so that we don't accidentally
maneuver ourselves into a corner that makes that part impossible :)