Re: [pve-devel] [PATCH v2 qemu-server 2/2] remote-migration: add target-cpu param

From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH v2 qemu-server 2/2] remote-migration: add target-cpu param
Date: Thu, 27 Apr 2023 09:32:42 +0200	[thread overview]
Message-ID: <1682580098.xwye6zkp88.astroid@yuna.none> (raw)
In-Reply-To: <a278d421ea7e917e42f89a4de42ca4fb39913bce.camel@groupe-cyllene.com>

On April 27, 2023 7:50 am, DERUMIER, Alexandre wrote:
> Hi,
> 
> Le mercredi 26 avril 2023 à 15:14 +0200, Fabian Grünbichler a écrit :
>> On April 25, 2023 6:52 pm, Alexandre Derumier wrote:
>> > This patch add support for remote migration when target
>> > cpu model is different.
>> > 
>> > The target vm is restart after the migration
>> 
>> so this effectively introduces a new "hybrid" migration mode ;) the
>> changes are a bit smaller than I expected (in part thanks to patch
>> #1),
>> which is good.
>> 
>> there are semi-frequent requests for another variant (also applicable
>> to
>> containers) in the form of a two phase migration
>> - storage migrate
>> - stop guest
>> - incremental storage migrate
>> - start guest on target
>> 
> 
> But I'm not sure how to to an incremental storage migrate, without
> storage snapshot send|receiv.  (so zfs && rbd could work).
> 
> - Vm/ct is running
> - do a first snapshot + sync to target with zfs|rbd send|receive
> - stop the guest
> - do a second snapshot + incremental sync + sync to target with zfs|rbd
> send|receive
> - start the guest on remote
> 
> 
> (or maybe for vm, without snapshot, with a dirty bitmap ? But we need
> to be able to write the dirty map content to disk somewhere after vm
> stop, and reread it for the last increment )

theoretically, we could support such a mode for non-snapshot storages by
using bitmaps+block-mirror, yes. either with a target VM, or with
qemu-storage-daemon on the target node exposing the target volumes

> - vm is running
> - create a dirty-bitmap and start sync with qemu-block-storage
> - stop the vm && save the dirty bitmap
> - reread the dirtymap && do incremental sync (with the new qemu-daemon-
> storage or starting the vm paused ?

stop here could also just mean stop the guest OS, but leave the process
for the incremental sync, so it would not need persistent bitmap
support.

> And currently we don't support yet offline storage migration. (BTW,
> This is also breaking migration with unused disk).
> I don't known if we can send send|receiv transfert through the tunnel ?
> (I never tested it)

we do, but maybe you tested with RBD which doesn't support storage
migration yet? withing a cluster it doesn't need to, since it's a shared
storage, but between cluster we need to implement it (it's on my TODO
list and shouldn't be too hard since there is 'rbd export/import').

>> given that it might make sense to save-guard this implementation
>> here,
>> and maybe switch to a new "mode" parameter?
>> 
>> online => switching CPU not allowed
>> offline or however-we-call-this-new-mode (or in the future, two-
>> phase-restart) => switching CPU allowed
>> 
> 
> Yes, I was thinking about that too.
> Maybe not "offline", because maybe we want to implement a real offline
> mode later.
> But simply "restart" ?

no, I meant moving the existing --online switch to a new mode parameter,
then we'd have "online" and "offline", and then add your new mode on top
"however-we-call-this-new-mode", and then we could in the future also
add "two-phase-restart" for the sync-twice mode I described :)

target-cpu would of course also be supported for the (existing) offline
mode, since it just needs to adapt the target-cpu in the config.

the main thing I'd want to avoid is somebody accidentally setting
"target-cpu", not knowing/noticing that that entails what amounts to a
reset of the VM as part of the migration..

there were a few things down below that might also be worthy of
discussion. I also wonder whether the two variants of "freeze FS" and
"suspend without state" are enough - that only ensures that no more I/O
happens so the volumes are bitwise identical, but shouldn't we also at
least have the option of doing a clean shutdown at that point so that
applications can serialize/flush their state properly and that gets
synced across as well? else this is the equivalent of cutting the power
cord, which might not be a good fit for all use cases ;)

>> > 
>> > Signed-off-by: Alexandre Derumier <aderumier@odiso.com>
>> > ---
>> >  PVE/API2/Qemu.pm   | 18 ++++++++++++++++++
>> >  PVE/CLI/qm.pm      |  6 ++++++
>> >  PVE/QemuMigrate.pm | 25 +++++++++++++++++++++++++
>> >  3 files changed, 49 insertions(+)
>> > 
>> > diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
>> > index 587bb22..6703c87 100644
>> > --- a/PVE/API2/Qemu.pm
>> > +++ b/PVE/API2/Qemu.pm
>> > @@ -4460,6 +4460,12 @@ __PACKAGE__->register_method({
>> >                 optional => 1,
>> >                 default => 0,
>> >             },
>> > +           'target-cpu' => {
>> > +               optional => 1,
>> > +               description => "Target Emulated CPU model. For
>> > online migration, the storage is live migrate, but the memory
>> > migration is skipped and the target vm is restarted.",
>> > +               type => 'string',
>> > +               format => 'pve-vm-cpu-conf',
>> > +           },
>> >             'target-storage' => get_standard_option('pve-
>> > targetstorage', {
>> >                 completion =>
>> > \&PVE::QemuServer::complete_migration_storage,
>> >                 optional => 0,
>> > @@ -4557,11 +4563,14 @@ __PACKAGE__->register_method({
>> >         raise_param_exc({ 'target-bridge' => "failed to parse
>> > bridge map: $@" })
>> >             if $@;
>> >  
>> > +       my $target_cpu = extract_param($param, 'target-cpu');
>> 
>> this is okay
>> 
>> > +
>> >         die "remote migration requires explicit storage mapping!\n"
>> >             if $storagemap->{identity};
>> >  
>> >         $param->{storagemap} = $storagemap;
>> >         $param->{bridgemap} = $bridgemap;
>> > +       $param->{targetcpu} = $target_cpu;
>> 
>> but this is a bit confusing with the variable/hash key naming ;)
>> 
>> >         $param->{remote} = {
>> >             conn => $conn_args, # re-use fingerprint for tunnel
>> >             client => $api_client,
>> > @@ -5604,6 +5613,15 @@ __PACKAGE__->register_method({
>> >                     PVE::QemuServer::nbd_stop($state->{vmid});
>> >                     return;
>> >                 },
>> > +               'restart' => sub {
>> > +                   PVE::QemuServer::vm_stop(undef, $state->{vmid},
>> > 1, 1);
>> > +                   my $info = PVE::QemuServer::vm_start_nolock(
>> > +                       $state->{storecfg},
>> > +                       $state->{vmid},
>> > +                       $state->{conf},
>> > +                   );
>> > +                   return;
>> > +               },
>> >                 'resume' => sub {
>> >                     if
>> > (PVE::QemuServer::Helpers::vm_running_locally($state->{vmid})) {
>> >                         PVE::QemuServer::vm_resume($state->{vmid},
>> > 1, 1);
>> > diff --git a/PVE/CLI/qm.pm b/PVE/CLI/qm.pm
>> > index c3c2982..06c74c1 100755
>> > --- a/PVE/CLI/qm.pm
>> > +++ b/PVE/CLI/qm.pm
>> > @@ -189,6 +189,12 @@ __PACKAGE__->register_method({
>> >                 optional => 1,
>> >                 default => 0,
>> >             },
>> > +           'target-cpu' => {
>> > +               optional => 1,
>> > +               description => "Target Emulated CPU model. For
>> > online migration, the storage is live migrate, but the memory
>> > migration is skipped and the target vm is restarted.",
>> > +               type => 'string',
>> > +               format => 'pve-vm-cpu-conf',
>> > +           },
>> >             'target-storage' => get_standard_option('pve-
>> > targetstorage', {
>> >                 completion =>
>> > \&PVE::QemuServer::complete_migration_storage,
>> >                 optional => 0,
>> > diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
>> > index e182415..04f8053 100644
>> > --- a/PVE/QemuMigrate.pm
>> > +++ b/PVE/QemuMigrate.pm
>> > @@ -731,6 +731,11 @@ sub cleanup_bitmaps {
>> >  sub live_migration {
>> >      my ($self, $vmid, $migrate_uri, $spice_port) = @_;
>> >  
>> > +    if($self->{opts}->{targetcpu}){
>> > +        $self->log('info', "target cpu is different - skip live
>> > migration.");
>> > +        return;
>> > +    }
>> > +
>> >      my $conf = $self->{vmconf};
>> >  
>> >      $self->log('info', "starting online/live migration on
>> > $migrate_uri");
>> > @@ -995,6 +1000,7 @@ sub phase1_remote {
>> >      my $remote_conf = PVE::QemuConfig->load_config($vmid);
>> >      PVE::QemuConfig->update_volume_ids($remote_conf, $self-
>> > >{volume_map});
>> >  
>> > +    $remote_conf->{cpu} = $self->{opts}->{targetcpu};
>> 
>> do we need permission checks here (or better, somewhere early on, for
>> doing this here)
>> 
>> >      my $bridges = map_bridges($remote_conf, $self->{opts}-
>> > >{bridgemap});
>> >      for my $target (keys $bridges->%*) {
>> >         for my $nic (keys $bridges->{$target}->%*) {
>> > @@ -1354,6 +1360,21 @@ sub phase2 {
>> >      live_migration($self, $vmid, $migrate_uri, $spice_port);
>> >  
>> >      if ($self->{storage_migration}) {
>> > +
>> > +        #freeze source vm io/s if target cpu is different (no
>> > livemigration)
>> > +       if ($self->{opts}->{targetcpu}) {
>> > +           my $agent_running = $self->{conf}->{agent} &&
>> > PVE::QemuServer::qga_check_running($vmid);
>> > +           if ($agent_running) {
>> > +               print "freeze filesystem\n";
>> > +               eval { mon_cmd($vmid, "guest-fsfreeze-freeze"); };
>> > +               die $@ if $@;
>> 
>> die here
>> 
>> > +           } else {
>> > +               print "suspend vm\n";
>> > +               eval { PVE::QemuServer::vm_suspend($vmid, 1); };
>> > +               warn $@ if $@;
>> 
>> but warn here?
>> 
>> I'd like some more rationale for these two variants, what are the
>> pros
>> and cons? should we make it configurable?
>> > +           }
>> > +       }
>> > +
>> >         # finish block-job with block-job-cancel, to disconnect
>> > source VM from NBD
>> >         # to avoid it trying to re-establish it. We are in blockjob
>> > ready state,
>> >         # thus, this command changes to it to blockjob complete
>> > (see qapi docs)
>> > @@ -1608,6 +1629,10 @@ sub phase3_cleanup {
>> >      # clear migrate lock
>> >      if ($tunnel && $tunnel->{version} >= 2) {
>> >         PVE::Tunnel::write_tunnel($tunnel, 10, "unlock");
>> > +       if ($self->{opts}->{targetcpu}) {
>> > +           $self->log('info', "target cpu is different - restart
>> > target vm.");
>> > +           PVE::Tunnel::write_tunnel($tunnel, 10, 'restart');
>> > +       }
>> >  
>> >         PVE::Tunnel::finish_tunnel($tunnel);
>> >      } else {
>> > -- 
>> > 2.30.2
>> > 
>> > 
>> > _______________________________________________
>> > pve-devel mailing list
>> > pve-devel@lists.proxmox.com
>> > https://antiphishing.cetsi.fr/proxy/v3?i=Zk92VEFKaGQ4Ums4cnZEUWMTpfHaXFQGRw1_CnOoOH0&r=bHA1dGV3NWJQVUloaWNFUZPm0fiiBviaiy_RDav2GQ1U4uy6lsDDv3uBszpvvWYQN5FqKqFD6WPYupfAUP1c9g&f=SlhDbE9uS2laS2JaZFpNWvmsxai1zlJP9llgnl5HIv-4jAji8Dh2BQawzxID5bzr6Uv-3EQd-eluQbsPfcUOTg&u=https%3A//lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel&k=XRKU
>> > 
>> > 
>> > 
>> 
>> 
>> _______________________________________________
>> pve-devel mailing list
>> pve-devel@lists.proxmox.com
>> https://antiphishing.cetsi.fr/proxy/v3?i=Zk92VEFKaGQ4Ums4cnZEUWMTpfHaXFQGRw1_CnOoOH0&r=bHA1dGV3NWJQVUloaWNFUZPm0fiiBviaiy_RDav2GQ1U4uy6lsDDv3uBszpvvWYQN5FqKqFD6WPYupfAUP1c9g&f=SlhDbE9uS2laS2JaZFpNWvmsxai1zlJP9llgnl5HIv-4jAji8Dh2BQawzxID5bzr6Uv-3EQd-eluQbsPfcUOTg&u=https%3A//lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel&k=XRKU
>> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>