public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH manager 1/3] Hold the guest migration lock when changing the replication config
From: Fabian Ebner @ 2020-07-30 11:29 UTC
  To: pve-devel

The guest migration lock is already held when running replications,
but it also makes sense to hold it when updating the replication
config itself. Otherwise, a migration may not see the de facto
state of replication.

For example:
1. migration starts
2. replication job is deleted
3. migration reads the replication config
4. migration runs the replication which causes the
   replicated disks to be removed, because the job
   is marked for removal
5. migration will continue without replication

Note that the migration doesn't actually fail, but it's probably
not the desired behavior either.
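For context, the migration worker itself runs inside the same per-guest
lock, which is why serializing config updates on it closes the race.
A simplified sketch of the two sides (the call sites shown here are
illustrative, not the actual code):

    use PVE::GuestHelpers;
    use PVE::ReplicationConfig;

    my $vmid = 100;               # example guest ID
    my $delete_code = sub { };    # placeholder for the real config update

    # migration worker: holds the guest lock for the whole migration
    PVE::GuestHelpers::guest_migration_lock($vmid, 10, sub {
        # ... read replication config, run replication, move config ...
    });

    # replication config update (e.g. job deletion): with this patch it
    # now blocks until the migration above has finished, or fails after
    # the 10 second timeout, instead of marking the job for removal
    # mid-migration
    PVE::GuestHelpers::guest_migration_lock($vmid, 10, sub {
        PVE::ReplicationConfig::lock($delete_code);
    });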

Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
 PVE/API2/ReplicationConfig.pm | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/PVE/API2/ReplicationConfig.pm b/PVE/API2/ReplicationConfig.pm
index 2b4ecd10..e5262068 100644
--- a/PVE/API2/ReplicationConfig.pm
+++ b/PVE/API2/ReplicationConfig.pm
@@ -9,6 +9,7 @@ use PVE::JSONSchema qw(get_standard_option);
 use PVE::RPCEnvironment;
 use PVE::ReplicationConfig;
 use PVE::Cluster;
+use PVE::GuestHelpers;
 
 use PVE::RESTHandler;
 
@@ -144,7 +145,9 @@ __PACKAGE__->register_method ({
 	    $cfg->write();
 	};
 
-	PVE::ReplicationConfig::lock($code);
+	PVE::GuestHelpers::guest_migration_lock($guest, 10, sub {
+	    PVE::ReplicationConfig::lock($code);
+	});
 
 	return undef;
     }});
@@ -167,6 +170,7 @@ __PACKAGE__->register_method ({
 	my $id = extract_param($param, 'id');
 	my $digest = extract_param($param, 'digest');
 	my $delete = extract_param($param, 'delete');
+	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
 
 	my $code = sub {
 	    my $cfg = PVE::ReplicationConfig->new();
@@ -199,7 +203,9 @@ __PACKAGE__->register_method ({
 	    $cfg->write();
 	};
 
-	PVE::ReplicationConfig::lock($code);
+	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
+	    PVE::ReplicationConfig::lock($code);
+	});
 
 	return undef;
     }});
@@ -237,10 +243,12 @@ __PACKAGE__->register_method ({
 
 	my $rpcenv = PVE::RPCEnvironment::get();
 
+	my $id = extract_param($param, 'id');
+	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
+
 	my $code = sub {
 	    my $cfg = PVE::ReplicationConfig->new();
 
-	    my $id = $param->{id};
 	    if ($param->{force}) {
 		raise_param_exc({ 'keep' => "conflicts with parameter 'force'" }) if $param->{keep};
 		delete $cfg->{ids}->{$id};
@@ -262,7 +270,9 @@ __PACKAGE__->register_method ({
 	    $cfg->write();
 	};
 
-	PVE::ReplicationConfig::lock($code);
+	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
+	    PVE::ReplicationConfig::lock($code);
+	});
 
 	return undef;
     }});
-- 
2.20.1

* [pve-devel] [PATCH qemu-server 2/3] Repeat check for replication target in locked section
From: Fabian Ebner @ 2020-07-30 11:29 UTC
  To: pve-devel

The check is now repeated inside the locked section in
QemuMigrate::prepare(), so there is no need to warn twice; the
warning from the outer, unlocked check was removed.
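The underlying pattern is the classic check-then-lock race: a check
done before a lock is taken can be invalidated before the critical
section starts, so the authoritative check has to be repeated inside
it. A generic sketch (the helper is hypothetical, standing in for the
replication-target check in PVE/API2/Qemu.pm and QemuMigrate::prepare()):

    use strict;
    use warnings;
    use PVE::GuestHelpers;

    # hypothetical stand-in for the real replication-target check
    sub is_valid_replication_target {
        my ($vmid, $target) = @_;
        # ... would consult PVE::ReplicationConfig here ...
        return 1;
    }

    my ($vmid, $target) = (100, 'nodeB');    # example values

    # unlocked early check: a cheap fast-fail that may still race with
    # concurrent replication config changes
    die "not a replication target\n"
        if !is_valid_replication_target($vmid, $target);

    PVE::GuestHelpers::guest_migration_lock($vmid, 10, sub {
        # authoritative re-check inside the lock: with patch 1/3
        # applied, the config can no longer change underneath us here
        die "not a replication target\n"
            if !is_valid_replication_target($vmid, $target);
        # ... perform the migration ...
    });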

Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
 PVE/API2/Qemu.pm   | 11 +++--------
 PVE/QemuMigrate.pm | 13 +++++++++++++
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
index 8da616a..bc67666 100644
--- a/PVE/API2/Qemu.pm
+++ b/PVE/API2/Qemu.pm
@@ -3539,14 +3539,9 @@ __PACKAGE__->register_method({
 	    my $repl_conf = PVE::ReplicationConfig->new();
 	    my $is_replicated = $repl_conf->check_for_existing_jobs($vmid, 1);
 	    my $is_replicated_to_target = defined($repl_conf->find_local_replication_job($vmid, $target));
-	    if ($is_replicated && !$is_replicated_to_target) {
-		if ($param->{force}) {
-		    warn "WARNING: Node '$target' is not a replication target. Existing replication " .
-		         "jobs will fail after migration!\n";
-		} else {
-		    die "Cannot live-migrate replicated VM to node '$target' - not a replication target." .
-		        " Use 'force' to override.\n";
-		}
+	    if (!$param->{force} && $is_replicated && !$is_replicated_to_target) {
+		die "Cannot live-migrate replicated VM to node '$target' - not a replication " .
+		    "target. Use 'force' to override.\n";
 	    }
 	} else {
 	    warn "VM isn't running. Doing offline migration instead.\n" if $param->{online};
diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
index b699b67..a20e1c7 100644
--- a/PVE/QemuMigrate.pm
+++ b/PVE/QemuMigrate.pm
@@ -227,6 +227,19 @@ sub prepare {
 	die "can't migrate running VM without --online\n" if !$online;
 	$running = $pid;
 
+	my $repl_conf = PVE::ReplicationConfig->new();
+	my $is_replicated = $repl_conf->check_for_existing_jobs($vmid, 1);
+	my $is_replicated_to_target = defined($repl_conf->find_local_replication_job($vmid, $self->{node}));
+	if ($is_replicated && !$is_replicated_to_target) {
+	    if ($self->{opts}->{force}) {
+		$self->log('warn', "WARNING: Node '$self->{node}' is not a replication target. Existing " .
+			           "replication jobs will fail after migration!\n");
+	    } else {
+		die "Cannot live-migrate replicated VM to node '$self->{node}' - not a replication " .
+		    "target. Use 'force' to override.\n";
+	    }
+	}
+
 	$self->{forcemachine} = PVE::QemuServer::Machine::qemu_machine_pxe($vmid, $conf);
 
 	# To support custom CPU types, we keep QEMU's "-cpu" parameter intact.
-- 
2.20.1

* [pve-devel] [PATCH/RFC qemu-server 3/3] Fix checks for transferring replication state/switching job target
From: Fabian Ebner @ 2020-07-30 11:29 UTC
  To: pve-devel

When there are offline disks, $self->{replicated_volumes} will be
auto-vivified to {} by the check:
next if $self->{replicated_volumes}->{$volid}
in sync_disks(), and the empty hashref {} then evaluates to true in a
boolean context.
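For readers less familiar with this Perl pitfall, a standalone
illustration (the volume ID is just an example value):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $self = {};
    my $volid = 'local-zfs:vm-100-disk-0';    # example value

    # same shape as the check in sync_disks(): merely *reading* the
    # nested key autovivifies the intermediate hashref
    my $seen = $self->{replicated_volumes}->{$volid};

    print "autovivified\n" if exists $self->{replicated_volumes};  # prints
    print "truthy\n" if $self->{replicated_volumes};               # also prints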

Now the replication job information is retrieved once in prepare(),
and the existence of a job, rather than whether any volumes were
actually replicated, decides whether to make the calls.

For offline migration to a non-replication target, there is no
$self->{replicated_volumes}, but the state should be transferred
nonetheless.

For online migration to a non-replication target, replication is
broken afterwards anyway, so it doesn't make much of a difference
whether the state is transferred or not.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Hope I'm not misinterpreting when these calls should or shouldn't
be made.
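A small sketch of the resulting decision (the helper is made up for
illustration; the real code checks $self->{is_replicated} and
$self->{replication_jobcfg} directly, as the diff below shows):

    use strict;
    use warnings;

    # illustrative only: which replication calls phase3_cleanup() makes
    sub replication_cleanup_actions {
        my ($is_replicated, $has_job_to_target) = @_;

        my @actions;
        # state is transferred whenever the guest has any replication
        # job, even when migrating to a non-replication target
        push @actions, 'transfer_replication_state' if $is_replicated;
        # the job target is only switched if a job points at the new node
        push @actions, 'switch_replication_job_target' if $has_job_to_target;
        return @actions;
    }

    # offline migration to a non-replication target:
    print join(', ', replication_cleanup_actions(1, 0)), "\n";
    # -> transfer_replication_state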

 PVE/QemuMigrate.pm | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
index a20e1c7..6097ef2 100644
--- a/PVE/QemuMigrate.pm
+++ b/PVE/QemuMigrate.pm
@@ -220,6 +220,10 @@ sub prepare {
     # test if VM exists
     my $conf = $self->{vmconf} = PVE::QemuConfig->load_config($vmid);
 
+    my $repl_conf = PVE::ReplicationConfig->new();
+    $self->{replication_jobcfg} = $repl_conf->find_local_replication_job($vmid, $self->{node});
+    $self->{is_replicated} = $repl_conf->check_for_existing_jobs($vmid, 1);
+
     PVE::QemuConfig->check_lock($conf);
 
     my $running = 0;
@@ -227,10 +231,7 @@ sub prepare {
 	die "can't migrate running VM without --online\n" if !$online;
 	$running = $pid;
 
-	my $repl_conf = PVE::ReplicationConfig->new();
-	my $is_replicated = $repl_conf->check_for_existing_jobs($vmid, 1);
-	my $is_replicated_to_target = defined($repl_conf->find_local_replication_job($vmid, $self->{node}));
-	if ($is_replicated && !$is_replicated_to_target) {
+	if ($self->{is_replicated} && !$self->{replication_jobcfg}) {
 	    if ($self->{opts}->{force}) {
 		$self->log('warn', "WARNING: Node '$self->{node}' is not a replication target. Existing " .
 			           "replication jobs will fail after migration!\n");
@@ -362,9 +363,7 @@ sub sync_disks {
 	    });
 	}
 
-	my $rep_cfg = PVE::ReplicationConfig->new();
-	my $replication_jobcfg = $rep_cfg->find_local_replication_job($vmid, $self->{node});
-	my $replicatable_volumes = !$replication_jobcfg ? {}
+	my $replicatable_volumes = !$self->{replication_jobcfg} ? {}
 	    : PVE::QemuConfig->get_replicatable_volumes($storecfg, $vmid, $conf, 0, 1);
 
 	my $test_volid = sub {
@@ -489,7 +488,7 @@ sub sync_disks {
 	    }
 	}
 
-	if ($replication_jobcfg) {
+	if ($self->{replication_jobcfg}) {
 	    if ($self->{running}) {
 
 		my $version = PVE::QemuServer::kvm_user_version();
@@ -523,7 +522,7 @@ sub sync_disks {
 	    my $start_time = time();
 	    my $logfunc = sub { $self->log('info', shift) };
 	    $self->{replicated_volumes} = PVE::Replication::run_replication(
-	       'PVE::QemuConfig', $replication_jobcfg, $start_time, $start_time, $logfunc);
+	       'PVE::QemuConfig', $self->{replication_jobcfg}, $start_time, $start_time, $logfunc);
 	}
 
 	# sizes in config have to be accurate for remote node to correctly
@@ -1193,7 +1192,7 @@ sub phase3_cleanup {
     }
 
     # transfer replication state before move config
-    $self->transfer_replication_state() if $self->{replicated_volumes};
+    $self->transfer_replication_state() if $self->{is_replicated};
 
     # move config to remote node
     my $conffile = PVE::QemuConfig->config_file($vmid);
@@ -1202,7 +1201,7 @@ sub phase3_cleanup {
     die "Failed to move config to node '$self->{node}' - rename failed: $!\n"
         if !rename($conffile, $newconffile);
 
-    $self->switch_replication_job_target() if $self->{replicated_volumes};
+    $self->switch_replication_job_target() if $self->{replication_jobcfg};
 
     if ($self->{livemigration}) {
 	if ($self->{stopnbd}) {
-- 
2.20.1

* Re: [pve-devel] [PATCH manager 1/3] Hold the guest migration lock when changing the replication config
From: Fabian Ebner @ 2020-08-03  7:11 UTC
  To: pve-devel

On 30.07.20 13:29, Fabian Ebner wrote:
> The guest migration lock is already held when running replications,
> but it also makes sense to hold it when updating the replication
> config itself. Otherwise, a migration may not see the de facto
> state of replication.
> 
> For example:
> 1. migration starts
> 2. replication job is deleted
> 3. migration reads the replication config
> 4. migration runs the replication which causes the
>     replicated disks to be removed, because the job
>     is marked for removal
> 5. migration will continue without replication
> 

This situation can still happen even with the locking from this patch:
1. replication job is deleted
2. migration starts before the replication has run, so the job is still 
marked for removal in the replication config
3.-5. same as above

So we probably want to check during migration whether the replication 
job that we want to use is marked for removal. If it is, we could:
- leave the situation as is, i.e. the replication job will be removed 
during migration and migration will continue without replication
- fail the migration (principle of least surprise?); see the sketch 
after this list
- run replication without the removal mark during migration. Then the 
replication job would be removed the next time replication runs after 
migration and hence after the target was switched.
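A minimal sketch of the "fail the migration" option mentioned above
(this assumes the parsed job hash carries the remove_job marker that
job deletion sets in replication.cfg; the sub name is made up):

    use strict;
    use warnings;

    # sketch only: refuse to migrate while the replication job we
    # would rely on is scheduled for removal
    sub assert_job_not_marked_for_removal {
        my ($jobcfg, $vmid) = @_;

        return if !$jobcfg;                  # no job for this target
        return if !$jobcfg->{remove_job};    # job not marked for removal

        die "replication job for guest $vmid is marked for removal - "
            . "run replication or finish the removal before migrating\n";
    }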

Also: if we only read the replication config once during a migration, 
the locking from this patch shouldn't even be necessary. 
switch_replication_job_target does read the config once more, but that 
would still be compatible with allowing other changes to the replication 
config during migration. But of course this locking might make things 
more future-proof.

> Note that the migration doesn't actually fail, but it's probably
> not the desired behavior either.
> 
> Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
>   PVE/API2/ReplicationConfig.pm | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/ReplicationConfig.pm b/PVE/API2/ReplicationConfig.pm
> index 2b4ecd10..e5262068 100644
> --- a/PVE/API2/ReplicationConfig.pm
> +++ b/PVE/API2/ReplicationConfig.pm
> @@ -9,6 +9,7 @@ use PVE::JSONSchema qw(get_standard_option);
>   use PVE::RPCEnvironment;
>   use PVE::ReplicationConfig;
>   use PVE::Cluster;
> +use PVE::GuestHelpers;
>   
>   use PVE::RESTHandler;
>   
> @@ -144,7 +145,9 @@ __PACKAGE__->register_method ({
>   	    $cfg->write();
>   	};
>   
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});
>   
>   	return undef;
>       }});
> @@ -167,6 +170,7 @@ __PACKAGE__->register_method ({
>   	my $id = extract_param($param, 'id');
>   	my $digest = extract_param($param, 'digest');
>   	my $delete = extract_param($param, 'delete');
> +	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
>   
>   	my $code = sub {
>   	    my $cfg = PVE::ReplicationConfig->new();
> @@ -199,7 +203,9 @@ __PACKAGE__->register_method ({
>   	    $cfg->write();
>   	};
>   
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});
>   
>   	return undef;
>       }});
> @@ -237,10 +243,12 @@ __PACKAGE__->register_method ({
>   
>   	my $rpcenv = PVE::RPCEnvironment::get();
>   
> +	my $id = extract_param($param, 'id');
> +	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
> +
>   	my $code = sub {
>   	    my $cfg = PVE::ReplicationConfig->new();
>   
> -	    my $id = $param->{id};
>   	    if ($param->{force}) {
>   		raise_param_exc({ 'keep' => "conflicts with parameter 'force'" }) if $param->{keep};
>   		delete $cfg->{ids}->{$id};
> @@ -262,7 +270,9 @@ __PACKAGE__->register_method ({
>   	    $cfg->write();
>   	};
>   
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});
>   
>   	return undef;
>       }});

* Re: [pve-devel] [PATCH manager 1/3] Hold the guest migration lock when changing the replication config
From: Fabian Grünbichler @ 2020-08-03  7:49 UTC
  To: Fabian Ebner, pve-devel

On July 30, 2020 1:29 pm, Fabian Ebner wrote:
> The guest migration lock is already held when running replications,
> but it also makes sense to hold it when updating the replication
> config itself. Otherwise, a migration may not see the de facto
> state of replication.
> 
> For example:
> 1. migration starts
> 2. replication job is deleted
> 3. migration reads the replication config
> 4. migration runs the replication which causes the
>    replicated disks to be removed, because the job
>    is marked for removal
> 5. migration will continue without replication
> 
> Note that the migration doesn't actually fail, but it's probably
> not the desired behavior either.
> 
> Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
>  PVE/API2/ReplicationConfig.pm | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/ReplicationConfig.pm b/PVE/API2/ReplicationConfig.pm
> index 2b4ecd10..e5262068 100644
> --- a/PVE/API2/ReplicationConfig.pm
> +++ b/PVE/API2/ReplicationConfig.pm
> @@ -9,6 +9,7 @@ use PVE::JSONSchema qw(get_standard_option);
>  use PVE::RPCEnvironment;
>  use PVE::ReplicationConfig;
>  use PVE::Cluster;
> +use PVE::GuestHelpers;
>  
>  use PVE::RESTHandler;
>  
> @@ -144,7 +145,9 @@ __PACKAGE__->register_method ({
>  	    $cfg->write();
>  	};
>  
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});

it might make sense to have a single wrapper for this, or add the guest 
ID as parameter to ReplicationConfig::lock (to not miss it or get the 
order wrong).
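e.g. a sketch of such a wrapper (illustrative only; keeps the lock 
order from the patch):

    package PVE::ReplicationConfig;

    use PVE::GuestHelpers;

    # take the guest ID alongside the code ref, so callers can't forget
    # the outer lock or nest the two locks in the wrong order
    sub lock_with_guest_lock {
        my ($guest_id, $code) = @_;

        PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
            PVE::ReplicationConfig::lock($code);
        });
    }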

what about the calls to lock within ReplicationConfig? they are all 
job/guest ID specific, and should also get this additional protection, 
right?

from a quick glance, there seems to be only a single call to 
ReplicationConfig::lock that spans more than one job (job_status in 
ReplicationState), but that immediately iterates over jobs, so we could 
either move the lock into the loop (expensive, since it involves a 
cfs_lock), or split the cfs and flock just for this instance?

(side note, that code and possibly other stuff in ReplicationConfig is 
buggy since it does not re-read the config after locking)
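the first variant could look roughly like this (sketch only; it keeps 
the guest lock outside the config lock as in the patch, and re-reads 
the config after locking, see the side note above):

    use PVE::GuestHelpers;
    use PVE::ReplicationConfig;

    # correct lock ordering, but every iteration now pays for a cfs_lock
    my $cfg = PVE::ReplicationConfig->new();
    for my $jobid (sort keys %{$cfg->{ids}}) {
        my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($jobid);
        PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
            PVE::ReplicationConfig::lock(sub {
                my $locked_cfg = PVE::ReplicationConfig->new();    # re-read under lock
                # ... update job state for $jobid ...
            });
        });
    }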

>  
>  	return undef;
>      }});
> @@ -167,6 +170,7 @@ __PACKAGE__->register_method ({
>  	my $id = extract_param($param, 'id');
>  	my $digest = extract_param($param, 'digest');
>  	my $delete = extract_param($param, 'delete');
> +	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
>  
>  	my $code = sub {
>  	    my $cfg = PVE::ReplicationConfig->new();
> @@ -199,7 +203,9 @@ __PACKAGE__->register_method ({
>  	    $cfg->write();
>  	};
>  
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});
>  
>  	return undef;
>      }});
> @@ -237,10 +243,12 @@ __PACKAGE__->register_method ({
>  
>  	my $rpcenv = PVE::RPCEnvironment::get();
>  
> +	my $id = extract_param($param, 'id');
> +	my ($guest_id) = PVE::ReplicationConfig::parse_replication_job_id($id);
> +
>  	my $code = sub {
>  	    my $cfg = PVE::ReplicationConfig->new();
>  
> -	    my $id = $param->{id};
>  	    if ($param->{force}) {
>  		raise_param_exc({ 'keep' => "conflicts with parameter 'force'" }) if $param->{keep};
>  		delete $cfg->{ids}->{$id};
> @@ -262,7 +270,9 @@ __PACKAGE__->register_method ({
>  	    $cfg->write();
>  	};
>  
> -	PVE::ReplicationConfig::lock($code);
> +	PVE::GuestHelpers::guest_migration_lock($guest_id, 10, sub {
> +	    PVE::ReplicationConfig::lock($code);
> +	});
>  
>  	return undef;
>      }});
> -- 
> 2.20.1
