From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH v2 guest-common 2/2] fix 3111: replicate guest on rollback if there are replication jobs for it
Date: Tue, 22 Jun 2021 09:41:35 +0200
Message-ID: <1624346775.lqzf5lnz7c.astroid@nora.none>
In-Reply-To: <20210609091858.27219-2-f.ebner@proxmox.com>

On June 9, 2021 11:18 am, Fabian Ebner wrote:
> so that there will be a valid replication snapshot again.
> 
> Otherwise, replication will be broken after a rollback if the last
> (non-replication) snapshot is removed before replication can run again.

I still see issues with these two patches applied:

A: snapshot 'test2', successful replication afterwards, then rollback to 
'test2':

delete stale replication snapshot '__replicate_102-1_1624346101__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1624346407__' on local-zfs:vm-102-disk-0
replicating rolled back guest to node 'pve64test02'
start replication job
guest => VM 102, running => 0
volumes => local-zfs:vm-102-disk-0
create snapshot '__replicate_102-0_1624346600__' on local-zfs:vm-102-disk-0
using secure transmission, rate limit: none
incremental sync 'local-zfs:vm-102-disk-0' (test2 => __replicate_102-0_1624346600__)
send from @test2 to rpool/data/vm-102-disk-0@__replicate_102-0_1624346600__ estimated size is 624B
total estimated size is 624B
successfully imported 'local-zfs:vm-102-disk-0'
delete previous replication snapshot '__replicate_102-0_1624346407__' on local-zfs:vm-102-disk-0
zfs error: could not find any snapshots to destroy; check snapshot names.
end replication job
replicating rolled back guest to node 'pve64test03'
start replication job
guest => VM 102, running => 0
volumes => local-zfs:vm-102-disk-0
create snapshot '__replicate_102-1_1624346602__' on local-zfs:vm-102-disk-0
using secure transmission, rate limit: none
incremental sync 'local-zfs:vm-102-disk-0' (test2 => __replicate_102-1_1624346602__)
send from @test2 to rpool/data/vm-102-disk-0@__replicate_102-0_1624346600__ estimated size is 624B
send from @__replicate_102-0_1624346600__ to rpool/data/vm-102-disk-0@__replicate_102-1_1624346602__ estimated size is 624B
total estimated size is 1.22K
successfully imported 'local-zfs:vm-102-disk-0'
delete previous replication snapshot '__replicate_102-1_1624346101__' on local-zfs:vm-102-disk-0
zfs error: could not find any snapshots to destroy; check snapshot names.
end replication job

two misleading errors from attempting to delete already-cleaned-up 
snapshots. this is just confusing rather than harmful (likely caused by 
the state file being outdated after prepare()) - replication is still 
working as expected afterwards.
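
something along these lines in prepare(), after removing a job's 
replication snapshot, might avoid the stale deletion attempt. just a 
rough sketch - the PVE::ReplicationState helpers and the $jobid, 
$jobcfg and $snapshot_still_exists variables are assumptions for 
illustration, not the actual API:

    # sketch: if prepare() removed the snapshot that a job's state still
    # references, reset the state so the next run doesn't try to delete
    # (or base an incremental send on) a snapshot that is already gone.
    # read_job_state/write_job_state are assumed helpers, not verified.
    my $state = PVE::ReplicationState::read_job_state($jobcfg);
    if ($state && $state->{last_sync}) {
	my $snapname = "__replicate_${jobid}_$state->{last_sync}__";
	if (!$snapshot_still_exists->($snapname)) {
	    $state->{last_sync} = 0; # force clean re-evaluation next run
	    PVE::ReplicationState::write_job_state($jobcfg, $state);
	}
    }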

B: successful replication, snapshot 'test3', rollback to 'test3'

delete stale replication snapshot '__replicate_102-1_1624346602__' on local-zfs:vm-102-disk-0
delete stale replication snapshot '__replicate_102-0_1624346600__' on local-zfs:vm-102-disk-0
replicating rolled back guest to node 'pve64test02'
start replication job
guest => VM 102, running => 0
volumes => local-zfs:vm-102-disk-0
create snapshot '__replicate_102-0_1624346683__' on local-zfs:vm-102-disk-0
using secure transmission, rate limit: none
full sync 'local-zfs:vm-102-disk-0' (__replicate_102-0_1624346683__)
full send of rpool/data/vm-102-disk-0@test estimated size is 5.59G
send from @test to rpool/data/vm-102-disk-0@test2 estimated size is 624B
send from @test2 to rpool/data/vm-102-disk-0@test3 estimated size is 624B
send from @test3 to rpool/data/vm-102-disk-0@__replicate_102-0_1624346683__ estimated size is 624B
total estimated size is 5.59G
volume 'rpool/data/vm-102-disk-0' already exists
command 'zfs send -Rpv -- rpool/data/vm-102-disk-0@__replicate_102-0_1624346683__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
delete previous replication snapshot '__replicate_102-0_1624346683__' on local-zfs:vm-102-disk-0
end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1624346683__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve64test02' root@10.0.0.172 -- pvesm import local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255
unable to replicate rolled back guest to node 'pve64test02' - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_102-0_1624346683__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve64test02' root@10.0.0.172 -- pvesm import local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255
replicating rolled back guest to node 'pve64test03'
start replication job
guest => VM 102, running => 0
volumes => local-zfs:vm-102-disk-0
create snapshot '__replicate_102-1_1624346685__' on local-zfs:vm-102-disk-0
using secure transmission, rate limit: none
full sync 'local-zfs:vm-102-disk-0' (__replicate_102-1_1624346685__)
full send of rpool/data/vm-102-disk-0@test estimated size is 5.59G
send from @test to rpool/data/vm-102-disk-0@test2 estimated size is 624B
send from @test2 to rpool/data/vm-102-disk-0@test3 estimated size is 624B
send from @test3 to rpool/data/vm-102-disk-0@__replicate_102-1_1624346685__ estimated size is 624B
total estimated size is 5.59G
volume 'rpool/data/vm-102-disk-0' already exists
command 'zfs send -Rpv -- rpool/data/vm-102-disk-0@__replicate_102-1_1624346685__' failed: got signal 13
send/receive failed, cleaning up snapshot(s)..
delete previous replication snapshot '__replicate_102-1_1624346685__' on local-zfs:vm-102-disk-0
end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_102-1_1624346685__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve64test03' root@10.0.0.173 -- pvesm import local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255
unable to replicate rolled back guest to node 'pve64test03' - command 'set -o pipefail && pvesm export local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_102-1_1624346685__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve64test03' root@10.0.0.173 -- pvesm import local-zfs:vm-102-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255

now the replication is broken and requires manual intervention to be 
fixed:

source: no replication snapshots, regular snapshots test, test2, test3

target: one replication snapshot, regular snapshots test, test2 (test3 
is missing)

> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> No changes from v1 (except rebase).
> 
> Not a huge fan of this, but the alternatives I could come up with don't seem
> much better IMHO:
> 
> 1. Invalidate/remove replicated volumes after a rollback altogether and require
> a full sync on the next replication job afterwards.
> 
> 2. Another one is to disallow removing the last non-replication snapshot if:
>     * there is a replication job configured
>     * no replication snapshot for that job currently exists
>       (which likely means it was removed by a previous rollback operation, but
>       can also happen for a new job that didn't run yet).
> 
> 3. Hope not very many people immediately delete their snapshots after rollback.
> 
> Pick a favorite or suggest your own ;)
> 
>  src/PVE/AbstractConfig.pm    | 19 +++++++++++++++++--
>  src/PVE/ReplicationConfig.pm | 14 ++++++++++++++
>  2 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
> index 6542ae4..6cc0537 100644
> --- a/src/PVE/AbstractConfig.pm
> +++ b/src/PVE/AbstractConfig.pm
> @@ -951,6 +951,9 @@ sub snapshot_rollback {
>  
>      my $storecfg = PVE::Storage::config();
>  
> +    my $repl_conf = PVE::ReplicationConfig->new();
> +    my $logfunc = sub { my $line = shift; chomp $line; print "$line\n"; };
> +
>      my $data = {};
>  
>      my $get_snapshot_config = sub {
> @@ -972,7 +975,6 @@ sub snapshot_rollback {
>  	$snap = $get_snapshot_config->($conf);
>  
>  	if ($prepare) {
> -	    my $repl_conf = PVE::ReplicationConfig->new();
>  	    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
>  		# remove replication snapshots on volumes affected by rollback *only*!
>  		my $volumes = $class->get_replicatable_volumes($storecfg, $vmid, $snap, 1);
> @@ -989,7 +991,6 @@ sub snapshot_rollback {
>  		});
>  
>  		# remove all local replication snapshots (jobid => undef)
> -		my $logfunc = sub { my $line = shift; chomp $line; print "$line\n"; };
>  		PVE::Replication::prepare($storecfg, $volids, undef, 1, undef, $logfunc);

I) maybe prepare, or at least this calling context, should hold the 
replication lock so that it can't race with concurrent replication runs?
II) maybe prepare should update the state file in general - if the last 
snapshot referenced there is stale and gets removed, the next 
replication run gets confused. this might/should fix A (see the sketch 
above).
III) maybe prepare and/or run_replication need to learn how to deal with 
the case where only regular snapshots, but not the last one, match. we 
could match on name+guid to ensure the common base actually is a 
previously replicated snapshot (the same snapshot, not just the same 
name with different content) and construct an incremental send stream 
from that shared snapshot instead of attempting a full sync where the 
target already (partially) exists - see the sketch below. that would fix 
B, and improve replication robustness in general by allowing resumption 
from a partially shared state instead of having to remove the target 
and start over from scratch.
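
for the name+guid matching in III, something like this could work - a 
minimal sketch in plain Perl, not the actual PVE::Replication API; the 
snapshot lists would come from e.g. 
`zfs list -Hp -t snapshot -o name,guid -s creation` on both sides:

    # find the newest snapshot that exists on source and target with the
    # same guid, i.e. a true common base for an incremental send.
    # $local/$remote: arrayrefs of { name => ..., guid => ... } hashes,
    # ordered oldest to newest, using the short snapshot name (the part
    # after '@').
    sub find_common_base {
	my ($local, $remote) = @_;

	my %remote_guid = map { $_->{name} => $_->{guid} } @$remote;

	for my $snap (reverse @$local) {
	    my $guid = $remote_guid{$snap->{name}};
	    return $snap->{name} if defined($guid) && $guid eq $snap->{guid};
	}

	return undef; # nothing shared - only a full sync can help
    }

run_replication could then roll the target back to that base (or use 
zfs receive -F) and send incrementally from it (zfs send -I) instead of 
attempting a full sync against an existing target volume.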

implementing III would also avoid the need for doing a replication 
right after rollback - the next regular replication would handle the 
situation just fine, unless ALL previously shared snapshots are removed; 
we could check for that in the snapshot removal code path though. or we 
could just schedule a replication here instead of directly running it 
(sketch below). rollback is an almost instant action (as long as no 
vmstate is involved), while a replication can take a long time, so it 
seems a bit strange to conflate the two.
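
scheduling instead of replicating inline could be as simple as replacing 
the run_replication loop below with something like this - assuming a 
helper along the lines of PVE::ReplicationState::schedule_job_now (as 
used by `pvesr schedule-now`); whether that is directly usable here 
would need checking:

    # sketch: mark all local jobs of this guest so the replication
    # scheduler picks them up on its next iteration, instead of running
    # a potentially long replication inside the rollback code path
    my $replication_jobs = $repl_conf->list_guests_replication_jobs($vmid);
    for my $job (@{$replication_jobs}) {
	eval { PVE::ReplicationState::schedule_job_now($job) };
	warn "failed to schedule replication to '$job->{target}' - $@" if $@;
    }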

>  	    }
>  
> @@ -1047,6 +1048,20 @@ sub snapshot_rollback {
>  
>      $prepare = 0;
>      $class->lock_config($vmid, $updatefn);
> +
> +    my $replication_jobs = $repl_conf->list_guests_replication_jobs($vmid);
> +    for my $job (@{$replication_jobs}) {
> +	my $target = $job->{target};
> +	$logfunc->("replicating rolled back guest to node '$target'");
> +
> +	my $start_time = time();
> +	eval {
> +	    PVE::Replication::run_replication($class, $job, $start_time, $start_time, $logfunc);
> +	};
> +	if (my $err = $@) {
> +	    warn "unable to replicate rolled back guest to node '$target' - $err";
> +	}
> +    }
>  }
>  
>  # bash completion helper
> diff --git a/src/PVE/ReplicationConfig.pm b/src/PVE/ReplicationConfig.pm
> index fd856a0..84a718f 100644
> --- a/src/PVE/ReplicationConfig.pm
> +++ b/src/PVE/ReplicationConfig.pm
> @@ -228,6 +228,20 @@ sub find_local_replication_job {
>      return undef;
>  }
>  
> +sub list_guests_replication_jobs {
> +    my ($cfg, $vmid) = @_;
> +
> +    my $jobs = [];
> +
> +    for my $job (values %{$cfg->{ids}}) {
> +	next if $job->{type} ne 'local' || $job->{guest} != $vmid;
> +
> +	push @{$jobs}, $job;
> +    }
> +
> +    return $jobs;
> +}
> +
>  # makes old_target the new source for all local jobs of this guest
>  # makes new_target the target for the single local job with target old_target
>  sub switch_replication_job_target_nolock {
> -- 
> 2.30.2
> 