public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH guest-common 1/2] replication: update last_sync before removing old replication snapshots
@ 2021-11-26 10:52 Fabian Ebner
  2021-11-26 10:52 ` [pve-devel] [PATCH guest-common 2/2] replication: prepare: simplify code Fabian Ebner
  0 siblings, 1 reply; 3+ messages in thread
From: Fabian Ebner @ 2021-11-26 10:52 UTC (permalink / raw)
  To: pve-devel

If pvesr was terminated after finishing with the new sync and after
removing old replication snapshots, but before it could write the new
state, the next replication would fail. It would wrongly interpret the
actual last replication snapshot as stale, remove it, and (if no other
snapshots are present) attempt a full sync, which would fail.

Reported in the community forum [0], this was brought to light by the
new pvescheduler before it learned graceful reload.

It's not possible to simply preserve a last remaining snapshot in
prepare(), because prepare() is also used for valid removals. Instead,
update last_sync early enough. Stale snapshots will still be removed
on the next run if there are any.

[0]: https://forum.proxmox.com/threads/100154

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
 src/PVE/Replication.pm      | 3 +++
 src/PVE/ReplicationState.pm | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/src/PVE/Replication.pm b/src/PVE/Replication.pm
index 051dfd9..de652f2 100644
--- a/src/PVE/Replication.pm
+++ b/src/PVE/Replication.pm
@@ -372,6 +372,9 @@ sub replicate {
 	die $err;
     }
 
+    # Ensure that new sync is recorded before removing old replication snapshots.
+    PVE::ReplicationState::record_sync_end($jobcfg, $state, $start_time);
+
     # remove old snapshots because they are no longer needed
     $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
 
diff --git a/src/PVE/ReplicationState.pm b/src/PVE/ReplicationState.pm
index 81a1b31..8efe0e2 100644
--- a/src/PVE/ReplicationState.pm
+++ b/src/PVE/ReplicationState.pm
@@ -159,6 +159,13 @@ sub delete_guest_states {
     PVE::Tools::lock_file($state_lock, 10, $code);
 }
 
+sub record_sync_end {
+    my ($jobcfg, $state, $start_time) = @_;
+
+    $state->{last_sync} = $start_time;
+    write_job_state($jobcfg, $state);
+}
+
 sub record_job_end {
     my ($jobcfg, $state, $start_time, $duration, $err) = @_;
 
-- 
2.30.2





^ permalink raw reply	[flat|nested] 3+ messages in thread

* [pve-devel] [PATCH guest-common 2/2] replication: prepare: simplify code
  2021-11-26 10:52 [pve-devel] [PATCH guest-common 1/2] replication: update last_sync before removing old replication snapshots Fabian Ebner
@ 2021-11-26 10:52 ` Fabian Ebner
  2021-11-29 10:18   ` [pve-devel] applied-series: " Fabian Grünbichler
  0 siblings, 1 reply; 3+ messages in thread
From: Fabian Ebner @ 2021-11-26 10:52 UTC (permalink / raw)
  To: pve-devel

No functional change is intended.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
 src/PVE/Replication.pm | 46 +++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 25 deletions(-)

diff --git a/src/PVE/Replication.pm b/src/PVE/Replication.pm
index de652f2..31cabec 100644
--- a/src/PVE/Replication.pm
+++ b/src/PVE/Replication.pm
@@ -176,32 +176,28 @@ sub prepare {
     foreach my $volid (@$volids) {
 	my $info = PVE::Storage::volume_snapshot_info($storecfg, $volid);
 	for my $snap (keys $info->%*) {
-	    if ((defined($snapname) && ($snap eq $snapname)) ||
-		(defined($parent_snapname) && ($snap eq $parent_snapname))) {
-		$last_snapshots->{$volid}->{$snap} = $info->{$snap};
-	    } elsif ($snap =~ m/^\Q$prefix\E/) {
-		if ($last_sync != 0) {
-		    $logfunc->("delete stale replication snapshot '$snap' on $volid");
-		    eval {
-			PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
-			$cleaned_replicated_volumes->{$volid} = 1;
-		    };
-
-		    # If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
-		    # The likelihood that the delete has worked out is high at a timeout.
-		    # If it really fails, it will try to remove on the next run.
-		    if (my $err = $@) {
-			# warn is for syslog/journal.
-			warn $err;
-
-			# logfunc will written in replication log.
-			$logfunc->("delete stale replication snapshot error: $err");
-		    }		
-		# Last_sync=0 and a replication snapshot only occur, if the VM was stolen
-		} else {
-		    $last_snapshots->{$volid}->{$snap} = $info->{$snap};
+	    if ( # check if it's a stale replication snapshot
+		!(defined($snapname) && $snap eq $snapname) &&
+		!(defined($parent_snapname) && $snap eq $parent_snapname) &&
+		$snap =~ m/^\Q$prefix\E/ &&
+		$last_sync != 0 # last_sync is 0 if the VM was stolen
+	    ) {
+		$logfunc->("delete stale replication snapshot '$snap' on $volid");
+		eval {
+		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
+		    $cleaned_replicated_volumes->{$volid} = 1;
+		};
+
+		# If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
+		# The likelihood that the delete has worked out is high at a timeout.
+		# If it really fails, it will try to remove on the next run.
+		if (my $err = $@) {
+		    # warn is for syslog/journal.
+		    warn $err;
+
+		    # logfunc will written in replication log.
+		    $logfunc->("delete stale replication snapshot error: $err");
 		}
-	    # Other snapshots might need to serve as replication base after rollback
 	    } else {
 		$last_snapshots->{$volid}->{$snap} = $info->{$snap};
 	    }
-- 
2.30.2





^ permalink raw reply	[flat|nested] 3+ messages in thread

* [pve-devel] applied-series: [PATCH guest-common 2/2] replication: prepare: simplify code
  2021-11-26 10:52 ` [pve-devel] [PATCH guest-common 2/2] replication: prepare: simplify code Fabian Ebner
@ 2021-11-29 10:18   ` Fabian Grünbichler
  0 siblings, 0 replies; 3+ messages in thread
From: Fabian Grünbichler @ 2021-11-29 10:18 UTC (permalink / raw)
  To: Proxmox VE development discussion

On November 26, 2021 11:52 am, Fabian Ebner wrote:
> No functional change is intended.
> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
>  src/PVE/Replication.pm | 46 +++++++++++++++++++-----------------------
>  1 file changed, 21 insertions(+), 25 deletions(-)
> 
> diff --git a/src/PVE/Replication.pm b/src/PVE/Replication.pm
> index de652f2..31cabec 100644
> --- a/src/PVE/Replication.pm
> +++ b/src/PVE/Replication.pm
> @@ -176,32 +176,28 @@ sub prepare {
>      foreach my $volid (@$volids) {
>  	my $info = PVE::Storage::volume_snapshot_info($storecfg, $volid);
>  	for my $snap (keys $info->%*) {
> -	    if ((defined($snapname) && ($snap eq $snapname)) ||
> -		(defined($parent_snapname) && ($snap eq $parent_snapname))) {
> -		$last_snapshots->{$volid}->{$snap} = $info->{$snap};
> -	    } elsif ($snap =~ m/^\Q$prefix\E/) {
> -		if ($last_sync != 0) {
> -		    $logfunc->("delete stale replication snapshot '$snap' on $volid");
> -		    eval {
> -			PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> -			$cleaned_replicated_volumes->{$volid} = 1;
> -		    };
> -
> -		    # If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
> -		    # The likelihood that the delete has worked out is high at a timeout.
> -		    # If it really fails, it will try to remove on the next run.
> -		    if (my $err = $@) {
> -			# warn is for syslog/journal.
> -			warn $err;
> -
> -			# logfunc will written in replication log.
> -			$logfunc->("delete stale replication snapshot error: $err");
> -		    }		
> -		# Last_sync=0 and a replication snapshot only occur, if the VM was stolen
> -		} else {
> -		    $last_snapshots->{$volid}->{$snap} = $info->{$snap};
> +	    if ( # check if it's a stale replication snapshot
> +		!(defined($snapname) && $snap eq $snapname) &&
> +		!(defined($parent_snapname) && $snap eq $parent_snapname) &&
> +		$snap =~ m/^\Q$prefix\E/ &&
> +		$last_sync != 0 # last_sync is 0 if the VM was stolen
> +	    ) {
> +		$logfunc->("delete stale replication snapshot '$snap' on $volid");
> +		eval {
> +		    PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> +		    $cleaned_replicated_volumes->{$volid} = 1;
> +		};
> +
> +		# If deleting the snapshot fails, we can not be sure if it was due to an error or a timeout.
> +		# The likelihood that the delete has worked out is high at a timeout.
> +		# If it really fails, it will try to remove on the next run.
> +		if (my $err = $@) {
> +		    # warn is for syslog/journal.
> +		    warn $err;
> +
> +		    # logfunc will written in replication log.
> +		    $logfunc->("delete stale replication snapshot error: $err");
>  		}
> -	    # Other snapshots might need to serve as replication base after rollback
>  	    } else {
>  		$last_snapshots->{$volid}->{$snap} = $info->{$snap};
>  	    }
> -- 
> 2.30.2
> 
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> 
> 




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-11-29 10:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-26 10:52 [pve-devel] [PATCH guest-common 1/2] replication: update last_sync before removing old replication snapshots Fabian Ebner
2021-11-26 10:52 ` [pve-devel] [PATCH guest-common 2/2] replication: prepare: simplify code Fabian Ebner
2021-11-29 10:18   ` [pve-devel] applied-series: " Fabian Grünbichler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal