From mboxrd@z Thu Jan  1 00:00:00 1970
From: Fabian Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Thu, 12 Aug 2021 13:01:10 +0200
Message-Id: <20210812110111.73883-12-f.ebner@proxmox.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20210812110111.73883-1-f.ebner@proxmox.com>
References: <20210812110111.73883-1-f.ebner@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [pve-devel] [PATCH v3 guest-common 6/7] partially fix #3111: replication: be less picky when selecting incremental base
List-Id: Proxmox VE development discussion

After rollback, it might be necessary to start the replication from an
earlier, possibly non-replication, snapshot, because the replication
snapshot might have been removed from the source node. Previously,
replication could only recover if the current parent snapshot was
already replicated.

To get into the bad situation (with no replication happening between
the steps):
1. have an existing replication
2. take a new snapshot
3. roll back to that snapshot
In case the partial fix to only remove blocking replication snapshots
for rollback was already applied, an additional step is necessary to
get into the bad situation:
4. take a second new snapshot

Since non-replication snapshots, for which no timestamp is readily
available, are now also included, it is necessary to filter them out
when probing for replication snapshots.

If no common replication snapshot is present, iterate backwards
through the config snapshots.

The changes are backwards compatible: if one side is running the old
code and the other the new code, the fact that one of the two
prepare() calls does not return the new additional snapshot candidates
means that no new match is possible. Since the new prepare() returns a
superset, no previously possible match becomes impossible.
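For illustration only (not part of the patch): a minimal standalone Perl
sketch of the replication-snapshot ordering this patch introduces, i.e.
keeping only replication snapshots and sorting them by the timestamp
embedded in their names, newest first. The $vmid value and the snapshot
names are made up.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $vmid = 100; # made-up guest ID

    my @snapshots = (
        '__replicate_100-0_1628766000__',
        '__replicate_100-1_1628769600__',
        'manual-snapshot', # non-replication snapshot, gets timestamp 0
    );

    my @desc_sorted_snap =
        map { $_->[1] }
        sort { $b->[0] <=> $a->[0] }
        grep { $_->[0] != 0 } # drop non-replication snapshots
        map { [ ($_ =~ /__replicate_\Q$vmid\E-(?:\d+)_(\d+)_/)[0] || 0, $_ ] }
        @snapshots;

    # Prints the two replication snapshots, newest timestamp first.
    print "$_\n" for @desc_sorted_snap;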
The branch with @desc_sorted_snap is now taken more often, but it can
still only be taken when the volume exists on the remote side (and has
snapshots). In such cases, it is safe to die if no incremental base
snapshot can be found, because a full sync would not be possible.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

$parent_snapname could now slowly be dropped from prepare(), but I'll
save that for later (it'll take at least until 8.x because of
backwards compatibility anyways).

 src/PVE/Replication.pm | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/PVE/Replication.pm b/src/PVE/Replication.pm
index 4056ea2..2609ad6 100644
--- a/src/PVE/Replication.pm
+++ b/src/PVE/Replication.pm
@@ -65,10 +65,12 @@ sub find_common_replication_snapshot {
 	    ($last_snapshots->{$volid}->{$parent_snapname} &&
 	     $remote_snapshots->{$volid}->{$parent_snapname})) {
 	    $base_snapshots->{$volid} = $parent_snapname;
-	} elsif ($last_sync == 0) {
+	} else {
+	    # First, try all replication snapshots.
 	    my @desc_sorted_snap =
 		map { $_->[1] } sort { $b->[0] <=> $a->[0] }
-		map { [ ($_ =~ /__replicate_\Q$jobid\E_(\d+)_/)[0] || 0, $_ ] }
+		grep { $_->[0] != 0 } # only consider replication snapshots
+		map { [ ($_ =~ /__replicate_\Q$vmid\E-(?:\d+)_(\d+)_/)[0] || 0, $_ ] }
 		keys %{$remote_snapshots->{$volid}};
 
 	    foreach my $remote_snap (@desc_sorted_snap) {
@@ -77,6 +79,28 @@ sub find_common_replication_snapshot {
 		    last;
 		}
 	    }
+
+	    # Then, try config snapshots ($parent_snapname was already tested for above).
+	    my $snapname = $parent_snapname // '';
+
+	    # Be robust against loop, just in case.
+	    # https://forum.proxmox.com/threads/snapshot-not-working.69511/post-312281
+	    my $max_count = scalar(keys $guest_conf->{snapshots}->%*);
+	    for (my $count = 0; $count < $max_count; $count++) {
+		last if defined($base_snapshots->{$volid});
+
+		my $snap_conf = $guest_conf->{snapshots}->{$snapname} || last;
+		$snapname = $snap_conf->{parent} // last;
+
+		if ($last_snapshots->{$volid}->{$snapname} &&
+		    $remote_snapshots->{$volid}->{$snapname})
+		{
+		    $base_snapshots->{$volid} = $snapname;
+		}
+	    }
+
+	    # The volume exists on the remote side, so trying a full sync won't work.
+	    # Die early with a clean error.
 	    die "No common base to restore the job state\n".
 		"please delete jobid: $jobid and create the job again\n"
 		if !defined($base_snapshots->{$volid});
@@ -182,6 +206,9 @@ sub prepare {
 		} else {
 		    $last_snapshots->{$volid}->{$snap} = 1;
 		}
+	    # Other snapshots might need to serve as replication base after rollback
+	    } else {
+		$last_snapshots->{$volid}->{$snap} = 1;
 	    }
 	}
     }
-- 
2.30.2
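
For illustration only (not part of the patch): a small self-contained Perl
sketch of the bounded parent-chain walk used as the config-snapshot fallback
above, showing why capping the iteration count at the number of snapshots
keeps a (corrupt) config with a parent loop from hanging the walk. The
snapshot names and the hash layout are made up.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up snapshot metadata, shaped like the guest config's snapshot hash.
    my %snapshots = (
        snap3 => { parent => 'snap2' },
        snap2 => { parent => 'snap1' },
        snap1 => {}, # oldest snapshot, has no parent
    );

    my $snapname = 'snap3'; # start from the current parent snapshot

    # Bound the number of iterations so a looping parent chain cannot
    # make the walk spin forever.
    my $max_count = scalar(keys %snapshots);
    for (my $count = 0; $count < $max_count; $count++) {
        my $snap_conf = $snapshots{$snapname} || last;
        $snapname = $snap_conf->{parent} // last;
        print "next candidate base snapshot: $snapname\n";
    }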