public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms
@ 2022-06-03  7:16 Dominik Csapak
  2022-06-03  7:16 ` [pve-devel] [PATCH guest-common v2 2/2] ReplicationState: deterministically order replication jobs Dominik Csapak
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Dominik Csapak @ 2022-06-03  7:16 UTC (permalink / raw)
  To: pve-devel

when running replication, we don't want to keep replication states for
non-local vms. Normally this would not be a problem, since on migration,
we transfer the states anyway, but when the ha-manager steals a vm, it
cannot do that. In that case, having an old state lying around is
harmful, since the code does not expect the state to be out-of-sync
with the actual snapshots on disk.

One such problem is the following:

Replicate vm 100 from node A to nodes B and C, and activate HA. When node
A dies, the vm will be relocated to, e.g., node B and replication starts
from there. If node B still has an old state lying around for its sync to
node C, it might delete the common base snapshots of B and C and then
cannot sync again.

Deleting the state for all non-local guests fixes that issue, since
replication then always starts fresh, and any potentially existing old
state cannot be valid anyway, since we just relocated the vm here (from a
dead node).
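
A small standalone sketch of that check (with a made-up $vms hash rather
than the real cluster-wide guest list), just to illustrate which states
survive:

#!/usr/bin/perl
use strict;
use warnings;

my $local_node = 'B';
# made-up guest list; in reality this comes from the cluster configuration
my $vms = {
    ids => {
        100 => { node => 'B' },   # local guest -> state is kept
        101 => { node => 'C' },   # non-local guest -> state is purged
    },
};

for my $vmid (sort keys %{$vms->{ids}}) {
    my $keep = defined($vms->{ids}->{$vmid})
        && $vms->{ids}->{$vmid}->{node} eq $local_node;
    print "vmid $vmid: ", $keep ? "keep state" : "purge state", "\n";
}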

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 src/PVE/ReplicationState.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/PVE/ReplicationState.pm b/src/PVE/ReplicationState.pm
index 0a5e410..8eebb42 100644
--- a/src/PVE/ReplicationState.pm
+++ b/src/PVE/ReplicationState.pm
@@ -215,7 +215,7 @@ sub purge_old_states {
 	my $tid = $plugin->get_unique_target_id($jobcfg);
 	my $vmid = $jobcfg->{guest};
 	$used_tids->{$vmid}->{$tid} = 1
-	    if defined($vms->{ids}->{$vmid}); # && $vms->{ids}->{$vmid}->{node} eq $local_node;
+	    if defined($vms->{ids}->{$vmid}) && $vms->{ids}->{$vmid}->{node} eq $local_node;
     }
 
     my $purge_state = sub {
-- 
2.30.2






* [pve-devel] [PATCH guest-common v2 2/2] ReplicationState: deterministically order replication jobs
  2022-06-03  7:16 [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms Dominik Csapak
@ 2022-06-03  7:16 ` Dominik Csapak
  2022-06-07  9:12 ` [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms Fabian Ebner
  2022-06-08  6:49 ` [pve-devel] applied-series: " Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Dominik Csapak @ 2022-06-03  7:16 UTC (permalink / raw)
  To: pve-devel

if we have multiple jobs for the same vmid with the same schedule, the
last_sync, next_sync and vmid will always be the same, so the order
depends on the iteration order of the $jobs hash, which perl does not
guarantee and effectively randomizes.

to get a fixed order, additionally take the jobid into consideration.
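
As a minimal standalone sketch of the resulting ordering (made-up job data,
not the actual get_next_job() code), the jobid acts as the final
tie-breaker:

#!/usr/bin/perl
use strict;
use warnings;

# two jobs for the same guest with identical scheduling data
my $jobs = {
    '100-0' => { guest => 100, next_sync => 1000, state => { last_iteration => 1 } },
    '100-1' => { guest => 100, next_sync => 1000, state => { last_iteration => 1 } },
};

my $sort_func = sub {
    my $joba = $jobs->{$a};
    my $jobb = $jobs->{$b};
    # all numeric criteria tie here, so the job id decides the order
    return $joba->{state}->{last_iteration} <=> $jobb->{state}->{last_iteration}
        || $joba->{next_sync} <=> $jobb->{next_sync}
        || $joba->{guest} <=> $jobb->{guest}
        || $a cmp $b;
};

# always prints 100-0 before 100-1, regardless of hash iteration order
print "$_\n" for sort $sort_func keys %$jobs;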

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
changes from v1:
* combined all returns into one

 src/PVE/ReplicationState.pm | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/PVE/ReplicationState.pm b/src/PVE/ReplicationState.pm
index 8eebb42..16b3d90 100644
--- a/src/PVE/ReplicationState.pm
+++ b/src/PVE/ReplicationState.pm
@@ -318,11 +318,10 @@ sub get_next_job {
 	my $jobb = $jobs->{$b};
 	my $sa =  $joba->{state};
 	my $sb =  $jobb->{state};
-	my $res = $sa->{last_iteration} <=> $sb->{last_iteration};
-	return $res if $res != 0;
-	$res = $joba->{next_sync} <=> $jobb->{next_sync};
-	return $res if $res != 0;
-	return  $joba->{guest} <=> $jobb->{guest};
+	return $sa->{last_iteration} <=> $sb->{last_iteration} ||
+	    $joba->{next_sync} <=> $jobb->{next_sync} ||
+	    $joba->{guest} <=> $jobb->{guest} ||
+	    $a cmp $b;
     };
 
     foreach my $jobid (sort $sort_func keys %$jobs) {
-- 
2.30.2






* Re: [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms
  2022-06-03  7:16 [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms Dominik Csapak
  2022-06-03  7:16 ` [pve-devel] [PATCH guest-common v2 2/2] ReplicationState: deterministically order replication jobs Dominik Csapak
@ 2022-06-07  9:12 ` Fabian Ebner
  2022-06-08  6:49 ` [pve-devel] applied-series: " Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Fabian Ebner @ 2022-06-07  9:12 UTC (permalink / raw)
  To: pve-devel, Dominik Csapak

On 03.06.22 at 09:16, Dominik Csapak wrote:
> when running replication, we don't want to keep replication states for
> non-local vms. Normally this would not be a problem, since on migration,
> we transfer the states anyway, but when the ha-manager steals a vm, it
> cannot do that. In that case, having an old state lying around is
> harmful, since the code does not expect the state to be out-of-sync
> with the actual snapshots on disk.
> 
> One such problem is the following:
> 
> Replicate vm 100 from node A to nodes B and C, and activate HA. When node
> A dies, the vm will be relocated to, e.g., node B and replication starts
> from there. If node B still has an old state lying around for its sync to
> node C, it might delete the common base snapshots of B and C and then
> cannot sync again.

To be even more robust, we could ensure that the last_sync snapshot
mentioned in the job state is actually present before starting to remove
replication snapshots in prepare() on the source side, or change it to
only remove older snapshots. But prepare() is also used on the target
side to remove stale volumes, so we'd have to be careful not to break
the logic for that.

I'm working on v2 of a series improving the removal of stale volumes
anyway, so I'll see if I can add something there.
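
A rough standalone sketch of that idea (made-up snapshot names and job
state, not the actual prepare() code):

#!/usr/bin/perl
use strict;
use warnings;

# made-up job state and snapshots currently present on the source volume
my $jobid = '100-0';
my $state = { last_sync => 1654241000 };
my @snapshots = ('__replicate_100-0_1654241000__');

my $last_sync_snap = "__replicate_${jobid}_$state->{last_sync}__";
if (grep { $_ eq $last_sync_snap } @snapshots) {
    print "last_sync snapshot present, removing older ones is safe\n";
} else {
    print "state looks stale, keep potentially shared snapshots\n";
}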

> 
> Deleting the state for all non-local guests fixes that issue, since
> replication then always starts fresh, and any potentially existing old
> state cannot be valid anyway, since we just relocated the vm here (from a
> dead node).
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

Both patches:
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>





* [pve-devel] applied-series: [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms
  2022-06-03  7:16 [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms Dominik Csapak
  2022-06-03  7:16 ` [pve-devel] [PATCH guest-common v2 2/2] ReplicationState: deterministically order replication jobs Dominik Csapak
  2022-06-07  9:12 ` [pve-devel] [PATCH guest-common v2 1/2] ReplicationState: purge state from non local vms Fabian Ebner
@ 2022-06-08  6:49 ` Thomas Lamprecht
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2022-06-08  6:49 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak

On 03/06/2022 at 09:16, Dominik Csapak wrote:
> when running replication, we don't want to keep replication states for
> non-local vms. Normally this would not be a problem, since on migration,
> we transfer the states anyway, but when the ha-manager steals a vm, it
> cannot do that. In that case, having an old state lying around is
> harmful, since the code does not expect the state to be out-of-sync
> with the actual snapshots on disk.
> 
> One such problem is the following:
> 
> Replicate vm 100 from node A to nodes B and C, and activate HA. When node
> A dies, the vm will be relocated to, e.g., node B and replication starts
> from there. If node B still has an old state lying around for its sync to
> node C, it might delete the common base snapshots of B and C and then
> cannot sync again.
> 
> Deleting the state for all non-local guests fixes that issue, since
> replication then always starts fresh, and any potentially existing old
> state cannot be valid anyway, since we just relocated the vm here (from a
> dead node).
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>  src/PVE/ReplicationState.pm | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
>

applied, with Fabi's R-b, thanks!




