public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH-SERIES guest-common/container/qemu-server] fix #3424: wait for active replication when deleting a snapshot
@ 2022-02-21 11:58 Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard Fabian Ebner
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Fabian Ebner @ 2022-02-21 11:58 UTC (permalink / raw)
  To: pve-devel

Avoid the situation where an attempt to remove a snapshot that's
actively used by replication leads to a partially removed (or not
removed at all) snapshot and a locked guest.

I decided to make the checks at the call sites, because passing the
log function and timeout to snapshot_delete felt awkward, as they
would only be used for obtaining the lock.
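
The call sites then all follow roughly this pattern (sketch only; the
concrete timeouts and log callbacks differ per caller, see the
individual patches):

my $do_delete = sub {
    # the actual snapshot removal, i.e. the snapshot_delete() call
};
PVE::GuestHelpers::run_with_replication_guard(
    $vmid, $timeout, $logfn, $do_delete);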

Changes from v2:
    * Also check upon manual snapshot removal, not just for vzdump.
    * Add common helper.


guest-common:

Fabian Ebner (1):
  guest helpers: add run_with_replication_guard

 src/PVE/GuestHelpers.pm | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


container:

Fabian Ebner (2):
  partially fix #3424: vzdump: cleanup: wait for active replication
  fix #3424: api: snapshot delete: wait for active replication

 src/PVE/API2/LXC/Snapshot.pm | 12 +++++++++++-
 src/PVE/VZDump/LXC.pm        | 11 +++++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)


qemu-server:

Fabian Ebner (1):
  fix #3424: api: snapshot delete: wait for active replication

 PVE/API2/Qemu.pm | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

-- 
2.30.2






* [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard
  2022-02-21 11:58 [pve-devel] [PATCH-SERIES guest-common/container/qemu-server] fix #3424: wait for active replication when deleting a snapshot Fabian Ebner
@ 2022-02-21 11:58 ` Fabian Ebner
  2022-02-22  9:41   ` Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 container 1/2] partially fix #3424: vzdump: cleanup: wait for active replication Fabian Ebner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Fabian Ebner @ 2022-02-21 11:58 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

New in v3.

 src/PVE/GuestHelpers.pm | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/PVE/GuestHelpers.pm b/src/PVE/GuestHelpers.pm
index 970c460..1183819 100644
--- a/src/PVE/GuestHelpers.pm
+++ b/src/PVE/GuestHelpers.pm
@@ -3,8 +3,9 @@ package PVE::GuestHelpers;
 use strict;
 use warnings;
 
-use PVE::Tools;
+use PVE::ReplicationConfig;
 use PVE::Storage;
+use PVE::Tools;
 
 use POSIX qw(strftime);
 use Scalar::Util qw(weaken);
@@ -82,6 +83,18 @@ sub guest_migration_lock {
     return $res;
 }
 
+sub run_with_replication_guard {
+    my ($vmid, $timeout, $log, $func, @param) = @_;
+
+    my $repl_conf = PVE::ReplicationConfig->new();
+    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
+	$log->("checking/waiting for active replication..") if $log;
+	guest_migration_lock($vmid, $timeout, $func, @param);
+    } else {
+	$func->(@param);
+    }
+}
+
 sub check_hookscript {
     my ($volid, $storecfg) = @_;
 
-- 
2.30.2






* [pve-devel] [PATCH v3 container 1/2] partially fix #3424: vzdump: cleanup: wait for active replication
  2022-02-21 11:58 [pve-devel] [PATCH-SERIES guest-common/container/qemu-server] fix #3424: wait for active replication when deleting a snapshot Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard Fabian Ebner
@ 2022-02-21 11:58 ` Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 container 2/2] fix #3424: api: snapshot delete: " Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 qemu-server 1/1] " Fabian Ebner
  3 siblings, 0 replies; 8+ messages in thread
From: Fabian Ebner @ 2022-02-21 11:58 UTC (permalink / raw)
  To: pve-devel

As replication and backup can happen at the same time, the vzdump
snapshot might be actively used by replication when the backup tries
to clean up, resulting in a snapshot that is not (or only partially)
removed and a container left with the 'snapshot-delete' lock.

Wait up to 10 minutes for any ongoing replication. If replication
doesn't finish in time, no removal is attempted at all, so there is no
risk of the container ending up in a locked state. The beginning of
the next backup will then force-remove the left-over snapshot, which
will very likely succeed even at the storage layer, because the
replication really should be done by then (subsequent replications
don't matter, as they don't need to re-transfer the vzdump snapshot).

Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Dependency bump for guest-common needed.

Changes from v2:
    * Use new helper.

VM backups are not affected by this, because they don't use
storage/config snapshots, but use pve-qemu's block layer.

 src/PVE/VZDump/LXC.pm | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/PVE/VZDump/LXC.pm b/src/PVE/VZDump/LXC.pm
index b7f7463..2d943a1 100644
--- a/src/PVE/VZDump/LXC.pm
+++ b/src/PVE/VZDump/LXC.pm
@@ -8,6 +8,7 @@ use File::Path;
 use POSIX qw(strftime);
 
 use PVE::Cluster qw(cfs_read_file);
+use PVE::GuestHelpers;
 use PVE::INotify;
 use PVE::LXC::Config;
 use PVE::LXC;
@@ -476,8 +477,14 @@ sub cleanup {
     }
 
     if ($task->{cleanup}->{remove_snapshot}) {
-	$self->loginfo("cleanup temporary 'vzdump' snapshot");
-	PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
+	my $do_delete = sub {
+	    $self->loginfo("cleanup temporary 'vzdump' snapshot");
+	    PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
+	};
+	my $logfn = sub { $self->loginfo($_[0]); };
+
+	eval { PVE::GuestHelpers::run_with_replication_guard($vmid, 600, $logfn, $do_delete); };
+	die "snapshot 'vzdump' was not (fully) removed - $@" if $@;
     }
 }
 
-- 
2.30.2






* [pve-devel] [PATCH v3 container 2/2] fix #3424: api: snapshot delete: wait for active replication
  2022-02-21 11:58 [pve-devel] [PATCH-SERIES guest-common/container/qemu-server] fix #3424: wait for active replication when deleting a snapshot Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 container 1/2] partially fix #3424: vzdump: cleanup: wait for active replication Fabian Ebner
@ 2022-02-21 11:58 ` Fabian Ebner
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 qemu-server 1/1] " Fabian Ebner
  3 siblings, 0 replies; 8+ messages in thread
From: Fabian Ebner @ 2022-02-21 11:58 UTC (permalink / raw)
  To: pve-devel

A to-be-deleted snapshot might be actively used by replication,
resulting in a snapshot that is not (or only partially) removed and a
container left with the 'snapshot-delete' lock. Simply wait a few
seconds for any ongoing replication to finish.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

New in v3.

 src/PVE/API2/LXC/Snapshot.pm | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/PVE/API2/LXC/Snapshot.pm b/src/PVE/API2/LXC/Snapshot.pm
index 160e5eb..8009586 100644
--- a/src/PVE/API2/LXC/Snapshot.pm
+++ b/src/PVE/API2/LXC/Snapshot.pm
@@ -10,6 +10,7 @@ use PVE::INotify;
 use PVE::Cluster qw(cfs_read_file);
 use PVE::AccessControl;
 use PVE::Firewall;
+use PVE::GuestHelpers;
 use PVE::Storage;
 use PVE::RESTHandler;
 use PVE::RPCEnvironment;
@@ -198,11 +199,20 @@ __PACKAGE__->register_method({
 
 	my $snapname = extract_param($param, 'snapname');
 
-	my $realcmd = sub {
+	my $do_delete = sub {
 	    PVE::Cluster::log_msg('info', $authuser, "delete snapshot VM $vmid: $snapname");
 	    PVE::LXC::Config->snapshot_delete($vmid, $snapname, $param->{force});
 	};
 
+	my $realcmd = sub {
+	    if ($param->{force}) {
+		$do_delete->();
+	    } else {
+		my $logfn = sub { print "$_[0]\n"; };
+		PVE::GuestHelpers::run_with_replication_guard($vmid, 10, $logfn, $do_delete);
+	    }
+	};
+
 	return $rpcenv->fork_worker('vzdelsnapshot', $vmid, $authuser, $realcmd);
     }});
 
-- 
2.30.2






* [pve-devel] [PATCH v3 qemu-server 1/1] fix #3424: api: snapshot delete: wait for active replication
  2022-02-21 11:58 [pve-devel] [PATCH-SERIES guest-common/container/qemu-server] fix #3424: wait for active replication when deleting a snapshot Fabian Ebner
                   ` (2 preceding siblings ...)
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 container 2/2] fix #3424: api: snapshot delete: " Fabian Ebner
@ 2022-02-21 11:58 ` Fabian Ebner
  3 siblings, 0 replies; 8+ messages in thread
From: Fabian Ebner @ 2022-02-21 11:58 UTC (permalink / raw)
  To: pve-devel

A to-be-deleted snapshot might be actively used by replication,
resulting in a snapshot that is not (or only partially) removed and a
VM left with the 'snapshot-delete' lock. Simply wait a few seconds for
any ongoing replication to finish.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Dependency bump for guest-common needed.

New in v3.

 PVE/API2/Qemu.pm | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
index 9be1caf..4fb05f7 100644
--- a/PVE/API2/Qemu.pm
+++ b/PVE/API2/Qemu.pm
@@ -4552,11 +4552,20 @@ __PACKAGE__->register_method({
 
 	my $snapname = extract_param($param, 'snapname');
 
-	my $realcmd = sub {
+	my $do_delete = sub {
 	    PVE::Cluster::log_msg('info', $authuser, "delete snapshot VM $vmid: $snapname");
 	    PVE::QemuConfig->snapshot_delete($vmid, $snapname, $param->{force});
 	};
 
+	my $realcmd = sub {
+	    if ($param->{force}) {
+		$do_delete->();
+	    } else {
+		my $logfn = sub { print "$_[0]\n"; };
+		PVE::GuestHelpers::run_with_replication_guard($vmid, 10, $logfn, $do_delete);
+	    }
+	};
+
 	return $rpcenv->fork_worker('qmdelsnapshot', $vmid, $authuser, $realcmd);
     }});
 
-- 
2.30.2






* Re: [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard
  2022-02-21 11:58 ` [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard Fabian Ebner
@ 2022-02-22  9:41   ` Fabian Ebner
  2022-02-22 10:27     ` Fabian Grünbichler
  0 siblings, 1 reply; 8+ messages in thread
From: Fabian Ebner @ 2022-02-22  9:41 UTC (permalink / raw)
  To: pve-devel, Fabian Grünbichler, Thomas Lamprecht

Am 21.02.22 um 12:58 schrieb Fabian Ebner:
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> New in v3.
> 
>  src/PVE/GuestHelpers.pm | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/src/PVE/GuestHelpers.pm b/src/PVE/GuestHelpers.pm
> index 970c460..1183819 100644
> --- a/src/PVE/GuestHelpers.pm
> +++ b/src/PVE/GuestHelpers.pm
> @@ -3,8 +3,9 @@ package PVE::GuestHelpers;
>  use strict;
>  use warnings;
>  
> -use PVE::Tools;
> +use PVE::ReplicationConfig;
>  use PVE::Storage;
> +use PVE::Tools;
>  
>  use POSIX qw(strftime);
>  use Scalar::Util qw(weaken);
> @@ -82,6 +83,18 @@ sub guest_migration_lock {
>      return $res;
>  }
>  
> +sub run_with_replication_guard {
> +    my ($vmid, $timeout, $log, $func, @param) = @_;
> +
> +    my $repl_conf = PVE::ReplicationConfig->new();
> +    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
> +	$log->("checking/waiting for active replication..") if $log;
> +	guest_migration_lock($vmid, $timeout, $func, @param);

I wonder if we should unconditionally take the lock? If not, we can race
with a newly created replication job:
1. snapshot deletion starts
2. replication job is created
3. replication job starts
4. snapshot deletion runs into 'dataset is busy' error, because snapshot
is used by replication

IIRC Thomas didn't want the log line to be printed when there is no
replication configured, and we can still do that if we go for the
"unconditionally acquire lock" approach (it should get the lock quickly
except in the above edge case), but it would mean checking the
replication config just for that. My suggestion would be to get rid of
this helper, not log anything in the cases with 10s timeout, and log if
replication is configured in the backup case with 600s timeout.
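
For illustration, a minimal sketch of that "unconditionally acquire
lock" variant (still only logging when a job is actually configured)
could look like:

sub run_with_replication_guard {
    my ($vmid, $timeout, $log, $func, @param) = @_;

    # log only if a replication job exists, but always take the lock
    my $repl_conf = PVE::ReplicationConfig->new();
    $log->("checking/waiting for active replication..")
        if $log && $repl_conf->check_for_existing_jobs($vmid, 1);

    guest_migration_lock($vmid, $timeout, $func, @param);
}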

> +    } else {
> +	$func->(@param);
> +    }
> +}
> +
>  sub check_hookscript {
>      my ($volid, $storecfg) = @_;
>  





* Re: [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard
  2022-02-22  9:41   ` Fabian Ebner
@ 2022-02-22 10:27     ` Fabian Grünbichler
  2022-02-22 13:44       ` Fabian Ebner
  0 siblings, 1 reply; 8+ messages in thread
From: Fabian Grünbichler @ 2022-02-22 10:27 UTC (permalink / raw)
  To: Fabian Ebner, pve-devel, Thomas Lamprecht

On February 22, 2022 10:41 am, Fabian Ebner wrote:
> Am 21.02.22 um 12:58 schrieb Fabian Ebner:
>> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
>> ---
>> 
>> New in v3.
>> 
>>  src/PVE/GuestHelpers.pm | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>> 
>> diff --git a/src/PVE/GuestHelpers.pm b/src/PVE/GuestHelpers.pm
>> index 970c460..1183819 100644
>> --- a/src/PVE/GuestHelpers.pm
>> +++ b/src/PVE/GuestHelpers.pm
>> @@ -3,8 +3,9 @@ package PVE::GuestHelpers;
>>  use strict;
>>  use warnings;
>>  
>> -use PVE::Tools;
>> +use PVE::ReplicationConfig;
>>  use PVE::Storage;
>> +use PVE::Tools;
>>  
>>  use POSIX qw(strftime);
>>  use Scalar::Util qw(weaken);
>> @@ -82,6 +83,18 @@ sub guest_migration_lock {
>>      return $res;
>>  }
>>  
>> +sub run_with_replication_guard {
>> +    my ($vmid, $timeout, $log, $func, @param) = @_;
>> +
>> +    my $repl_conf = PVE::ReplicationConfig->new();
>> +    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
>> +	$log->("checking/waiting for active replication..") if $log;
>> +	guest_migration_lock($vmid, $timeout, $func, @param);
> 
> I wonder if we should unconditionally take the lock? If not, we can race
> with a newly created replication job:
> 1. snapshot deletion starts
> 2. replication job is created
> 3. replication job starts
> 4. snapshot deletion runs into 'dataset is busy' error, because snapshot
> is used by replication

that could also be solved by lock_config on the guest config when 
creating/modifying a replication job, but unconditionally trying to 
obtain the lock is also fine by me.

> IIRC Thomas didn't want the log line to be printed when there is no
> replication configured, and we can still do that if we go for the
> "unconditionally acquire lock" approach (it should get the lock quickly
> except in the above edge case), but it would mean checking the
> replication config just for that. My suggestion would be to get rid of
> this helper, not log anything in the cases with 10s timeout, and log if
> replication is configured in the backup case with 600s timeout.

for the long timeout case, the following would also work? or the 
existing helper could switch to something like this, if we want to 
keep it after all..

my $lock_obtained;
my $do_delete = sub {
  $lock_obtained = 1;
  ...
};

eval {
  # no msg
  lock_with_really_short_timeout($do_delete);
};

if (my $err = $@) {
  # $lock_obtained tells us whether the error is failing to lock
  # or failing to delete
  die $err
    if $lock_obtained;

  msg(..)
  lock_with_long_timeout($do_delete);
}

that way we don't have a delay warranting a log message if no 
replication is running (or even configured), but if one is running, 
we'll quickly fail to obtain the lock, and can retry with a message that 
explains the pause and a longer timeout..
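
Spelled out against the existing guest_migration_lock() (the 1s/600s
timeouts and the $logfn callback below are just placeholders for
illustration), that could look roughly like:

my $lock_obtained;
my $do_delete = sub {
    $lock_obtained = 1;
    # ... the actual snapshot_delete() call ...
};

# first attempt: very short timeout, no log message
eval { PVE::GuestHelpers::guest_migration_lock($vmid, 1, $do_delete); };
if (my $err = $@) {
    # we got the lock but the deletion itself failed - don't retry
    die $err if $lock_obtained;

    # lock is busy, most likely due to replication - log and wait longer
    $logfn->("waiting for active replication..");
    PVE::GuestHelpers::guest_migration_lock($vmid, 600, $do_delete);
}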

> 
>> +    } else {
>> +	$func->(@param);
>> +    }
>> +}
>> +
>>  sub check_hookscript {
>>      my ($volid, $storecfg) = @_;
>>  
> 





* Re: [pve-devel] [PATCH v3 guest-common 1/1] guest helpers: add run_with_replication_guard
  2022-02-22 10:27     ` Fabian Grünbichler
@ 2022-02-22 13:44       ` Fabian Ebner
  0 siblings, 0 replies; 8+ messages in thread
From: Fabian Ebner @ 2022-02-22 13:44 UTC (permalink / raw)
  To: Fabian Grünbichler, pve-devel

Am 22.02.22 um 11:27 schrieb Fabian Grünbichler:
> On February 22, 2022 10:41 am, Fabian Ebner wrote:
>> Am 21.02.22 um 12:58 schrieb Fabian Ebner:
>>> @@ -82,6 +83,18 @@ sub guest_migration_lock {
>>>      return $res;
>>>  }
>>>  
>>> +sub run_with_replication_guard {
>>> +    my ($vmid, $timeout, $log, $func, @param) = @_;
>>> +
>>> +    my $repl_conf = PVE::ReplicationConfig->new();
>>> +    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
>>> +	$log->("checking/waiting for active replication..") if $log;
>>> +	guest_migration_lock($vmid, $timeout, $func, @param);
>>
>> I wonder if we should unconditionally take the lock? If not, we can race
>> with a newly created replication job:
>> 1. snapshot deletion starts
>> 2. replication job is created
>> 3. replication job starts
>> 4. snapshot deletion runs into 'dataset is busy' error, because snapshot
>> is used by replication
> 
> that could also be solved by lock_config on the guest config when 
> creating/modifying a replication job, but unconditionally trying to 
> obtain the lock is also fine by me.
> 

Doing it upon creation should be enough, and it would also fix a race
between creating a replication job and adding a non-replicatable disk.
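
A very rough sketch of that idea (with $is_lxc just standing in for
however the guest type would actually be determined - not part of this
series):

# hold the guest's config lock while writing the new replication job,
# so job creation cannot race with an in-progress snapshot deletion
my $guest_class = $is_lxc ? 'PVE::LXC::Config' : 'PVE::QemuConfig';
$guest_class->lock_config($vmid, sub {
    my $cfg = PVE::ReplicationConfig->new();
    # ... add the new job entry ...
    $cfg->write();
});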



