* [pve-devel] [PATCH v4 container 1/2] partially fix #3424: vzdump: cleanup: wait for active replication
From: Fabian Ebner @ 2022-02-23 12:03 UTC
To: pve-devel
As replication and backup can happen at the same time, the vzdump
snapshot might still be in active use by replication when the backup
tries to clean up. This can leave the snapshot not (or only partially)
removed and the container stuck with the (snapshot-delete) lock.
Wait up to 10 minutes for any ongoing replication. If replication
doesn't finish in time, no attempt to remove the snapshot is made at
all, so there is no risk of the container ending up in a locked state.
The next backup will then force-remove the left-over snapshot right at
its start, which will very likely succeed even at the storage layer,
because the replication really should be done by then (subsequent
replications don't matter, as they don't need to re-transfer the
vzdump snapshot).
Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Co-developed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
Changes from v3:
* Unconditionally take the lock, to not race with replication job
creation and for better future-proofing.
* Only log if we can't obtain the lock quickly rather than if a
replication job is configured.
* Make message more general, because it might be another snapshot
removal operation holding the lock.
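
For reference, the locking logic added below, condensed into a sketch
(error propagation and the final failure message are trimmed here; see
the actual hunk for those):

    my $lock_obtained;
    my $do_delete = sub {
        $lock_obtained = 1; # past this point, errors come from the delete itself
        # ... remove the temporary 'vzdump' snapshot ...
    };

    # first, try to get the guest migration lock with a 1 second timeout
    eval { PVE::GuestHelpers::guest_migration_lock($vmid, 1, $do_delete) };
    if ($@ && !$lock_obtained) {
        # only now tell the user something is holding the lock and wait
        # up to 10 minutes (600s) for it, e.g. an ongoing replication
        $self->loginfo("waiting for active replication or other operation..");
        PVE::GuestHelpers::guest_migration_lock($vmid, 600, $do_delete);
    }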
src/PVE/VZDump/LXC.pm | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/src/PVE/VZDump/LXC.pm b/src/PVE/VZDump/LXC.pm
index b7f7463..078b74d 100644
--- a/src/PVE/VZDump/LXC.pm
+++ b/src/PVE/VZDump/LXC.pm
@@ -8,6 +8,7 @@ use File::Path;
 use POSIX qw(strftime);
 
 use PVE::Cluster qw(cfs_read_file);
+use PVE::GuestHelpers;
 use PVE::INotify;
 use PVE::LXC::Config;
 use PVE::LXC;
@@ -476,8 +477,23 @@ sub cleanup {
     }
 
     if ($task->{cleanup}->{remove_snapshot}) {
-        $self->loginfo("cleanup temporary 'vzdump' snapshot");
-        PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
+        my $lock_obtained;
+        my $do_delete = sub {
+            $lock_obtained = 1;
+            $self->loginfo("cleanup temporary 'vzdump' snapshot");
+            PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
+        };
+
+        eval {
+            eval { PVE::GuestHelpers::guest_migration_lock($vmid, 1, $do_delete); };
+            if (my $err = $@) {
+                die $err if $lock_obtained;
+
+                $self->loginfo("waiting for active replication or other operation..");
+                PVE::GuestHelpers::guest_migration_lock($vmid, 600, $do_delete);
+            }
+        };
+        die "snapshot 'vzdump' was not (fully) removed - $@" if $@;
     }
 }
--
2.30.2
* [pve-devel] [PATCH v4 container 2/2] fix #3424: api: snapshot delete: wait for active replication
From: Fabian Ebner @ 2022-02-23 12:03 UTC
To: pve-devel
A to-be-deleted snapshot might be actively used by replication, which
can leave the snapshot not (or only partially) removed and the
container stuck with the (snapshot-delete) lock. Simply wait a few
seconds for any ongoing replication to finish.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
Changes from v3:
* Use guest_migration_lock() directly.
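
For context, guest_migration_lock($vmid, $timeout, $code) runs $code
under the same per-guest lock that replication (and migration) hold
while they are running, and dies if that lock cannot be acquired within
$timeout seconds. The snapshot removal below therefore effectively
becomes (condensed sketch of the hunk that follows):

    if ($param->{force}) {
        $do_delete->(); # a forced cleanup never waits on the lock
    } else {
        # give an ongoing replication up to 10 seconds to finish
        PVE::GuestHelpers::guest_migration_lock($vmid, 10, $do_delete);
    }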
src/PVE/API2/LXC/Snapshot.pm | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/src/PVE/API2/LXC/Snapshot.pm b/src/PVE/API2/LXC/Snapshot.pm
index 160e5eb..8d0319a 100644
--- a/src/PVE/API2/LXC/Snapshot.pm
+++ b/src/PVE/API2/LXC/Snapshot.pm
@@ -10,6 +10,7 @@ use PVE::INotify;
 use PVE::Cluster qw(cfs_read_file);
 use PVE::AccessControl;
 use PVE::Firewall;
+use PVE::GuestHelpers;
 use PVE::Storage;
 use PVE::RESTHandler;
 use PVE::RPCEnvironment;
@@ -198,11 +199,19 @@ __PACKAGE__->register_method({
         my $snapname = extract_param($param, 'snapname');
 
-        my $realcmd = sub {
+        my $do_delete = sub {
             PVE::Cluster::log_msg('info', $authuser, "delete snapshot VM $vmid: $snapname");
             PVE::LXC::Config->snapshot_delete($vmid, $snapname, $param->{force});
         };
 
+        my $realcmd = sub {
+            if ($param->{force}) {
+                $do_delete->();
+            } else {
+                PVE::GuestHelpers::guest_migration_lock($vmid, 10, $do_delete);
+            }
+        };
+
         return $rpcenv->fork_worker('vzdelsnapshot', $vmid, $authuser, $realcmd);
     }});
--
2.30.2
* [pve-devel] [PATCH v4 qemu-server 1/1] fix #3424: api: snapshot delete: wait for active replication
From: Fabian Ebner @ 2022-02-23 12:03 UTC
To: pve-devel
A to-be-deleted snapshot might be actively used by replication, which
can leave the snapshot not (or only partially) removed and the VM
stuck with the (snapshot-delete) lock. Simply wait a few seconds for
any ongoing replication to finish.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---
Changes from v3:
* Use guest_migration_lock() directly.
PVE/API2/Qemu.pm | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
index 9be1caf..d123ece 100644
--- a/PVE/API2/Qemu.pm
+++ b/PVE/API2/Qemu.pm
@@ -4552,11 +4552,19 @@ __PACKAGE__->register_method({
         my $snapname = extract_param($param, 'snapname');
 
-        my $realcmd = sub {
+        my $do_delete = sub {
             PVE::Cluster::log_msg('info', $authuser, "delete snapshot VM $vmid: $snapname");
             PVE::QemuConfig->snapshot_delete($vmid, $snapname, $param->{force});
         };
 
+        my $realcmd = sub {
+            if ($param->{force}) {
+                $do_delete->();
+            } else {
+                PVE::GuestHelpers::guest_migration_lock($vmid, 10, $do_delete);
+            }
+        };
+
         return $rpcenv->fork_worker('qmdelsnapshot', $vmid, $authuser, $realcmd);
     }});
--
2.30.2
* [pve-devel] applied-series: [PATCH-SERIES v4 container/qemu-server] fix #3424: wait for active replication when deleting a snapshot
From: Fabian Grünbichler @ 2022-03-15 12:34 UTC
To: Proxmox VE development discussion
with some small follow-ups as discussed off-list:
- improve the error messages in case the lock cannot be acquired
- downgrade the die to a warn in the pve-container vzdump patch, since
a failure there is not fatal
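
Roughly, that second follow-up turns the final line of the vzdump hunk
in patch 1/2 from a hard failure into a warning. A sketch of what it
looks like (the exact message and logging helper in the applied
follow-up may differ; $self->log('warn', ...) is assumed here):

-        die "snapshot 'vzdump' was not (fully) removed - $@" if $@;
+        $self->log('warn', "snapshot 'vzdump' was not (fully) removed - $@")
+            if $@;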
On February 23, 2022 1:03 pm, Fabian Ebner wrote:
> Prevent an attempt to remove a snapshot that is actively used by
> replication from leaving behind a partially (or not at all) removed
> snapshot and a locked guest.
> 
> I decided to make the checks at the call sites, because passing the
> log function and timeout to snapshot_delete felt awkward, as they
> would only be used for obtaining the lock.
>
> Changes from v3:
> * Unconditionally take the lock, to not race with replication job
> creation and for better future-proofing.
> * In the case with the long timeout, only log if we can't obtain the
> lock quickly.
> * Make message more general, because it might be another snapshot
> removal operation holding the lock.
>
>
> container:
>
> Fabian Ebner (2):
> partially fix #3424: vzdump: cleanup: wait for active replication
> fix #3424: api: snapshot delete: wait for active replication
>
> src/PVE/API2/LXC/Snapshot.pm | 11 ++++++++++-
> src/PVE/VZDump/LXC.pm | 20 ++++++++++++++++++--
> 2 files changed, 28 insertions(+), 3 deletions(-)
>
>
> qemu-server:
>
> Fabian Ebner (1):
> fix #3424: api: snapshot delete: wait for active replication
>
> PVE/API2/Qemu.pm | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> --
> 2.30.2