all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI
@ 2025-11-20 10:17 Fiona Ebner
  2025-11-20 13:21 ` Fabian Grünbichler
  2025-11-20 14:06 ` [pve-devel] applied: " Thomas Lamprecht
  0 siblings, 2 replies; 4+ messages in thread
From: Fiona Ebner @ 2025-11-20 10:17 UTC (permalink / raw)
  To: pve-devel

Doing a rollback via CLI on an LVM storage with 'saferemove' and
'snapshot-as-volume-chain' would run into a locking issue, because
the forked zero-out worker would try to acquire the lock while the
main CLI task is still inside the locked section for
volume_snapshot_rollback_locked(). The same issue does not happen when
the rollback is done via UI. The reason for this can be found in the
note regarding fork_worker():

> we simulate running in foreground if ($self->{type} eq 'cli')

So the worker will be awaited synchronously in CLI context, resulting
in the deadlock, while via API/UI, the main task would move on and
release the lock allowing the zero-out worker to acquire it.

Avoid doing fork_cleanup_worker() inside the locked section to avoid
the issue.

Fixes: 8eabcc7 ("lvm plugin: snapshot-as-volume-chain: use locking for snapshot operations")
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Changes in v2:
* Better explanation in commit message.
* Also spawn cleanup worker in error scenario like before the patch.

 src/PVE/Storage/LVMPlugin.pm | 43 ++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/src/PVE/Storage/LVMPlugin.pm b/src/PVE/Storage/LVMPlugin.pm
index 97f7bf4..102cf22 100644
--- a/src/PVE/Storage/LVMPlugin.pm
+++ b/src/PVE/Storage/LVMPlugin.pm
@@ -1117,23 +1117,17 @@ sub volume_rollback_is_possible {
 }
 
 my sub volume_snapshot_rollback_locked {
-    my ($class, $scfg, $storeid, $volname, $snap) = @_;
+    my ($class, $scfg, $storeid, $volname, $snap, $cleanup_worker) = @_;
 
     my $format = ($class->parse_volname($volname))[6];
 
     die "can't rollback snapshot for '$format' volume\n" if $format ne 'qcow2';
 
-    my $cleanup_worker = eval { free_snap_image($class, $storeid, $scfg, $volname, 'current'); };
+    $cleanup_worker->$* = eval { free_snap_image($class, $storeid, $scfg, $volname, 'current'); };
     die "error deleting snapshot $snap $@\n" if $@;
 
     eval { alloc_snap_image($class, $storeid, $scfg, $volname, $snap) };
-    my $alloc_err = $@;
-
-    fork_cleanup_worker($cleanup_worker);
-
-    if ($alloc_err) {
-        die "can't allocate new volume $volname: $alloc_err\n";
-    }
+    die "can't allocate new volume $volname: $@\n" if $@;
 
     return undef;
 }
@@ -1141,14 +1135,29 @@ my sub volume_snapshot_rollback_locked {
 sub volume_snapshot_rollback {
     my ($class, $scfg, $storeid, $volname, $snap) = @_;
 
-    return $class->cluster_lock_storage(
-        $storeid,
-        $scfg->{shared},
-        undef,
-        sub {
-            return volume_snapshot_rollback_locked($class, $scfg, $storeid, $volname, $snap);
-        },
-    );
+    my $cleanup_worker;
+
+    eval {
+        $class->cluster_lock_storage(
+            $storeid,
+            $scfg->{shared},
+            undef,
+            sub {
+                volume_snapshot_rollback_locked(
+                    $class, $scfg, $storeid, $volname, $snap, \$cleanup_worker,
+                );
+            },
+        );
+    };
+    my $err = $@;
+
+    # Spawn outside of the locked section, because with 'saferemove', the cleanup worker also needs
+    # to obtain the lock, and in CLI context, it will be awaited synchronously, see fork_worker().
+    fork_cleanup_worker($cleanup_worker);
+
+    die $err if $err;
+
+    return;
 }
 
 sub volume_snapshot_delete {
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI
  2025-11-20 10:17 [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI Fiona Ebner
@ 2025-11-20 13:21 ` Fabian Grünbichler
  2025-11-20 13:35   ` Fiona Ebner
  2025-11-20 14:06 ` [pve-devel] applied: " Thomas Lamprecht
  1 sibling, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2025-11-20 13:21 UTC (permalink / raw)
  To: Proxmox VE development discussion

On November 20, 2025 11:17 am, Fiona Ebner wrote:
> Doing a rollback via CLI on an LVM storage with 'saferemove' and
> 'snapshot-as-volume-chain' would run into a locking issue, because
> the forked zero-out worker would try to acquire the lock while the
> main CLI task is still inside the locked section for
> volume_snapshot_rollback_locked(). The same issue does not happen when
> the rollback is done via UI. The reason for this can be found in the
> note regarding fork_worker():
> 
>> we simulate running in foreground if ($self->{type} eq 'cli')
> 
> So the worker will be awaited synchronously in CLI context, resulting
> in the deadlock, while via API/UI, the main task would move on and
> release the lock allowing the zero-out worker to acquire it.
> 
> Avoid doing fork_cleanup_worker() inside the locked section to avoid
> the issue.
> 
> Fixes: 8eabcc7 ("lvm plugin: snapshot-as-volume-chain: use locking for snapshot operations")
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> Changes in v2:
> * Better explanation in commit message.
> * Also spawn cleanup worker in error scenario like before the patch.
> 
>  src/PVE/Storage/LVMPlugin.pm | 43 ++++++++++++++++++++++--------------
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/src/PVE/Storage/LVMPlugin.pm b/src/PVE/Storage/LVMPlugin.pm
> index 97f7bf4..102cf22 100644
> --- a/src/PVE/Storage/LVMPlugin.pm
> +++ b/src/PVE/Storage/LVMPlugin.pm
> @@ -1117,23 +1117,17 @@ sub volume_rollback_is_possible {
>  }
>  
>  my sub volume_snapshot_rollback_locked {
> -    my ($class, $scfg, $storeid, $volname, $snap) = @_;
> +    my ($class, $scfg, $storeid, $volname, $snap, $cleanup_worker) = @_;
>  
>      my $format = ($class->parse_volname($volname))[6];
>  
>      die "can't rollback snapshot for '$format' volume\n" if $format ne 'qcow2';
>  
> -    my $cleanup_worker = eval { free_snap_image($class, $storeid, $scfg, $volname, 'current'); };
> +    $cleanup_worker->$* = eval { free_snap_image($class, $storeid, $scfg, $volname, 'current'); };
>      die "error deleting snapshot $snap $@\n" if $@;
>  
>      eval { alloc_snap_image($class, $storeid, $scfg, $volname, $snap) };
> -    my $alloc_err = $@;
> -
> -    fork_cleanup_worker($cleanup_worker);
> -
> -    if ($alloc_err) {
> -        die "can't allocate new volume $volname: $alloc_err\n";
> -    }
> +    die "can't allocate new volume $volname: $@\n" if $@;
>  
>      return undef;
>  }
> @@ -1141,14 +1135,29 @@ my sub volume_snapshot_rollback_locked {
>  sub volume_snapshot_rollback {
>      my ($class, $scfg, $storeid, $volname, $snap) = @_;
>  
> -    return $class->cluster_lock_storage(
> -        $storeid,
> -        $scfg->{shared},
> -        undef,
> -        sub {
> -            return volume_snapshot_rollback_locked($class, $scfg, $storeid, $volname, $snap);
> -        },
> -    );
> +    my $cleanup_worker;
> +
> +    eval {
> +        $class->cluster_lock_storage(
> +            $storeid,
> +            $scfg->{shared},
> +            undef,
> +            sub {
> +                volume_snapshot_rollback_locked(
> +                    $class, $scfg, $storeid, $volname, $snap, \$cleanup_worker,
> +                );
> +            },
> +        );
> +    };
> +    my $err = $@;
> +
> +    # Spawn outside of the locked section, because with 'saferemove', the cleanup worker also needs
> +    # to obtain the lock, and in CLI context, it will be awaited synchronously, see fork_worker().
> +    fork_cleanup_worker($cleanup_worker);

I mean, this just worked because the forked cleanup sub waited long
enough by chance for the lock/before acquiring the lock to allow the
parent to release it in the non-CLI case, right? i.e., it was still
wrong, because volume_snapshot_rollback_locked might have done more work
after forking the cleanup worker while still holding the lock..

all the other code paths here have this pattern:

my $cleanup_worker = eval { lock and do something };
fork_cleanup_worker($cleanup_worker)

which is exactly what this patch switches to, so consider this

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

> +
> +    die $err if $err;
> +
> +    return;
>  }
>  
>  sub volume_snapshot_delete {
> -- 
> 2.47.3
> 
> 
> 
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
> 
> 
> 


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI
  2025-11-20 13:21 ` Fabian Grünbichler
@ 2025-11-20 13:35   ` Fiona Ebner
  0 siblings, 0 replies; 4+ messages in thread
From: Fiona Ebner @ 2025-11-20 13:35 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

Am 20.11.25 um 2:21 PM schrieb Fabian Grünbichler:
> On November 20, 2025 11:17 am, Fiona Ebner wrote:
>> @@ -1141,14 +1135,29 @@ my sub volume_snapshot_rollback_locked {
>>  sub volume_snapshot_rollback {
>>      my ($class, $scfg, $storeid, $volname, $snap) = @_;
>>  
>> -    return $class->cluster_lock_storage(
>> -        $storeid,
>> -        $scfg->{shared},
>> -        undef,
>> -        sub {
>> -            return volume_snapshot_rollback_locked($class, $scfg, $storeid, $volname, $snap);
>> -        },
>> -    );
>> +    my $cleanup_worker;
>> +
>> +    eval {
>> +        $class->cluster_lock_storage(
>> +            $storeid,
>> +            $scfg->{shared},
>> +            undef,
>> +            sub {
>> +                volume_snapshot_rollback_locked(
>> +                    $class, $scfg, $storeid, $volname, $snap, \$cleanup_worker,
>> +                );
>> +            },
>> +        );
>> +    };
>> +    my $err = $@;
>> +
>> +    # Spawn outside of the locked section, because with 'saferemove', the cleanup worker also needs
>> +    # to obtain the lock, and in CLI context, it will be awaited synchronously, see fork_worker().
>> +    fork_cleanup_worker($cleanup_worker);
> 
> I mean, this just worked because the forked cleanup sub waited long
> enough by chance for the lock/before acquiring the lock to allow the
> parent to release it in the non-CLI case, right? i.e., it was still
> wrong, because volume_snapshot_rollback_locked might have done more work
> after forking the cleanup worker while still holding the lock.. 

Yes, that is true.

> all the other code paths here have this pattern:
> 
> my $cleanup_worker = eval { lock and do something };
> fork_cleanup_worker($cleanup_worker)
> 
> which is exactly what this patch switches to, so consider this
> 
> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

Thanks for the review!


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [pve-devel] applied: [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI
  2025-11-20 10:17 [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI Fiona Ebner
  2025-11-20 13:21 ` Fabian Grünbichler
@ 2025-11-20 14:06 ` Thomas Lamprecht
  1 sibling, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2025-11-20 14:06 UTC (permalink / raw)
  To: pve-devel, Fiona Ebner

On Thu, 20 Nov 2025 11:17:30 +0100, Fiona Ebner wrote:
> Doing a rollback via CLI on an LVM storage with 'saferemove' and
> 'snapshot-as-volume-chain' would run into a locking issue, because
> the forked zero-out worker would try to acquire the lock while the
> main CLI task is still inside the locked section for
> volume_snapshot_rollback_locked(). The same issue does not happen when
> the rollback is done via UI. The reason for this can be found in the
> note regarding fork_worker():
> 
> [...]

Applied, thanks!

[1/1] lvm plugin: fix locking for rollback when using CLI
      commit: 5001f0326951bec23336bc768a951f4983d0a94a


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-11-20 14:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-20 10:17 [pve-devel] [PATCH v2 storage] lvm plugin: fix locking for rollback when using CLI Fiona Ebner
2025-11-20 13:21 ` Fabian Grünbichler
2025-11-20 13:35   ` Fiona Ebner
2025-11-20 14:06 ` [pve-devel] applied: " Thomas Lamprecht

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal