Date: Wed, 21 Dec 2022 11:44:34 +0100
From: Fabian Grünbichler
To: Proxmox VE development discussion
References: <20221212134312.257662-1-s.hanreich@proxmox.com>
 <20221212134312.257662-3-s.hanreich@proxmox.com>
In-Reply-To: <20221212134312.257662-3-s.hanreich@proxmox.com>
Subject: Re: [pve-devel] [PATCH pve-guest-common 1/1] partially fix #2530: snapshots: add pre/post/failed-snapshot hooks

this is v2, right? ;)

On December 12, 2022 2:43 pm, Stefan Hanreich wrote:
> This commit adds hooks to the snapshotting process, which can be used
> to run additional setup scripts to prepare the VM for snapshotting.
>
> Examples for use cases include:
> * forcing processes to flush their writes
> * blocking processes from writing
> * altering the configuration of the VM to make snapshotting possible
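as a small aside, a hookscript covering these use cases could then look
roughly like the sketch below - this assumes the new phases keep the usual
two-argument (vmid, phase) calling convention of the existing
pre-start/post-start/.. phases, and the quiesce/resume commands are just
made-up placeholders:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # hookscripts get the guest ID and the phase as arguments
    my ($vmid, $phase) = @ARGV;

    if ($phase eq 'pre-snapshot') {
        # tell whatever runs in/around the guest to flush and pause writes -
        # a non-zero exit fails the hook and aborts the snapshot
        system('/usr/local/bin/myapp-quiesce', $vmid) == 0
            or die "quiescing guest $vmid failed\n";
    } elsif ($phase eq 'post-snapshot' || $phase eq 'failed-snapshot') {
        # undo whatever pre-snapshot changed, on success as well as on failure
        system('/usr/local/bin/myapp-resume', $vmid) == 0
            or warn "resuming writes for guest $vmid failed\n";
    }

    exit(0);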
>
> The prepare step has been split into two parts, so the configuration
> can be locked a bit earlier during the snapshotting process. Doing it
> this way ensures that the configuration is already locked during the
> pre-snapshot hook. Because of this split, the VM config gets written
> in two stages now, rather than one.
>
> In case of failure during the preparation step - after the lock is
> written - error handling has been added so the lock gets released
> properly. The failed-snapshot hook runs when the snapshot fails, if
> the pre-snapshot hook ran already. This enables users to revert any
> changes done during the pre-snapshot hookscript.

see below

> The preparation step assumes that the hook does not convert the
> current VM into a template, which is why the basic checks are not
> re-run after the pre-snapshot hook. The storage check runs after the
> pre-snapshot hook, because the hook might get used to setup the
> storage for snapshotting. If the hook would run after the storage
> checks, this becomes impossible.
>
> cfs_update() gets called after every invocation of a hookscript, since
> it is impossible to know which changes get made by the hookscript.
> Doing this ensures that we see the updated state of the CFS after the
> hookscript got invoked.
>
> Signed-off-by: Stefan Hanreich
> ---
>  src/PVE/AbstractConfig.pm | 49 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
> index a0c0bc6..3bff600 100644
> --- a/src/PVE/AbstractConfig.pm
> +++ b/src/PVE/AbstractConfig.pm
> @@ -710,8 +710,7 @@ sub __snapshot_prepare {
>
>      my $snap;
>
> -    my $updatefn = sub {
> -
> +    my $run_checks = sub {
>         my $conf = $class->load_config($vmid);
>
>         die "you can't take a snapshot if it's a template\n"
> @@ -721,15 +720,21 @@ sub __snapshot_prepare {
>
>         $conf->{lock} = 'snapshot';
>
> -        my $snapshots = $conf->{snapshots};
> -
>         die "snapshot name '$snapname' already used\n"
> -            if defined($snapshots->{$snapname});
> +            if defined($conf->{snapshots}->{$snapname});
> +
> +        $class->write_config($vmid, $conf);
> +    };
>
> +    my $updatefn = sub {
> +        my $conf = $class->load_config($vmid);
>         my $storecfg = PVE::Storage::config();
> +
>         die "snapshot feature is not available\n"
>             if !$class->has_feature('snapshot', $conf, $storecfg, undef, undef, $snapname eq 'vzdump');
>
> +        my $snapshots = $conf->{snapshots};
> +
>         for my $snap (sort keys %$snapshots) {
>             my $parent_name = $snapshots->{$snap}->{parent} // '';
>             if ($snapname eq $parent_name) {
> @@ -753,7 +758,32 @@ sub __snapshot_prepare {
>         $class->write_config($vmid, $conf);
>     };
>
> -    $class->lock_config($vmid, $updatefn);
> +    $class->lock_config($vmid, $run_checks);
> +
> +    eval {
> +        my $conf = $class->load_config($vmid);
> +        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
> +    };
> +    my $err = $@;
> +
> +    PVE::Cluster::cfs_update();
> +
> +    if ($err) {
> +        $class->remove_lock($vmid, 'snapshot');
> +        die $err;
> +    }
> +

I wonder if we don't also want to call the 'failed-snapshot' phase when just
the pre-snapshot invocation failed? might be possible to combine the error
handling then, although I am not sure it makes it more readable if combined..

> +
> +    if (my $err = $@) {
> +        my $conf = $class->load_config($vmid);
> +        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");

this exec_hookscript needs to be inside an eval {}, with warn in case it
fails..
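i.e., roughly like this (just a sketch based on the hunk above):

    if (my $err = $@) {
        my $conf = $class->load_config($vmid);

        # the cleanup hook must not be able to abort the cleanup itself,
        # so just warn if it fails instead of dying
        eval { PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot") };
        warn "failed-snapshot hook failed - $@" if $@;

        PVE::Cluster::cfs_update();

        $class->remove_lock($vmid, 'snapshot');
        die $err;
    }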
also, this call here happens when preparing for making the snapshot, after
possibly saving the VM state, but before taking the volume snapshots..

> +        PVE::Cluster::cfs_update();
> +
> +        $class->remove_lock($vmid, 'snapshot');
> +        die $err;
> +    }
>
>     return $snap;
> }
> @@ -837,11 +867,18 @@ sub snapshot_create {
>
>     if ($err) {
>         warn "snapshot create failed: starting cleanup\n";
> +
> +        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");

eval + warn as well

this call here happens when the volume snapshots might or might not have been
created already (depending on what exactly the error cause is).

> +        PVE::Cluster::cfs_update();
> +
>         eval { $class->snapshot_delete($vmid, $snapname, 1, $drivehash); };
>         warn "$@" if $@;
>         die "$err\n";
>     }
>
> +    PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot");

and here we have a similar issue (no eval), what should happen if
post-snapshot fails?

A die immediately (very likely wrong, current)
B eval + warn but proceed with commit (possibly leaving leftover hook changes
around)
C eval + warn, call failed-snapshot but proceed with commit (gives the
hookscript a chance to cleanup, but how does it differentiate between the
different failed-snapshot call sites?)
D eval + delete snapshot (seems suboptimal)
E eval + call failed-snapshot + delete snapshot (same, and also the issue of
the hookscript being able to know what's going on again)

B and C seem most sensible to me, but C adds to the issue of "missing
failed-snapshot context", depending on what the hookscript is doing..

one way to pass information is via the environment, we do that for the
migration case already (setting PVE_MIGRATED_FROM, so that the
pre-start/post-start hookscript can know the start happens in a migration
context, and where to (possibly) find the guest config..

for example, we could set PVE_SNAPSHOT_PHASE here, and have
prepare/commit/post as sub-phases, or even pass a list of volumes already
snapshotted (or created, in case of vmstate), or .. (see the sketch at the
bottom of this mail)

obviously setting the environment is only allowed in a forked worker context,
else it would affect the next API endpoint handled by the
pveproxy/pvedaemon/.. process, so it might be worth double-checking and
cleaning up to avoid side-effects with replication/migration/.. if we go down
that route..

> +    PVE::Cluster::cfs_update();
> +
>     $class->__snapshot_commit($vmid, $snapname);
> }
>
> --
> 2.30.2
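to make the environment idea above a bit more concrete, a rough sketch - note
that PVE_SNAPSHOT_PHASE and the 'commit' sub-phase are just made-up examples,
and that this assumes we are running inside a forked worker (same constraint
as the existing PVE_MIGRATED_FROM handling):

    # wrapper around exec_hookscript that exposes the sub-phase to the script
    my $run_snapshot_hook = sub {
        my ($class, $vmid, $phase, $subphase) = @_;

        # 'local' restores the previous value when the sub returns, so the
        # worker's environment does not leak into unrelated code paths
        local $ENV{PVE_SNAPSHOT_PHASE} = $subphase;

        my $conf = $class->load_config($vmid);
        eval { PVE::GuestHelpers::exec_hookscript($conf, $vmid, $phase) };
        warn "$phase hook failed - $@" if $@;
    };

    # e.g. in snapshot_create's error path:
    # $run_snapshot_hook->($class, $vmid, 'failed-snapshot', 'commit');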