Date: Wed, 21 Dec 2022 13:41:53 +0100
From: Fabian Grünbichler
To: Proxmox VE development discussion, Stefan Hanreich
Subject: Re: [pve-devel] [PATCH pve-guest-common 1/1] partially fix #2530: snapshots: add pre/post/failed-snapshot hooks
Message-Id: <1671625943.twv14tw18p.astroid@yuna.none>
In-Reply-To: <792e1e61-9f57-1514-857c-6c933d5a9c95@proxmox.com>
References: <20221212134312.257662-1-s.hanreich@proxmox.com>
 <20221212134312.257662-3-s.hanreich@proxmox.com>
 <1671617798.6ip8zjgw5r.astroid@yuna.none>
 <792e1e61-9f57-1514-857c-6c933d5a9c95@proxmox.com>

On December 21, 2022 12:26 pm, Stefan Hanreich wrote:
> 
> On 12/21/22 11:44, Fabian Grünbichler wrote:
>> this is v2, right? ;)
> 
> Oh no - for some reason it's only in the cover letter..
> 
>> On December 12, 2022 2:43 pm, Stefan Hanreich wrote:
>>> This commit adds hooks to the snapshotting process, which can be used
>>> to run additional setup scripts that prepare the VM for snapshotting.
>>>
>>> Examples of use cases include:
>>> * forcing processes to flush their writes
>>> * blocking processes from writing
>>> * altering the configuration of the VM to make snapshotting possible
>>>
>>> The prepare step has been split into two parts, so the configuration
>>> can be locked a bit earlier during the snapshotting process. Doing it
>>> this way ensures that the configuration is already locked during the
>>> pre-snapshot hook. Because of this split, the VM config now gets
>>> written in two stages rather than one.
>>>
>>> In case of failure during the preparation step - after the lock is
>>> written - error handling has been added so the lock gets released
>>> properly. The failed-snapshot hook runs when the snapshot fails, if
>>> the pre-snapshot hook has already run. This enables users to revert
>>> any changes made by the pre-snapshot hookscript.
>> 
>> see below
>> 
>>> The preparation step assumes that the hook does not convert the
>>> current VM into a template, which is why the basic checks are not
>>> re-run after the pre-snapshot hook. The storage check runs after the
>>> pre-snapshot hook, because the hook might be used to set up the
>>> storage for snapshotting; if the hook ran after the storage checks,
>>> that would be impossible.
>>>
>>> cfs_update() gets called after every invocation of a hookscript, since
>>> it is impossible to know which changes a hookscript makes. Doing this
>>> ensures that we see the updated state of the CFS after the hookscript
>>> has been invoked.
>>>
>>> Signed-off-by: Stefan Hanreich
>>> ---
>>>  src/PVE/AbstractConfig.pm | 49 ++++++++++++++++++++++++++++++++++-----
>>>  1 file changed, 43 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
>>> index a0c0bc6..3bff600 100644
>>> --- a/src/PVE/AbstractConfig.pm
>>> +++ b/src/PVE/AbstractConfig.pm
>>> @@ -710,8 +710,7 @@ sub __snapshot_prepare {
>>>  
>>>      my $snap;
>>>  
>>> -    my $updatefn = sub {
>>> -
>>> +    my $run_checks = sub {
>>>         my $conf = $class->load_config($vmid);
>>>  
>>>         die "you can't take a snapshot if it's a template\n"
>>> @@ -721,15 +720,21 @@ sub __snapshot_prepare {
>>>  
>>>         $conf->{lock} = 'snapshot';
>>>  
>>> -       my $snapshots = $conf->{snapshots};
>>> -
>>>         die "snapshot name '$snapname' already used\n"
>>> -           if defined($snapshots->{$snapname});
>>> +           if defined($conf->{snapshots}->{$snapname});
>>> +
>>> +       $class->write_config($vmid, $conf);
>>> +    };
>>>  
>>> +    my $updatefn = sub {
>>> +       my $conf = $class->load_config($vmid);
>>>         my $storecfg = PVE::Storage::config();
>>> +
>>>         die "snapshot feature is not available\n"
>>>             if !$class->has_feature('snapshot', $conf, $storecfg, undef, undef, $snapname eq 'vzdump');
>>>  
>>> +       my $snapshots = $conf->{snapshots};
>>> +
>>>         for my $snap (sort keys %$snapshots) {
>>>             my $parent_name = $snapshots->{$snap}->{parent} // '';
>>>             if ($snapname eq $parent_name) {
>>> @@ -753,7 +758,32 @@ sub __snapshot_prepare {
>>>         $class->write_config($vmid, $conf);
>>>     };
>>>  
>>> -    $class->lock_config($vmid, $updatefn);
>>> +    $class->lock_config($vmid, $run_checks);
>>> +
>>> +    eval {
>>> +       my $conf = $class->load_config($vmid);
>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
>>> +    };
>>> +    my $err = $@;
>>> +
>>> +    PVE::Cluster::cfs_update();
>>> +
>>> +    if ($err) {
>>> +       $class->remove_lock($vmid, 'snapshot');
>>> +       die $err;
>>> +    }
>>> +
>> 
>> I wonder if we don't also want to call the 'failed-snapshot' phase when just
>> the pre-snapshot invocation failed? might be possible to combine the error
>> handling then, although I am not sure it makes it more readable if combined..
>> 
> 
> I thought about it, but I thought that if the user die's in his perl
> script he should be able to run any cleanup code before that. This
> doesn't consider any problems in the hookscript unforeseen by the user
> though, so I think your approach is better, since it is easier to use.
> This places less burden on the author of the hookscript. It might make
> the code a bit more convoluted (depending on how we want to handle
> errors in failed-snapshot), but the upsides are way better imo.
> 
> One thing that would be easier with making the user do his cleanup in
> pre-snapshot is that the pre-snapshot hook knows exactly what failed
> in pre-snapshot, so cleanup code could use that information to skip
> certain steps. But again, it assumes that pre-snapshot will properly
> handle any possible error, which might be a bit much to assume.

yes, there is always the question of whether the hookscript does (proper)
error handling.. but if it does, an additional call to failed-snapshot
shouldn't hurt, since that should be covered as well (in this case, if we
pass the information that prepare failed, it could be a no-op for example).
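
as a minimal sketch (not part of the patch, and the PVE_SNAPSHOT_FAILED_PHASE
variable is purely hypothetical - passing such context is exactly what we
discuss further down), a hookscript could dispatch on the phase argument and
turn failed-snapshot into a no-op when the failure already happened during
prepare:

    #!/usr/bin/perl
    # hookscripts receive the vmid and the phase as arguments, like the
    # shipped guest-example-hookscript.pl does
    use strict;
    use warnings;

    my ($vmid, $phase) = @ARGV;

    if ($phase eq 'pre-snapshot') {
        # freeze/flush the application here; die to abort the snapshot
    } elsif ($phase eq 'failed-snapshot') {
        # hypothetical: skip cleanup if prepare itself failed and the
        # pre-snapshot phase already cleaned up before dying
        exit 0 if ($ENV{PVE_SNAPSHOT_FAILED_PHASE} // '') eq 'prepare';
        # otherwise undo whatever pre-snapshot set up
    } elsif ($phase eq 'post-snapshot') {
        # thaw the application again
    }

    exit 0;
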
>>> +
>>> +    if (my $err = $@) {
>>> +       my $conf = $class->load_config($vmid);
>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>> 
>> this exec_hookscript needs to be inside an eval {}, with warn in case it
>> fails..
> 
> Isn't this already handled by the exec_hookscript function, since I am
> not passing $stop_on_error? It should exit with warn instead of die
> then. Maybe I am misunderstanding something.
> 
> See:
> https://git.proxmox.com/?p=pve-guest-common.git;a=blob;f=src/PVE/GuestHelpers.pm;h=b4ccbaa73a3fd08ba5d34350ebd57ee31355035b;hb=HEAD#l125

ah yeah, missed that - thanks for pointing it out :)

>> 
>> also, this call here happens when preparing for making the snapshot, after
>> possibly saving the VM state, but before taking the volume snapshots..
>> 
> 
> This should be alleviated by the envvars you proposed below, because
> then we could pass that information to the hookscript and the user
> decides what to do with it, right?

yes exactly, this is part of the issue that could be solved by passing more
information to the hook script.

>>> +       PVE::Cluster::cfs_update();
>>> +
>>> +       $class->remove_lock($vmid, 'snapshot');
>>> +       die $err;
>>> +    }
>>>  
>>>      return $snap;
>>>  }
>>> @@ -837,11 +867,18 @@ sub snapshot_create {
>>>  
>>>      if ($err) {
>>>         warn "snapshot create failed: starting cleanup\n";
>>> +
>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>> 
>> eval + warn as well
> 
> see above

indeed

>> 
>> this call here happens when the volume snapshots might or might not have
>> been created already (depending on what exactly the error cause is).
>> 
> 
> same here - should be alleviated by adding envvars, right?
> 
>>> +       PVE::Cluster::cfs_update();
>>> +
>>>         eval { $class->snapshot_delete($vmid, $snapname, 1, $drivehash); };
>>>         warn "$@" if $@;
>>>         die "$err\n";
>>>     }
>>>  
>>> +    PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot");
>> 
>> and here we have a similar issue (no eval): what should happen if
>> post-snapshot fails?
>> 
>> A die immediately (very likely wrong; the current behavior)
>> B eval + warn, but proceed with commit (possibly leaving leftover hook
>>   changes around)
>> C eval + warn, call failed-snapshot, but proceed with commit (gives the
>>   hookscript a chance to clean up, but how does it differentiate between
>>   the different failed-snapshot call sites?)
>> D eval + delete snapshot (seems suboptimal)
>> E eval + call failed-snapshot + delete snapshot (same, and also the issue
>>   of the hookscript being able to know what's going on again)
>> 
>> B and C seem most sensible to me, but C adds to the issue of "missing
>> failed-snapshot context", depending on what the hookscript is doing..
> 
> again, see above - I think it currently actually behaves like B, because
> of how exec_hookscript works, if I understand correctly.

yes. but the question is whether that is good ;) I guess switching to C would
require passing stop_on_error and wrapping in eval..
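
a rough sketch of what C could look like (untested, and assuming
exec_hookscript keeps its current ($conf, $vmid, $phase, $stop_on_error)
signature):

    # fail loudly, but never abort the commit, and give the hookscript a
    # chance to clean up via the failed phase
    eval { PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-snapshot', 1) };
    if (my $hook_err = $@) {
        warn "post-snapshot hookscript failed - $hook_err";
        # without $stop_on_error set, exec_hookscript warns instead of
        # dying, so no extra eval is needed here
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'failed-snapshot');
    }
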
> Similar idea to not running the failed-snapshot hook if pre-snapshot
> fails. I thought that the user should be aware that his hookscript
> failed at some point and run possible cleanup code before returning. As
> I said above, that's probably a worse idea than just running
> failed-snapshot. It also enables the user to have all the cleanup
> handled by failed-snapshot instead of having to add it to
> pre/post/failed.

like mentioned above, doing

    if !pre { failed }
    if !snapshot { failed }
    if !post { failed }

doesn't stop the user from handling all errors in the phase itself, and then
doing `exit 1` to fail the hookscript, with the failed phase only handling
actual "snapshotting didn't work" errors - as long as we pass along *why* we
are calling the failed phase.
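
spelled out (hypothetical Perl - run_hook() and take_snapshot() are
placeholders for the real calls, with the failure reason passed along to
the failed phase):

    eval { run_hook('pre-snapshot') };
    if ($@) { run_hook('failed-snapshot', 'pre-snapshot'); die $@ }

    eval { take_snapshot() };
    if ($@) { run_hook('failed-snapshot', 'snapshot'); die $@ }

    eval { run_hook('post-snapshot') };
    # a post failure warns and cleans up, but the snapshot is still committed
    if ($@) { run_hook('failed-snapshot', 'post-snapshot') }
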
>> one way to pass information is via the environment, we do that for the
>> migration case already (setting PVE_MIGRATED_FROM, so that the
>> pre-start/post-start hookscript can know the start happens in a migration
>> context, and where to (possibly) find the guest config)..
>> 
>> for example, we could set PVE_SNAPSHOT_PHASE here, and have
>> prepare/commit/post as sub-phases, or even pass a list of volumes already
>> snapshotted (or created, in case of vmstate), or 
> 
> That's a good idea. I'll look into sensible values for
> PVE_SNAPSHOT_PHASE, as well as into how we could best pass the
> information about volumes to the hookscript.
> 
>> obviously, setting the environment is only allowed in a forked worker
>> context, else it would affect the next API endpoint handled by the
>> pveproxy/pvedaemon/.. process, so it might be worth double-checking and
>> cleaning up to avoid side-effects with replication/migration/.. if we go
>> down that route..
>> 
> 
> very good remark - thanks. I would not have thought of it, even though it
> is kinda obvious now that you've pointed it out.

well, I just realized that with replication/migration we are not doing a full
guest snapshot anyway, so that should be irrelevant ;) it just leaves
container backups in snapshot mode, where we need to be careful to clean the
env so that the next container's hookscript execution in a single vzdump job
doesn't see wrong information.
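
one way to keep that contained (again just a sketch, with PVE_SNAPSHOT_PHASE
being the hypothetical variable from above): set the variable with local(),
so perl restores the previous value as soon as the enclosing scope is left,
and a later hookscript execution in the same worker process sees a clean
environment:

    # scope the hypothetical variable to a single hook invocation
    sub exec_snapshot_hook {
        my ($conf, $vmid, $phase, $subphase) = @_;
        local $ENV{PVE_SNAPSHOT_PHASE} = $subphase;  # restored on scope exit
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, $phase);
    }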