From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 21 Dec 2022 13:57:10 +0100
From: Stefan Hanreich
To: Fabian Grünbichler,
 Proxmox VE development discussion <pve-devel@lists.proxmox.com>
In-Reply-To: <1671625943.twv14tw18p.astroid@yuna.none>
References: <20221212134312.257662-1-s.hanreich@proxmox.com>
 <20221212134312.257662-3-s.hanreich@proxmox.com>
 <1671617798.6ip8zjgw5r.astroid@yuna.none>
 <792e1e61-9f57-1514-857c-6c933d5a9c95@proxmox.com>
 <1671625943.twv14tw18p.astroid@yuna.none>
Subject: Re: [pve-devel] [PATCH pve-guest-common 1/1] partially fix #2530:
 snapshots: add pre/post/failed-snapshot hooks

On 12/21/22 13:41, Fabian Grünbichler wrote:
> On December 21, 2022 12:26 pm, Stefan Hanreich wrote:
>>
>> On 12/21/22 11:44, Fabian Grünbichler wrote:
>>> this is v2, right? ;)
>>
>> Oh no - for some reason it's only in the cover letter..
>>
>>>
>>> On December 12, 2022 2:43 pm, Stefan Hanreich wrote:
>>>> This commit adds hooks to the snapshotting process, which can be used
>>>> to run additional setup scripts to prepare the VM for snapshotting.
>>>>
>>>> Examples for use cases include:
>>>> * forcing processes to flush their writes
>>>> * blocking processes from writing
>>>> * altering the configuration of the VM to make snapshotting possible
>>>>
>>>> The prepare step has been split into two parts, so the configuration
>>>> can be locked a bit earlier during the snapshotting process.
>>>> Doing it this way ensures that the configuration is already locked
>>>> during the pre-snapshot hook. Because of this split, the VM config
>>>> gets written in two stages now, rather than one.
>>>>
>>>> In case of failure during the preparation step - after the lock is
>>>> written - error handling has been added so the lock gets released
>>>> properly. The failed-snapshot hook runs when the snapshot fails, if
>>>> the pre-snapshot hook has already run. This enables users to revert
>>>> any changes made by the pre-snapshot hookscript.
>>>
>>> see below
>>>
>>>> The preparation step assumes that the hook does not convert the
>>>> current VM into a template, which is why the basic checks are not
>>>> re-run after the pre-snapshot hook. The storage check runs after the
>>>> pre-snapshot hook, because the hook might get used to set up the
>>>> storage for snapshotting. If the hook ran after the storage checks,
>>>> this would be impossible.
>>>>
>>>> cfs_update() gets called after every invocation of a hookscript, since
>>>> it is impossible to know which changes the hookscript makes. Doing
>>>> this ensures that we see the updated state of the CFS after the
>>>> hookscript got invoked.
>>>>
>>>> Signed-off-by: Stefan Hanreich
>>>> ---
>>>>  src/PVE/AbstractConfig.pm | 49 ++++++++++++++++++++++++++++++++++-----
>>>>  1 file changed, 43 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
>>>> index a0c0bc6..3bff600 100644
>>>> --- a/src/PVE/AbstractConfig.pm
>>>> +++ b/src/PVE/AbstractConfig.pm
>>>> @@ -710,8 +710,7 @@ sub __snapshot_prepare {
>>>>
>>>>      my $snap;
>>>>
>>>> -    my $updatefn = sub {
>>>> -
>>>> +    my $run_checks = sub {
>>>>         my $conf = $class->load_config($vmid);
>>>>
>>>>         die "you can't take a snapshot if it's a template\n"
>>>> @@ -721,15 +720,21 @@ sub __snapshot_prepare {
>>>>
>>>>         $conf->{lock} = 'snapshot';
>>>>
>>>> -       my $snapshots = $conf->{snapshots};
>>>> -
>>>>         die "snapshot name '$snapname' already used\n"
>>>> -           if defined($snapshots->{$snapname});
>>>> +           if defined($conf->{snapshots}->{$snapname});
>>>> +
>>>> +       $class->write_config($vmid, $conf);
>>>> +    };
>>>>
>>>> +    my $updatefn = sub {
>>>> +       my $conf = $class->load_config($vmid);
>>>>         my $storecfg = PVE::Storage::config();
>>>> +
>>>>         die "snapshot feature is not available\n"
>>>>             if !$class->has_feature('snapshot', $conf, $storecfg, undef, undef, $snapname eq 'vzdump');
>>>>
>>>> +       my $snapshots = $conf->{snapshots};
>>>> +
>>>>         for my $snap (sort keys %$snapshots) {
>>>>             my $parent_name = $snapshots->{$snap}->{parent} // '';
>>>>             if ($snapname eq $parent_name) {
>>>> @@ -753,7 +758,32 @@
>>>>         $class->write_config($vmid, $conf);
>>>>      };
>>>>
>>>> -    $class->lock_config($vmid, $updatefn);
>>>> +    $class->lock_config($vmid, $run_checks);
>>>> +
>>>> +    eval {
>>>> +       my $conf = $class->load_config($vmid);
>>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
>>>> +    };
>>>> +    my $err = $@;
>>>> +
>>>> +    PVE::Cluster::cfs_update();
>>>> +
>>>> +    if ($err) {
>>>> +       $class->remove_lock($vmid, 'snapshot');
>>>> +       die $err;
>>>> +    }
>>>> +
>>>
>>> I wonder if we don't also want to call the 'failed-snapshot' phase when just the
>>> pre-snapshot invocation failed? might be possible to combine the error handling
>>> then, although I am not sure it makes it more readable if combined..
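
Combining it could look roughly like this, based on the hunk above (just a
sketch, not tested):

    # sketch: also run failed-snapshot when the pre-snapshot hook itself
    # failed, so the hookscript can undo whatever it already did
    eval {
        my $conf = $class->load_config($vmid);
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
    };
    my $err = $@;

    PVE::Cluster::cfs_update();

    if ($err) {
        my $conf = $class->load_config($vmid);
        # without $stop_on_error, exec_hookscript warns instead of dying
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
        PVE::Cluster::cfs_update();
        $class->remove_lock($vmid, 'snapshot');
        die $err;
    }

With the envvars discussed further down, the hookscript could then tell
this call site apart from the later ones.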
>>
>> I thought about it, but I thought that if the user calls die in his
>> Perl script, he should be able to run any cleanup code before that.
>> This doesn't consider any problems in the hookscript unforeseen by the
>> user though, so I think your approach is better, since it is easier to
>> use. This places less burden on the author of the hookscript. It might
>> make the code a bit more convoluted (depending on how we want to handle
>> errors in failed-snapshot), but the upsides are way better imo.
>>
>> One thing that would be easier with making the user do his cleanup in
>> pre-snapshot is that the pre-snapshot hook knows exactly what failed
>> in pre-snapshot, so the cleanup code could use that information to
>> skip certain steps. But again, it assumes that pre-snapshot will
>> properly handle any possible error, which might be a bit much to assume.
>
> yes, there is always the question of whether the hookscript does (proper) error
> handling.. but if it does, an additional call to failed-snapshot shouldn't hurt
> since that should be covered as well (in this case, if we pass the information
> that prepare failed, it could be a no-op for example).
>
>>>> +
>>>> +    if (my $err = $@) {
>>>> +       my $conf = $class->load_config($vmid);
>>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>>>
>>> this exec_hookscript needs to be inside an eval {}, with warn in case it fails..
>>
>> Isn't this already handled by the exec_hookscript function, since I am
>> not passing $stop_on_error? It should exit with warn instead of die
>> then. Maybe I am misunderstanding something.
>>
>> See:
>> https://git.proxmox.com/?p=pve-guest-common.git;a=blob;f=src/PVE/GuestHelpers.pm;h=b4ccbaa73a3fd08ba5d34350ebd57ee31355035b;hb=HEAD#l125
>
> ah yeah, missed that - thanks for pointing it out :)
>
>>>
>>> also, this call here happens when preparing for making the snapshot, after
>>> possibly saving the VM state, but before taking the volume snapshots..
>>>
>>
>> This should be alleviated by the envvars you proposed below, because
>> then we could pass that information to the hookscript and the user
>> decides what to do with this information, right?
>
> yes exactly, this is part of the issue that could be solved by passing more
> information to the hook script.
>
>>>> +       PVE::Cluster::cfs_update();
>>>> +
>>>> +       $class->remove_lock($vmid, 'snapshot');
>>>> +       die $err;
>>>> +    }
>>>>
>>>>      return $snap;
>>>>  }
>>>> @@ -837,11 +867,18 @@ sub snapshot_create {
>>>>
>>>>      if ($err) {
>>>>         warn "snapshot create failed: starting cleanup\n";
>>>> +
>>>> +       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
>>>
>>> eval + warn as well
>>
>> see above
>
> indeed
>
>>>
>>> this call here happens when the volume snapshots might or might not have been
>>> created already (depending on what exactly the error cause is).
>>>
>>
>> same here - should be alleviated by adding envvars, right?
>>
>>>> +       PVE::Cluster::cfs_update();
>>>> +
>>>>         eval { $class->snapshot_delete($vmid, $snapname, 1, $drivehash); };
>>>>         warn "$@" if $@;
>>>>         die "$err\n";
>>>>      }
>>>>
>>>> +    PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot");
>>>
>>> and here we have a similar issue (no eval), what should happen if post-snapshot
>>> fails?
>>>
>>> A die immediately (very likely wrong, current)
>>> B eval + warn but proceed with commit (possibly leaving leftover hook changes around)
>>> C eval + warn, call failed-snapshot but proceed with commit (gives the
>>>   hookscript a chance to clean up, but how does it differentiate between the
>>>   different failed-snapshot call sites?)
>>> D eval + delete snapshot (seems suboptimal)
>>> E eval + call failed-snapshot + delete snapshot (same, and also the issue of the
>>>   hookscript being able to know what's going on again)
>>>
>>> B and C seem most sensible to me, but C adds to the issue of "missing
>>> failed-snapshot context", depending on what the hookscript is doing..
>>
>> again, see above - I think it currently actually behaves like B because
>> of how exec_hookscript works, if I understand correctly.
>
> yes. but the question is whether that is good ;) I guess switching to C would
> require passing stop_on_error and wrapping in eval..
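
Something along these lines, I guess (again only a sketch, relying on
exec_hookscript just warning when called without $stop_on_error):

    # sketch of option C: warn, give the hookscript a chance to clean up,
    # but proceed with committing the snapshot either way
    eval {
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot", 1);
    };
    if (my $err = $@) {
        warn $err;
        # without $stop_on_error this call only warns if it fails itself
        PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");
        PVE::Cluster::cfs_update();
    }
    # ... continue with committing the snapshot as before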
>> Similar idea to not running the failed-snapshot hook if pre-snapshot
>> fails. I thought that the user should be aware that his hookscript
>> failed at some point and run possible cleanup code before returning. As
>> I said above, that's probably a worse idea than just running
>> failed-snapshot. It also enables the user to have all the cleanup
>> handled by failed-snapshot instead of having to add it to pre/post/failed.
>
> like mentioned above, doing
>
> if !pre {
>     failed
> }
>
> if !snapshot {
>     failed
> }
>
> if !post {
>     failed
> }
>
> doesn't stop the user from handling all errors in the phase itself, and then
> doing `exit 1` to fail the hookscript, with the failed phase only handling
> actual "snapshotting didn't work" errors - as long as we pass along *why* we are
> calling the failed phase.

This sounds like the best way to implement it (and would then behave like
you described in option C). I will implement it this way, together with the
envvars you proposed. Again, thanks for the feedback!

>>> one way to pass information is via the environment, we do that for the migration
>>> case already (setting PVE_MIGRATED_FROM, so that the pre-start/post-start
>>> hookscript can know the start happens in a migration context, and where to
>>> (possibly) find the guest config)..
>>>
>>> for example, we could set PVE_SNAPSHOT_PHASE here, and have prepare/commit/post
>>> as sub-phases, or even pass a list of volumes already snapshotted (or created,
>>> in case of vmstate), or
>>
>> That's a good idea, I'll look into sensible values for
>> PVE_SNAPSHOT_PHASE as well as into how we could best pass the
>> information about volumes to the hookscript.
>>
>>> obviously setting the environment is only allowed in a forked worker context,
>>> else it would affect the next API endpoint handled by the pveproxy/pvedaemon/..
>>> process, so it might be worth double-checking and cleaning up to avoid
>>> side-effects with replication/migration/.. if we go down that route..
>>>
>>
>> very good remark - thanks. I would not have thought of it, even though it
>> is kinda obvious now that you pointed it out.
>
> well, I just realized that with replication/migration we are not doing a full
> guest snapshot anyway, so that should be irrelevant ;) it just leaves container
> backups in snapshot mode, where we need to be careful to clean the env so that
> the next container's hookscript execution in a single vzdump job doesn't see
> wrong information.
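
For the hookscript side, I would imagine something along these lines (a
rough sketch; vmid and phase are the arguments hookscripts already receive,
PVE_SNAPSHOT_PHASE is only the variable proposed above and does not exist
yet, and myapp-freeze/myapp-thaw are stand-ins for whatever the guest
actually needs):

    #!/usr/bin/perl
    # example guest hookscript handling the new snapshot phases
    use strict;
    use warnings;

    my ($vmid, $phase) = @ARGV;

    if ($phase eq 'pre-snapshot') {
        # quiesce the application; a non-zero exit aborts the snapshot
        system('myapp-freeze') == 0 or exit(1);
    } elsif ($phase eq 'failed-snapshot') {
        # the proposed PVE_SNAPSHOT_PHASE would tell us how far the
        # snapshot got before it failed
        my $failed_during = $ENV{PVE_SNAPSHOT_PHASE} // 'unknown';
        warn "VM $vmid: snapshot failed during '$failed_during', thawing\n";
        system('myapp-thaw');
    } elsif ($phase eq 'post-snapshot') {
        # snapshot succeeded, undo the preparation
        system('myapp-thaw') == 0 or warn "could not thaw application\n";
    }

    exit(0);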