From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.ebner@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 14E076382E
 for <pve-devel@lists.proxmox.com>; Wed, 26 Jan 2022 10:08:14 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 0A7625847
 for <pve-devel@lists.proxmox.com>; Wed, 26 Jan 2022 10:08:14 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS id 533AC583B
 for <pve-devel@lists.proxmox.com>; Wed, 26 Jan 2022 10:08:12 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 17F5A43938
 for <pve-devel@lists.proxmox.com>; Wed, 26 Jan 2022 10:08:12 +0100 (CET)
Message-ID: <e29c93d5-740a-d3d5-d6e1-67cfb89a2e2b@proxmox.com>
Date: Wed, 26 Jan 2022 10:08:06 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.4.1
Content-Language: en-US
To: pve-devel@lists.proxmox.com
References: <20220114130849.57616-1-f.ebner@proxmox.com>
From: Fabian Ebner <f.ebner@proxmox.com>
In-Reply-To: <20220114130849.57616-1-f.ebner@proxmox.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.134 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -0.001 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [proxmox.com, lxc.pm]
Subject: Re: [pve-devel] [PATCH v2 container] fix #3424: vzdump: cleanup:
 wait for active replication
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Wed, 26 Jan 2022 09:08:14 -0000

On 14.01.22 at 14:08, Fabian Ebner wrote:
> As replication and backup can happen at the same time, the vzdump
> snapshot might be actively used by replication when backup tries to
> clean up, resulting in a snapshot that is not (or only partially)
> removed and a container left with a 'snapshot-delete' lock.
> 
> Wait up to 10 minutes for any ongoing replication. If replication
> doesn't finish in time, no removal is attempted, so there is no risk
> of the container ending up in a locked state. The beginning of the
> next backup will then force-remove the left-over snapshot, which will
> very likely succeed even at the storage layer, because the replication
> really should be done by then (subsequent replications shouldn't
> matter, as they don't need to re-transfer the vzdump snapshot).
> 
> Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---

This might not be the best approach, as it doesn't cover the same edge
case with manual snapshot removal:
https://bugzilla.proxmox.com/show_bug.cgi?id=3424#c1
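
For illustration only (not the proposed fix): the same guard could be
factored into a small helper and reused by the manual removal path. The
helper name and call site below are made up; it merely restates the
pattern from the diff further down:

    use PVE::GuestHelpers;
    use PVE::LXC::Config;
    use PVE::ReplicationConfig;

    # Hypothetical helper: remove a container snapshot, but wait (up to
    # $timeout seconds) for any active replication of this guest first.
    sub remove_snapshot_guarded {
        my ($vmid, $snapname, $timeout) = @_;

        my $do_remove = sub {
            PVE::LXC::Config->snapshot_delete($vmid, $snapname, 0);
        };

        my $repl_conf = PVE::ReplicationConfig->new();
        if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
            # replication runs under the guest migration lock, so taking
            # it here serializes the removal with ongoing replication
            PVE::GuestHelpers::guest_migration_lock($vmid, $timeout // 600, $do_remove);
        } else {
            # no replication configured, remove right away
            $do_remove->();
        }
    }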

> 
> Changes from v1:
>      * Check if replication is configured first.
>      * Use "active replication" in log message.
> 
> VM backups are not affected by this, because they don't use
> storage/config snapshots, but use pve-qemu's block layer.
> 
> Decided to go for this approach rather than having replication wait on
> backup, because "full backup can take much longer than replication
> usually does", and even if we time out, we can just skip the removal
> for now and have the next backup do it.
> 
>   src/PVE/VZDump/LXC.pm | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/src/PVE/VZDump/LXC.pm b/src/PVE/VZDump/LXC.pm
> index b7f7463..5bac089 100644
> --- a/src/PVE/VZDump/LXC.pm
> +++ b/src/PVE/VZDump/LXC.pm
> @@ -8,9 +8,11 @@ use File::Path;
>   use POSIX qw(strftime);
>   
>   use PVE::Cluster qw(cfs_read_file);
> +use PVE::GuestHelpers;
>   use PVE::INotify;
>   use PVE::LXC::Config;
>   use PVE::LXC;
> +use PVE::ReplicationConfig;
>   use PVE::Storage;
>   use PVE::Tools;
>   use PVE::VZDump;
> @@ -476,8 +478,21 @@ sub cleanup {
>       }
>   
>       if ($task->{cleanup}->{remove_snapshot}) {
> -	$self->loginfo("cleanup temporary 'vzdump' snapshot");
> -	PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
> +	my $do_remove = sub {
> +	    $self->loginfo("cleanup temporary 'vzdump' snapshot");
> +	    PVE::LXC::Config->snapshot_delete($vmid, 'vzdump', 0);
> +	};
> +
> +	my $repl_conf = PVE::ReplicationConfig->new();
> +	eval {
> +	    if ($repl_conf->check_for_existing_jobs($vmid, 1)) {
> +		$self->loginfo("checking/waiting for active replication..");
> +		PVE::GuestHelpers::guest_migration_lock($vmid, 600, $do_remove);
> +	    } else {
> +		$do_remove->();
> +	    }
> +	};
> +	die "snapshot 'vzdump' was not (fully) removed - $@" if $@;
>       }
>   }
>