From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
    (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
    key-exchange X25519 server-signature RSA-PSS (2048 bits))
    (No client certificate requested)
    by lists.proxmox.com (Postfix) with ESMTPS id B736569CF6
    for ; Thu, 25 Feb 2021 11:14:31 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
    by firstgate.proxmox.com (Proxmox) with ESMTP id AD96432CB8
    for ; Thu, 25 Feb 2021 11:14:31 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [212.186.127.180])
    (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
    key-exchange X25519 server-signature RSA-PSS (2048 bits))
    (No client certificate requested)
    by firstgate.proxmox.com (Proxmox) with ESMTPS id 2A65932CAB
    for ; Thu, 25 Feb 2021 11:14:31 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
    by proxmox-new.maurer-it.com (Proxmox) with ESMTP id E440741CCB
    for ; Thu, 25 Feb 2021 11:14:30 +0100 (CET)
Message-ID: <41302765-2bd4-28cc-bb6b-ed91cd46e23c@proxmox.com>
Date: Thu, 25 Feb 2021 11:14:29 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101
    Thunderbird/86.0
Content-Language: en-US
To: Stoiko Ivanov , pmg-devel@lists.proxmox.com
References: <20210224183109.29014-1-s.ivanov@proxmox.com>
    <20210224183109.29014-6-s.ivanov@proxmox.com>
From: Thomas Lamprecht
In-Reply-To: <20210224183109.29014-6-s.ivanov@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results: 0
    AWL -0.054 Adjusted score from AWL reputation of From: address
    KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
    NICE_REPLY_A -0.001 Looks like a legit reply (A)
    RCVD_IN_DNSWL_MED -2.3 Sender listed at https://www.dnswl.org/, medium trust
    SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record
    SPF_PASS -0.001 SPF: sender matches SPF record
Subject: Re: [pmg-devel] [PATCH pmg-api 5/5] backup: pbs: prevent race in
    concurrent backups
X-BeenThere: pmg-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox Mail Gateway development discussion
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 25 Feb 2021 10:14:31 -0000

On 24.02.21 19:31, Stoiko Ivanov wrote:
> If two pbs backup-creation calls happen simultaenously, it is possible

s/simultaenously/simultaneously/

> that the first removes the backup dir before the other is done
> creating or sending it to the pbs remote.
>
> non-PBS backups are not affected, since they create the files for
> tar in a tempdir (indexed by PID and current time).

Seems like that has a proven track record and avoids the issues this one
has, see below.

>
> Noticed while having 2 schedules to different PBS instances with the
> same interval and w/o random delay.
>
> Signed-off-by: Stoiko Ivanov
> ---
>  src/PMG/API2/PBS/Job.pm | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/src/PMG/API2/PBS/Job.pm b/src/PMG/API2/PBS/Job.pm
> index 279afbc..e5dcb9c 100644
> --- a/src/PMG/API2/PBS/Job.pm
> +++ b/src/PMG/API2/PBS/Job.pm
> @@ -303,13 +303,14 @@ __PACKAGE__->register_method ({
>
>          my $pbs = PVE::PBSClient->new($remote_config, $remote, $conf->{secret_dir});
>          my $backup_dir = "/var/lib/pmg/backup/current";
> +        my $lockfile = "/var/lock/pmg-pbs-backup.lck";
>
>          my $worker = sub {
>              my $upid = shift;
>
>              my $log = "starting update of current backup state\n";
>
> -            eval {
> +            my $create_backup = sub {
>                  -d $backup_dir || mkdir $backup_dir;
>                  PMG::Backup::pmg_backup($backup_dir, $param->{statistic});
>
> @@ -317,6 +318,10 @@ __PACKAGE__->register_method ({
>
>                  rmtree $backup_dir;
>              };
> +
> +            eval {
> +                PVE::Tools::lock_file($lockfile, undef, $create_backup);

lock_file times out after 10s. As we have multiple people running into a
20s timeout in PBS, I guess this does not solve the problem at all: the
backup that comes second in acquiring the lock can still fail every time
if a backup always needs more than 10s (maybe unlikely in your fast local
setup, not so unlikely if PBS is external, or both are slow and/or highly
loaded).

Instead of bumping that timeout to dice-roll-times-100 I'd rather use
different target backups, as mentioned yesterday in our lighthearted
off-list lunch talk about this.

Locking between runs of the same backup job could be an idea, but I'm not
too sure how many people plan to have jobs that take minutes while
scheduling them every minute. That could be something to warn about at
the end of the backup task log, if wanted (there we know the duration and
could check it against the time until the next run).

> +            };
>              my $err = $@;
>              $log .= $err if $err;
>
>
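
One possible reading of the "different target backups" idea is to give every
job its own working directory, in the spirit of the tempdir approach the
non-PBS backups use. The following is only a rough sketch under that
assumption, not the actual pmg-api change: run_backup_in_private_dir and the
callback are hypothetical names standing in for the "create backup, send it
to the remote" steps, and only the core modules File::Temp and File::Path
are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(remove_tree);

# Hypothetical helper: run one backup job inside a directory that belongs
# to this job alone. $base is the parent directory, $job_cb stands in for
# the "create backup, then send it to the remote" steps.
sub run_backup_in_private_dir {
    my ($base, $job_cb) = @_;

    # tempdir() creates a uniquely named directory, so two concurrent jobs
    # never share (or remove) each other's working tree.
    my $template = sprintf("pbs-backup-%d-XXXXXX", $$);
    my $dir = tempdir($template, DIR => $base);

    my $res = eval { $job_cb->($dir) };
    my $err = $@;

    # Remove only our own directory, whether the job succeeded or not.
    remove_tree($dir);

    die $err if $err;
    return $res;
}

# Usage: a second job starts and finishes while the first is still running.
my $base = tempdir(CLEANUP => 1);
my @seen;

run_backup_in_private_dir($base, sub {
    my ($dir_a) = @_;
    push @seen, $dir_a;

    run_backup_in_private_dir($base, sub { push @seen, $_[0]; return 1; });

    # The cleanup of the second job did not touch the first job's directory.
    die "directory of first job vanished\n" if !-d $dir_a;
    return 1;
});
```

With this layout no lock (and therefore no lock timeout) is needed for the
working directory itself; whether runs of the same job should additionally
be serialized is a separate question.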