From: Fabian Ebner
To: pve-devel@lists.proxmox.com
Date: Mon, 14 Dec 2020 14:00:39 +0100
Message-Id: <20201214130039.9997-1-f.ebner@proxmox.com>
Subject: [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already

This is done by remembering the instance via PID and start time and checking
for that information in later instances. If the old instance can't be found,
the new one continues and registers itself in the state.

After updating, if a waiting instance running the old version is still around,
one more instance might be created, because there is no instance_id yet. But
the new instance will set the instance_id, which any later instance will see.

More importantly, if the state is wrongly 'waiting' or 'syncing', e.g. because
an instance was terminated before finishing, we no longer abort, but recover
from the wrong state, thus fixing the bug.

Signed-off-by: Fabian Ebner
---

I couldn't find a better unique identifier that can be easily verified from
within another instance, but PID and start time should be good enough for the
intended purpose.

Another alternative would be to introduce job-specific locking around the
whole sync() block, but then we would end up with code nested some three lock
levels deep...

@Thomas: I felt like this was more complete than the "clear state after boot"
solution, because it also works when the processes are killed for reasons
other than a shutdown.
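Purely for illustration (not part of the patch): a minimal, self-contained
sketch of the PID + start-time check, assuming a procps 'ps' that supports
'-o lstart='. The helper names are hypothetical and only mirror what the
patch below does via run_cmd():

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the start time of a process via ps; returns undef if the PID is gone.
    sub process_start_time {
        my ($pid) = @_;
        my $out = `ps -o lstart= -p $pid 2>/dev/null`;
        return undef if $? != 0;
        chomp $out;
        $out =~ s/^\s+|\s+$//g;
        return $out eq '' ? undef : $out;
    }

    # Combine PID and start time; the start time guards against PID reuse.
    sub make_instance_id {
        my ($pid) = @_;
        my $start = process_start_time($pid)
            or die "could not determine start time for process '$pid'\n";
        return "$pid:$start";
    }

    # True only if a process with that PID still exists *and* was started at
    # the recorded time, i.e. it really is the same instance.
    sub instance_alive {
        my ($id) = @_;
        return 0 unless defined($id) && $id =~ m/^([1-9][0-9]*):(.*)$/;
        my ($pid, $recorded) = ($1, $2);
        my $current = process_start_time($pid);
        return defined($current) && $current eq $recorded;
    }

    my $id = make_instance_id($$);
    print "this instance: $id\n";
    print "still alive: ", (instance_alive($id) ? "yes" : "no"), "\n";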
 pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/pve-zsync b/pve-zsync
index f3b98c4..506bfab 100755
--- a/pve-zsync
+++ b/pve-zsync
@@ -266,6 +266,7 @@ sub add_state_to_job {
     $job->{state} = $state->{state};
     $job->{lsync} = $state->{lsync};
     $job->{vm_type} = $state->{vm_type};
+    $job->{instance_id} = $state->{instance_id};
 
     for (my $i = 0; $state->{"snap$i"}; $i++) {
         $job->{"snap$i"} = $state->{"snap$i"};
@@ -365,6 +366,7 @@ sub update_state {
     if ($job->{state} ne "del") {
         $state->{state} = $job->{state};
         $state->{lsync} = $job->{lsync};
+        $state->{instance_id} = $job->{instance_id};
         $state->{vm_type} = $job->{vm_type};
 
         for (my $i = 0; $job->{"snap$i"} ; $i++) {
@@ -584,6 +586,33 @@ sub destroy_job {
     });
 }
 
+sub get_process_start_time {
+    my ($pid) = @_;
+
+    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
+}
+
+sub get_instance_id {
+    my ($pid) = @_;
+
+    my $starttime = get_process_start_time($pid)
+        or die "could not determine start time for process '$pid'\n";
+
+    return "${pid}:${starttime}";
+}
+
+sub instance_exists {
+    my ($instance_id) = @_;
+
+    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
+        my ($pid, $starttime) = ($1, $2);
+        my $actual_starttime = get_process_start_time($pid);
+        return defined($actual_starttime) && $starttime eq $actual_starttime;
+    }
+
+    return 0;
+}
+
 sub sync {
     my ($param) = @_;
 
@@ -593,11 +622,18 @@ sub sync {
         eval { $job = get_job($param) };
 
         if ($job) {
-            if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
+            my $state = $job->{state} // 'ok';
+            $state = 'ok' if !instance_exists($job->{instance_id});
+
+            if ($state eq "syncing" || $state eq "waiting") {
                 die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
             }
 
             $job->{state} = "waiting";
+
+            eval { $job->{instance_id} = get_instance_id($$); };
+            warn "Could not set instance ID - $@" if $@;
+
             update_state($job);
         }
     });
@@ -671,6 +707,7 @@ sub sync {
         eval { $job = get_job($param); };
         if ($job) {
             $job->{state} = "error";
+            delete $job->{instance_id};
             update_state($job);
         }
     });
@@ -687,6 +724,7 @@ sub sync {
                 $job->{state} = "ok";
             }
             $job->{lsync} = $date;
+            delete $job->{instance_id};
             update_state($job);
         }
     });
-- 
2.20.1