From: Fabian Ebner
To: Thomas Lamprecht, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Date: Thu, 17 Dec 2020 09:40:36 +0100
Subject: Re: [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already
In-Reply-To: <5c12de59-cec9-d1f3-9c3d-17a99c67e872@proxmox.com>

On 14.12.20 at 14:47, Thomas Lamprecht wrote:
> On 14.12.20 14:00, Fabian Ebner wrote:
>> By remembering the instance via PID and start time and checking for that
>> information in later instances. If the old instance can't be found, the new
>> one will continue and register itself in the state.
>>
>> After updating, if there is a waiting instance running the old version, one
>> more might be created, because there is no instance_id yet. But the new
>> instance will set the instance_id, which any later instance will see.
>>
>> More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
>> because an instance was terminated before finishing, we don't abort anymore
>> and recover from the wrong state, thus fixing the bug.
>>
>> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
>> ---
>>
>> I couldn't find a better unique identifier that can be easily verified from
>> within another instance, but PID and start time should be good enough for
>> the intended purpose.
>>
>> Another alternative would be to introduce job-specific locking around the
>> whole sync() block, but then we would have some three lock-level deep
>> code...
>>
>> @Thomas: I felt like this was more complete than the "clear state after
>> boot"-solution, because it also works when the processes are killed for
>> different reasons than during shutdown.
>
> that's true, and it seems like a quite nice and short approach to me, great!
>
>>
>>  pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/pve-zsync b/pve-zsync
>> index f3b98c4..506bfab 100755
>> --- a/pve-zsync
>> +++ b/pve-zsync
>> @@ -266,6 +266,7 @@ sub add_state_to_job {
>>      $job->{state} = $state->{state};
>>      $job->{lsync} = $state->{lsync};
>>      $job->{vm_type} = $state->{vm_type};
>> +    $job->{instance_id} = $state->{instance_id};
>>
>>      for (my $i = 0; $state->{"snap$i"}; $i++) {
>>          $job->{"snap$i"} = $state->{"snap$i"};
>> @@ -365,6 +366,7 @@ sub update_state {
>>      if ($job->{state} ne "del") {
>>          $state->{state} = $job->{state};
>>          $state->{lsync} = $job->{lsync};
>> +        $state->{instance_id} = $job->{instance_id};
>>          $state->{vm_type} = $job->{vm_type};
>>
>>          for (my $i = 0; $job->{"snap$i"} ; $i++) {
>> @@ -584,6 +586,33 @@ sub destroy_job {
>>      });
>>  }
>>
>> +sub get_process_start_time {
>> +    my ($pid) = @_;
>> +
>> +    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
>
> instead of fork+exec do a much cheaper file read?
>
> I.e., copying over file_read_firstline from PVE::Tools then:
>
> sub get_process_start_time {
>     my $stat_str = file_read_firstline("/proc/$pid/stat");
>     my $stat = [ split(/\s+/, $stat_str) ];
>
>     return $stat->[21];
> }
>
> plus some error handling (note I did not test above)
>

Agreed, although we also need to obtain the boot time (from /proc/stat) to
have the actual start time, because the value in /proc/$pid/stat is just the
number of clock ticks since boot when the process was started. But it's still
much cheaper of course.

>> +}
>> +
>> +sub get_instance_id {
>> +    my ($pid) = @_;
>> +
>> +    my $starttime = get_process_start_time($pid)
>> +        or die "could not determine start time for process '$pid'\n";
>> +
>> +    return "${pid}:${starttime}";
>> +}
>> +
>> +sub instance_exists {
>> +    my ($instance_id) = @_;
>> +
>> +    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
>> +        my ($pid, $starttime) = ($1, $2);
>> +        my $actual_starttime = get_process_start_time($pid);
>> +        return defined($actual_starttime) && $starttime eq $actual_starttime;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  sub sync {
>>      my ($param) = @_;
>>
>> @@ -593,11 +622,18 @@ sub sync {
>>      eval { $job = get_job($param) };
>>
>>      if ($job) {
>> -        if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
>> +        my $state = $job->{state} // 'ok';
>> +        $state = 'ok' if !instance_exists($job->{instance_id});
>> +
>> +        if ($state eq "syncing" || $state eq "waiting") {
>>              die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
>>          }
>>
>>          $job->{state} = "waiting";
>> +
>> +        eval { $job->{instance_id} = get_instance_id($$); };
>
> I'd query and cache the local instance ID from the current process on
> startup, this would have the nice side effect of avoiding error potential
> here completely
>

What if querying fails on startup? I'd rather have it be a non-critical
failure and continue. Then we'd still need a check here to see if the cached
instance_id is defined.
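Something like the following is what I have in mind, as a rough, untested
sketch ($cached_instance_id is just an illustrative name):

    # at startup: query once and treat failure as non-critical
    my $cached_instance_id = eval { get_instance_id($$) };
    warn "Could not determine instance ID - $@" if $@;

    # later, in sync(), only set the ID if we actually got one
    $job->{instance_id} = $cached_instance_id if defined($cached_instance_id);

An instance without an ID simply won't be recognized by later instances, so
they would proceed as if the state were 'ok', which seems acceptable as a
fallback.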
>> + warn "Could not set instance ID - $@" if $@; >> + >> update_state($job); >> } >> }); >> @@ -671,6 +707,7 @@ sub sync { >> eval { $job = get_job($param); }; >> if ($job) { >> $job->{state} = "error"; >> + delete $job->{instance_id}; >> update_state($job); >> } >> }); >> @@ -687,6 +724,7 @@ sub sync { >> $job->{state} = "ok"; >> } >> $job->{lsync} = $date; >> + delete $job->{instance_id}; >> update_state($job); >> } >> }); >> > >