From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 Fabian Ebner <f.ebner@proxmox.com>
Date: Mon, 14 Dec 2020 14:47:30 +0100
Message-ID: <5c12de59-cec9-d1f3-9c3d-17a99c67e872@proxmox.com>
In-Reply-To: <20201214130039.9997-1-f.ebner@proxmox.com>
Subject: Re: [pve-devel] [PATCH zsync] fix #2821: only abort if there really
 is a waiting/syncing job instance already

On 14.12.20 14:00, Fabian Ebner wrote:
> By remembering the instance via PID and start time and checking for that
> information in later instances. If the old instance can't be found, the new
> one will continue and register itself in the state.
> 
> After updating, if there is a waiting instance running the old version, one
> more might be created, because there is no instance_id yet. But the new
> instance will set the instance_id, which any later instance will see.
> 
> More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
> because an instance was terminated before finishing, we don't abort anymore
> and recover from the wrong state, thus fixing the bug.
> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> I couldn't find a better unique identifier that can be easily verified from
> within another instance, but PID and start time should be good enough for
> the intended purpose.
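
right, a bare PID is not a stable identity, as the kernel reuses PIDs after
a process exits, so pairing it with the start time is what makes this check
reliable. Just to restate the idea outside of the diff, the comparison a
later instance does boils down to something like (sketch only, using the
helper names from the patch below):

my ($pid, $starttime) = split(/:/, $job->{instance_id}, 2);
my $still_running = defined($pid) && defined($starttime)
    && $starttime eq (get_process_start_time($pid) // '');
# if $still_running is false, a recorded 'waiting'/'syncing' state is
# stale and the new instance can safely take over the job
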
> 
> Another alternative would be to introduce job-specific locking around the
> whole sync() block, but then we would have some three lock-level deep
> code...
> 
> @Thomas: I felt like this was more complete than the "clear state after
> boot"-solution, because it also works when the processes are killed for
> different reasons than during shutdown.

that's true, and it seems like a quite nice and short approach to me, great!

> 
>  pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/pve-zsync b/pve-zsync
> index f3b98c4..506bfab 100755
> --- a/pve-zsync
> +++ b/pve-zsync
> @@ -266,6 +266,7 @@ sub add_state_to_job {
>      $job->{state} = $state->{state};
>      $job->{lsync} = $state->{lsync};
>      $job->{vm_type} = $state->{vm_type};
> +    $job->{instance_id} = $state->{instance_id};
>  
>      for (my $i = 0; $state->{"snap$i"}; $i++) {
>          $job->{"snap$i"} = $state->{"snap$i"};
> @@ -365,6 +366,7 @@ sub update_state {
>      if ($job->{state} ne "del") {
>          $state->{state} = $job->{state};
>          $state->{lsync} = $job->{lsync};
> +        $state->{instance_id} = $job->{instance_id};
>          $state->{vm_type} = $job->{vm_type};
>  
>          for (my $i = 0; $job->{"snap$i"} ; $i++) {
> @@ -584,6 +586,33 @@ sub destroy_job {
>      });
>  }
>  
> +sub get_process_start_time {
> +    my ($pid) = @_;
> +
> +    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };

instead of fork+exec do a much cheaper file read? I.e., copying over
file_read_firstline from PVE::Tools then:

sub get_process_start_time {
    my ($pid) = @_;

    my $stat_str = file_read_firstline("/proc/$pid/stat");
    my $stat = [ split(/\s+/, $stat_str) ];

    return $stat->[21]; # field 22 of /proc/<pid>/stat is the start time
}

plus some error handling (note I did not test above) - see the fleshed-out
sketch at the end of this mail.

> +}
> +
> +sub get_instance_id {
> +    my ($pid) = @_;
> +
> +    my $starttime = get_process_start_time($pid)
> +        or die "could not determine start time for process '$pid'\n";
> +
> +    return "${pid}:${starttime}";
> +}
> +
> +sub instance_exists {
> +    my ($instance_id) = @_;
> +
> +    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
> +        my ($pid, $starttime) = ($1, $2);
> +        my $actual_starttime = get_process_start_time($pid);
> +        return defined($actual_starttime) && $starttime eq $actual_starttime;
> +    }
> +
> +    return 0;
> +}
> +
>  sub sync {
>      my ($param) = @_;
>  
> @@ -593,11 +622,18 @@ sub sync {
>          eval { $job = get_job($param) };
>  
>          if ($job) {
> -            if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
> +            my $state = $job->{state} // 'ok';
> +            $state = 'ok' if !instance_exists($job->{instance_id});
> +
> +            if ($state eq "syncing" || $state eq "waiting") {
>                  die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
>              }
>  
>              $job->{state} = "waiting";
> +
> +            eval { $job->{instance_id} = get_instance_id($$); };

I'd query and cache the local instance ID from the current process on
startup, this would have the nice side effect of avoiding error potential
here completely (see the second sketch at the end of this mail).

> +            warn "Could not set instance ID - $@" if $@;
> +
>              update_state($job);
>          }
>      });
> @@ -671,6 +707,7 @@ sub sync {
>          eval { $job = get_job($param); };
>          if ($job) {
>              $job->{state} = "error";
> +            delete $job->{instance_id};
>              update_state($job);
>          }
>      });
> @@ -687,6 +724,7 @@ sub sync {
>                  $job->{state} = "ok";
>              }
>              $job->{lsync} = $date;
> +            delete $job->{instance_id};
>              update_state($job);
>          }
>      });
> 
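
For reference, the sketch from above with file_read_firstline copied over
and the missing error handling added could look like the following - again
untested, so treat it as a starting point only:

use IO::File;

# copied over from PVE::Tools, as suggested above
sub file_read_firstline {
    my ($filename) = @_;

    my $fh = IO::File->new($filename, "r");
    return undef if !$fh;

    my $res = <$fh>;
    chomp $res if $res;
    $fh->close;

    return $res;
}

sub get_process_start_time {
    my ($pid) = @_;

    my $stat_str = file_read_firstline("/proc/$pid/stat")
        or return undef; # process already exited

    # the second field (the command name in parentheses) may contain
    # whitespace, so split only the part after the closing parenthesis;
    # starttime is field 22 overall, i.e. the 20th field after the comm
    my ($rest) = $stat_str =~ m/\)\s+(.*)$/
        or return undef;
    my @fields = split(/\s+/, $rest);

    return $fields[19]; # start time in clock ticks since boot
}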
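
And for the second suggestion, caching the local instance ID once at
startup could be as simple as this (sketch, variable name chosen
arbitrarily):

# determine our own instance ID once, at startup - if this dies we abort
# early, before any job state was touched
my $local_instance_id = get_instance_id($$);

# later, inside sync(), the eval/warn can then be dropped:
# $job->{instance_id} = $local_instance_id;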