public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already
@ 2020-12-14 13:00 Fabian Ebner
  2020-12-14 13:47 ` Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Fabian Ebner @ 2020-12-14 13:00 UTC (permalink / raw)
  To: pve-devel

By remembering the instance via PID and start time and checking for that
information in later instances. If the old instance can't be found, the new one
will continue and register itself in the state.

After updating, if there is a waiting instance running the old version, one more
might be created, because there is no instance_id yet. But the new instance will
set the instance_id, which any later instance will see.

More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
because an instance was terminated before finishing, we don't abort anymore and
recover from the wrong state, thus fixing the bug.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

I couldn't find a better unique identifier that can be easily verified from
within another instance, but PID and start time should be good enough for the
intended purpose.

Another alternative would be to introduce job-specific locking around the whole
sync() block, but then we would end up with code nested three lock levels deep...

@Thomas: I felt like this was more complete than the "clear state after boot"-
solution, because it also works when the processes are killed for different
reasons than during shutdown.
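For illustration, the approach can be sketched outside the Perl script — here in
Python, with function names mirroring the patch but otherwise hypothetical; the
`ps -o lstart=` invocation and the `pid:starttime` format follow the diff below:

```python
import subprocess

def get_process_start_time(pid):
    # 'lstart=' prints the full start time with no header line;
    # ps exits non-zero if no such process exists
    try:
        out = subprocess.run(
            ["ps", "-o", "lstart=", "-p", str(pid)],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return None
    return out.stdout.strip() or None

def get_instance_id(pid):
    starttime = get_process_start_time(pid)
    if starttime is None:
        raise RuntimeError(f"could not determine start time for process '{pid}'")
    return f"{pid}:{starttime}"

def instance_exists(instance_id):
    # only a live process with the *recorded* start time counts, so a
    # recycled PID belonging to a different process does not match
    if not instance_id:
        return False
    pid, sep, starttime = instance_id.partition(":")
    if not sep or not pid.isdigit() or pid.startswith("0"):
        return False
    return get_process_start_time(int(pid)) == starttime
```

Since `lstart` itself contains colons, only the first colon separates the PID
from the start time, matching the `^([1-9][0-9]*):(.*)$` regex in the patch.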

 pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/pve-zsync b/pve-zsync
index f3b98c4..506bfab 100755
--- a/pve-zsync
+++ b/pve-zsync
@@ -266,6 +266,7 @@ sub add_state_to_job {
     $job->{state} = $state->{state};
     $job->{lsync} = $state->{lsync};
     $job->{vm_type} = $state->{vm_type};
+    $job->{instance_id} = $state->{instance_id};
 
     for (my $i = 0; $state->{"snap$i"}; $i++) {
 	$job->{"snap$i"} = $state->{"snap$i"};
@@ -365,6 +366,7 @@ sub update_state {
     if ($job->{state} ne "del") {
 	$state->{state} = $job->{state};
 	$state->{lsync} = $job->{lsync};
+	$state->{instance_id} = $job->{instance_id};
 	$state->{vm_type} = $job->{vm_type};
 
 	for (my $i = 0; $job->{"snap$i"} ; $i++) {
@@ -584,6 +586,33 @@ sub destroy_job {
     });
 }
 
+sub get_process_start_time {
+    my ($pid) = @_;
+
+    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
+}
+
+sub get_instance_id {
+    my ($pid) = @_;
+
+    my $starttime = get_process_start_time($pid)
+	or die "could not determine start time for process '$pid'\n";
+
+    return "${pid}:${starttime}";
+}
+
+sub instance_exists {
+    my ($instance_id) = @_;
+
+    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
+	my ($pid, $starttime) = ($1, $2);
+	my $actual_starttime = get_process_start_time($pid);
+	return defined($actual_starttime) && $starttime eq $actual_starttime;
+    }
+
+    return 0;
+}
+
 sub sync {
     my ($param) = @_;
 
@@ -593,11 +622,18 @@ sub sync {
 	eval { $job = get_job($param) };
 
 	if ($job) {
-	    if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
+	    my $state = $job->{state} // 'ok';
+	    $state = 'ok' if !instance_exists($job->{instance_id});
+
+	    if ($state eq "syncing" || $state eq "waiting") {
 		die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
 	    }
 
 	    $job->{state} = "waiting";
+
+	    eval { $job->{instance_id} = get_instance_id($$); };
+	    warn "Could not set instance ID - $@" if $@;
+
 	    update_state($job);
 	}
     });
@@ -671,6 +707,7 @@ sub sync {
 		eval { $job = get_job($param); };
 		if ($job) {
 		    $job->{state} = "error";
+		    delete $job->{instance_id};
 		    update_state($job);
 		}
 	    });
@@ -687,6 +724,7 @@ sub sync {
 		    $job->{state} = "ok";
 		}
 		$job->{lsync} = $date;
+		delete $job->{instance_id};
 		update_state($job);
 	    }
 	});
-- 
2.20.1

* Re: [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already
From: Thomas Lamprecht @ 2020-12-14 13:47 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Ebner

On 14.12.20 14:00, Fabian Ebner wrote:
> By remembering the instance via PID and start time and checking for that
> information in later instances. If the old instance can't be found, the new one
> will continue and register itself in the state.
> 
> After updating, if there is a waiting instance running the old version, one more
> might be created, because there is no instance_id yet. But the new instance will
> set the instance_id, which any later instance will see.
> 
> More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
> because an instance was terminated before finishing, we don't abort anymore and
> recover from the wrong state, thus fixing the bug.
> 
> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
> ---
> 
> I couldn't find a better unique identifier that can be easily verified from
> within another instance, but PID and start time should be good enough for the
> intended purpose.
> 
> Another alternative would be to introduce job-specific locking around the whole
> sync() block, but then we would have some three lock-level deep code...
> 
> @Thomas: I felt like this was more complete than the "clear state after boot"-
> solution, because it also works when the processes are killed for different
> reasons than during shutdown.

that's true, and it seems like a quite nice and short approach to me, great!

> 
>  pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/pve-zsync b/pve-zsync
> index f3b98c4..506bfab 100755
> --- a/pve-zsync
> +++ b/pve-zsync
> @@ -266,6 +266,7 @@ sub add_state_to_job {
>      $job->{state} = $state->{state};
>      $job->{lsync} = $state->{lsync};
>      $job->{vm_type} = $state->{vm_type};
> +    $job->{instance_id} = $state->{instance_id};
>  
>      for (my $i = 0; $state->{"snap$i"}; $i++) {
>  	$job->{"snap$i"} = $state->{"snap$i"};
> @@ -365,6 +366,7 @@ sub update_state {
>      if ($job->{state} ne "del") {
>  	$state->{state} = $job->{state};
>  	$state->{lsync} = $job->{lsync};
> +	$state->{instance_id} = $job->{instance_id};
>  	$state->{vm_type} = $job->{vm_type};
>  
>  	for (my $i = 0; $job->{"snap$i"} ; $i++) {
> @@ -584,6 +586,33 @@ sub destroy_job {
>      });
>  }
>  
> +sub get_process_start_time {
> +    my ($pid) = @_;
> +
> +    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };

instead of fork+exec do a much cheaper file read?

I.e., copying over file_read_firstline from PVE::Tools then:

sub get_process_start_time {
    my $stat_str = file_read_firstline("/proc/$pid/stat");
    my $stat = [ split(/\s+/, $stat_str) ];

    return $stat->[21];
}

plus some error handling (note I did not test above)

> +}
> +
> +sub get_instance_id {
> +    my ($pid) = @_;
> +
> +    my $starttime = get_process_start_time($pid)
> +	or die "could not determine start time for process '$pid'\n";
> +
> +    return "${pid}:${starttime}";
> +}
> +
> +sub instance_exists {
> +    my ($instance_id) = @_;
> +
> +    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
> +	my ($pid, $starttime) = ($1, $2);
> +	my $actual_starttime = get_process_start_time($pid);
> +	return defined($actual_starttime) && $starttime eq $actual_starttime;
> +    }
> +
> +    return 0;
> +}
> +
>  sub sync {
>      my ($param) = @_;
>  
> @@ -593,11 +622,18 @@ sub sync {
>  	eval { $job = get_job($param) };
>  
>  	if ($job) {
> -	    if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
> +	    my $state = $job->{state} // 'ok';
> +	    $state = 'ok' if !instance_exists($job->{instance_id});
> +
> +	    if ($state eq "syncing" || $state eq "waiting") {
>  		die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
>  	    }
>  
>  	    $job->{state} = "waiting";
> +
> +	    eval { $job->{instance_id} = get_instance_id($$); };

I'd query and cache the local instance ID from the current process on startup; this
would have the nice side effect of avoiding error potential here completely

> +	    warn "Could not set instance ID - $@" if $@;
> +
>  	    update_state($job);
>  	}
>      });
> @@ -671,6 +707,7 @@ sub sync {
>  		eval { $job = get_job($param); };
>  		if ($job) {
>  		    $job->{state} = "error";
> +		    delete $job->{instance_id};
>  		    update_state($job);
>  		}
>  	    });
> @@ -687,6 +724,7 @@ sub sync {
>  		    $job->{state} = "ok";
>  		}
>  		$job->{lsync} = $date;
> +		delete $job->{instance_id};
>  		update_state($job);
>  	    }
>  	});
> 

* Re: [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already
From: Fabian Ebner @ 2020-12-17  8:40 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion

Am 14.12.20 um 14:47 schrieb Thomas Lamprecht:
> On 14.12.20 14:00, Fabian Ebner wrote:
>> By remembering the instance via PID and start time and checking for that
>> information in later instances. If the old instance can't be found, the new one
>> will continue and register itself in the state.
>>
>> After updating, if there is a waiting instance running the old version, one more
>> might be created, because there is no instance_id yet. But the new instance will
>> set the instance_id, which any later instance will see.
>>
>> More importantly, if the state is wrongly 'waiting' or 'syncing', e.g.
>> because an instance was terminated before finishing, we don't abort anymore and
>> recover from the wrong state, thus fixing the bug.
>>
>> Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
>> ---
>>
>> I couldn't find a better unique identifier that can be easily verified from
>> within another instance, but PID and start time should be good enough for the
>> intended purpose.
>>
>> Another alternative would be to introduce job-specific locking around the whole
>> sync() block, but then we would have some three lock-level deep code...
>>
>> @Thomas: I felt like this was more complete than the "clear state after boot"-
>> solution, because it also works when the processes are killed for different
>> reasons than during shutdown.
> 
> that's true, and it seems like a quite nice and short approach to me, great!
> 
>>
>>   pve-zsync | 40 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/pve-zsync b/pve-zsync
>> index f3b98c4..506bfab 100755
>> --- a/pve-zsync
>> +++ b/pve-zsync
>> @@ -266,6 +266,7 @@ sub add_state_to_job {
>>       $job->{state} = $state->{state};
>>       $job->{lsync} = $state->{lsync};
>>       $job->{vm_type} = $state->{vm_type};
>> +    $job->{instance_id} = $state->{instance_id};
>>   
>>       for (my $i = 0; $state->{"snap$i"}; $i++) {
>>   	$job->{"snap$i"} = $state->{"snap$i"};
>> @@ -365,6 +366,7 @@ sub update_state {
>>       if ($job->{state} ne "del") {
>>   	$state->{state} = $job->{state};
>>   	$state->{lsync} = $job->{lsync};
>> +	$state->{instance_id} = $job->{instance_id};
>>   	$state->{vm_type} = $job->{vm_type};
>>   
>>   	for (my $i = 0; $job->{"snap$i"} ; $i++) {
>> @@ -584,6 +586,33 @@ sub destroy_job {
>>       });
>>   }
>>   
>> +sub get_process_start_time {
>> +    my ($pid) = @_;
>> +
>> +    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
> 
> instead of fork+exec do a much cheaper file read?
> 
> I.e., copying over file_read_firstline from PVE::Tools then:
> 
> sub get_process_start_time {
>      my $stat_str = file_read_firstline("/proc/$pid/stat");
>      my $stat = [ split(/\s+/, $stat_str) ];
> 
>      return $stat->[21];
> }
> 
> plus some error handling (note I did not test above)
> 

Agreed, although we also need to obtain the boot time (from /proc/stat) 
to have the actual start time, because the value in /proc/$pid/stat is 
just the number of clock ticks since boot when the process was started. 
But it's still much cheaper of course.
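The combined read can be sketched like this (illustrative Python, not the
script's Perl; `btime` and the field numbering follow proc(5), and the procfs
paths are standard Linux):

```python
import os

def get_boot_time():
    # /proc/stat contains a "btime <seconds since the epoch>" line
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("btime "):
                return int(line.split()[1])
    raise RuntimeError("no btime line in /proc/stat")

def get_process_start_time(pid):
    # field 22 of /proc/<pid>/stat is the start time in clock ticks since boot
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except FileNotFoundError:
        return None
    # the comm field may contain spaces, so split after its closing ')'
    # instead of splitting the whole line (fields[0] is then field 3, "state")
    fields = stat.rsplit(")", 1)[1].split()
    ticks = int(fields[22 - 3])
    return get_boot_time() + ticks // os.sysconf("SC_CLK_TCK")
```

Note that splitting the entire stat line, as in the snippet quoted above, shifts
the indices whenever the process name contains whitespace; splitting after the
closing parenthesis of the comm field avoids that.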

>> +}
>> +
>> +sub get_instance_id {
>> +    my ($pid) = @_;
>> +
>> +    my $starttime = get_process_start_time($pid)
>> +	or die "could not determine start time for process '$pid'\n";
>> +
>> +    return "${pid}:${starttime}";
>> +}
>> +
>> +sub instance_exists {
>> +    my ($instance_id) = @_;
>> +
>> +    if (defined($instance_id) && $instance_id =~ m/^([1-9][0-9]*):(.*)$/) {
>> +	my ($pid, $starttime) = ($1, $2);
>> +	my $actual_starttime = get_process_start_time($pid);
>> +	return defined($actual_starttime) && $starttime eq $actual_starttime;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   sub sync {
>>       my ($param) = @_;
>>   
>> @@ -593,11 +622,18 @@ sub sync {
>>   	eval { $job = get_job($param) };
>>   
>>   	if ($job) {
>> -	    if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
>> +	    my $state = $job->{state} // 'ok';
>> +	    $state = 'ok' if !instance_exists($job->{instance_id});
>> +
>> +	    if ($state eq "syncing" || $state eq "waiting") {
>>   		die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
>>   	    }
>>   
>>   	    $job->{state} = "waiting";
>> +
>> +	    eval { $job->{instance_id} = get_instance_id($$); };
> 
> I'd query and cache the local instance ID from the current process on startup, this
> would have the nice side effect of avoiding error potential here completely
> 

What if querying fails on startup? I'd rather have it be a non-critical 
failure and continue. Then we'd still need a check here to see if the 
cached instance_id is defined.

>> +	    warn "Could not set instance ID - $@" if $@;
>> +
>>   	    update_state($job);
>>   	}
>>       });
>> @@ -671,6 +707,7 @@ sub sync {
>>   		eval { $job = get_job($param); };
>>   		if ($job) {
>>   		    $job->{state} = "error";
>> +		    delete $job->{instance_id};
>>   		    update_state($job);
>>   		}
>>   	    });
>> @@ -687,6 +724,7 @@ sub sync {
>>   		    $job->{state} = "ok";
>>   		}
>>   		$job->{lsync} = $date;
>> +		delete $job->{instance_id};
>>   		update_state($job);
>>   	    }
>>   	});
>>
> 

* Re: [pve-devel] [PATCH zsync] fix #2821: only abort if there really is a waiting/syncing job instance already
From: Thomas Lamprecht @ 2020-12-17  9:23 UTC (permalink / raw)
  To: Fabian Ebner, Proxmox VE development discussion

On 17/12/2020 09:40, Fabian Ebner wrote:
> Am 14.12.20 um 14:47 schrieb Thomas Lamprecht:
>> On 14.12.20 14:00, Fabian Ebner wrote:
>>> @@ -584,6 +586,33 @@ sub destroy_job {
>>>       });
>>>   }
>>>   +sub get_process_start_time {
>>> +    my ($pid) = @_;
>>> +
>>> +    return eval { run_cmd(['ps', '-o', 'lstart=', '-p', "$pid"]); };
>>
>> instead of fork+exec do a much cheaper file read?
>>
>> I.e., copying over file_read_firstline from PVE::Tools then:
>>
>> sub get_process_start_time {
>>      my $stat_str = file_read_firstline("/proc/$pid/stat");
>>      my $stat = [ split(/\s+/, $stat_str) ];
>>
>>      return $stat->[21];
>> }
>>
>> plus some error handling (note I did not test above)
>>
> 
> Agreed, although we also need to obtain the boot time (from /proc/stat) to have the actual start time, because the value in /proc/$pid/stat is just the number of clock ticks since boot when the process was started. But it's still much cheaper of course.

hmm, yeah intra-boot this would not be enough to always tell 100% for sure.
FYI, there you probably could also use `/proc/sys/kernel/random/boot_id`, which
can be read once at program startup.

http://0pointer.de/blog/projects/ids.html (see "Software IDs"),

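That variant could be sketched as follows (illustrative Python; the
`pid:ticks:boot_id` format is a hypothetical composition, not from the patch):

```python
import os

def get_boot_id():
    # regenerated by the kernel on every boot, stable within one boot
    with open("/proc/sys/kernel/random/boot_id") as f:
        return f.read().strip()

def get_start_ticks(pid):
    # start time in clock ticks since boot (field 22 of /proc/<pid>/stat)
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except FileNotFoundError:
        return None
    # split after the comm field's closing ')' so spaces in the name
    # cannot shift the field indices
    return int(stat.rsplit(")", 1)[1].split()[19])

def get_instance_id(pid):
    # ticks since boot can repeat across reboots; qualifying the ID with
    # the boot_id disambiguates without converting to wall-clock time
    ticks = get_start_ticks(pid)
    if ticks is None:
        raise RuntimeError(f"no such process: {pid}")
    return f"{pid}:{ticks}:{get_boot_id()}"
```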

>>>   @@ -593,11 +622,18 @@ sub sync {
>>>       eval { $job = get_job($param) };
>>>         if ($job) {
>>> -        if (defined($job->{state}) && ($job->{state} eq "syncing" || $job->{state} eq "waiting")) {
>>> +        my $state = $job->{state} // 'ok';
>>> +        $state = 'ok' if !instance_exists($job->{instance_id});
>>> +
>>> +        if ($state eq "syncing" || $state eq "waiting") {
>>>           die "Job --source $param->{source} --name $param->{name} is already scheduled to sync\n";
>>>           }
>>>             $job->{state} = "waiting";
>>> +
>>> +        eval { $job->{instance_id} = get_instance_id($$); };
>>
>> I'd query and cache the local instance ID from the current process on startup, this
>> would have the nice side effect of avoiding error potential here completely
>>
> 
> What if querying fails on startup? I'd rather have it be a non-critical failure and continue. Then we'd still need a check here to see if the cached instance_id is defined.

if you make it just read from /proc and that fails, you can assume critical
conditions and abort. If you really do not want to, you can add a singleton
which returns the cached info and, if not available, retries getting it and warns.

my $id_cache;
sub get_local_instance_id {
    return $id_cache if defined($id_cache);
    $id_cache = eval { get_instance_id($$) };
    warn $@ if $@;
    return $id_cache;
}

Albeit, I'd have less hard feelings about caching if getting the ID doesn't
fork or do other rather costly operations.

end of thread, other threads:[~2020-12-17  9:24 UTC | newest]