public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
@ 2021-01-14 15:39 Stefan Reiter
  2021-01-26 18:23 ` Thomas Lamprecht
  0 siblings, 1 reply; 3+ messages in thread
From: Stefan Reiter @ 2021-01-14 15:39 UTC (permalink / raw)
  To: pve-devel

This attempts to solve the issue where on slow network storages,
aborting a backup job (which may wait for buffers to flush) could take
longer than 5 seconds, and would thus result in the task being killed by
SIGKILL, not removing the backup lock in the process.

Make the implementation future-proof by using a map from task type to a
timeout value. Default stays at 5, so tasks other than 'vzdump' are not
affected.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
 src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
index d5b84d0..8a0cb9a 100644
--- a/src/PVE/RESTEnvironment.pm
+++ b/src/PVE/RESTEnvironment.pm
@@ -365,8 +365,16 @@ sub active_workers  {
     return $res;
 }
 
+my $timeout_map = {
+    # backup cancellation on slow target storages might take a while, avoid
+    # leaving the VM in locked state
+    "vzdump" => 60,
+};
+
 my $kill_process_group = sub {
-    my ($pid, $pstart) = @_;
+    my ($pid, $pstart, $timeout) = @_;
+
+    $timeout //= 5;
 
     # send kill to process group (negative pid)
     my $kpid = -$pid;
@@ -374,8 +382,7 @@ my $kill_process_group = sub {
     # always send signal to all pgrp members
     kill(15, $kpid); # send TERM signal
 
-    # give max 5 seconds to shut down
-    for (my $i = 0; $i < 5; $i++) {
+    for (my $i = 0; $i < $timeout; $i++) {
 	return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
 	sleep (1);
     }
@@ -394,7 +401,8 @@ sub check_worker {
     return 0 if !$running;
 
     if ($killit) {
-	&$kill_process_group($task->{pid});
+	my $type = $task->{type};
+	&$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
 	return 0;
     }
 
-- 
2.20.1





^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
  2021-01-14 15:39 [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs Stefan Reiter
@ 2021-01-26 18:23 ` Thomas Lamprecht
  2021-01-27 11:11   ` Stefan Reiter
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Lamprecht @ 2021-01-26 18:23 UTC (permalink / raw)
  To: Proxmox VE development discussion, Stefan Reiter

On 14.01.21 16:39, Stefan Reiter wrote:
> This attempts to solve the issue where on slow network storages,
> aborting a backup job (which may wait for buffers to flush) could take
> longer than 5 seconds, and would thus result in the task being killed by
> SIGKILL, not removing the backup lock in the process.
> 
> Make the implementation future-proof by using a map from task type to a
> timeout value. Default stays at 5, so tasks other than 'vzdump' are not
> affected.
> 
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>  src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 

Not to sure about that map there in pve-common, that module should stay rather
agnostic of user special treatment.

Did you thought about passing that explicitly on worker creation, or setting it
in the RPCEnv inside a worker?

> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> index d5b84d0..8a0cb9a 100644
> --- a/src/PVE/RESTEnvironment.pm
> +++ b/src/PVE/RESTEnvironment.pm
> @@ -365,8 +365,16 @@ sub active_workers  {
>      return $res;
>  }
>  
> +my $timeout_map = {
> +    # backup cancellation on slow target storages might take a while, avoid
> +    # leaving the VM in locked state
> +    "vzdump" => 60,
> +};
> +
>  my $kill_process_group = sub {
> -    my ($pid, $pstart) = @_;
> +    my ($pid, $pstart, $timeout) = @_;
> +
> +    $timeout //= 5;
>  
>      # send kill to process group (negative pid)
>      my $kpid = -$pid;
> @@ -374,8 +382,7 @@ my $kill_process_group = sub {
>      # always send signal to all pgrp members
>      kill(15, $kpid); # send TERM signal
>  
> -    # give max 5 seconds to shut down
> -    for (my $i = 0; $i < 5; $i++) {
> +    for (my $i = 0; $i < $timeout; $i++) {
>  	return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
>  	sleep (1);
>      }
> @@ -394,7 +401,8 @@ sub check_worker {
>      return 0 if !$running;
>  
>      if ($killit) {
> -	&$kill_process_group($task->{pid});
> +	my $type = $task->{type};
> +	&$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
>  	return 0;
>      }
>  
> 






^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
  2021-01-26 18:23 ` Thomas Lamprecht
@ 2021-01-27 11:11   ` Stefan Reiter
  0 siblings, 0 replies; 3+ messages in thread
From: Stefan Reiter @ 2021-01-27 11:11 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion

On 26/01/2021 19:23, Thomas Lamprecht wrote:
> On 14.01.21 16:39, Stefan Reiter wrote:
>> This attempts to solve the issue where on slow network storages,
>> aborting a backup job (which may wait for buffers to flush) could take
>> longer than 5 seconds, and would thus result in the task being killed by
>> SIGKILL, not removing the backup lock in the process.
>>
>> Make the implementation future-proof by using a map from task type to a
>> timeout value. Default stays at 5, so tasks other than 'vzdump' are not
>> affected.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>>   src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
>>   1 file changed, 12 insertions(+), 4 deletions(-)
>>
> 
> Not to sure about that map there in pve-common, that module should stay rather
> agnostic of user special treatment.
> 
> Did you thought about passing that explicitly on worker creation, or setting it
> in the RPCEnv inside a worker?

I generally agree that it's a bit misplaced, but I don't see a way to 
encode it in the worker - the only info we have in check_worker and 
stop_task is the UPID, and I don't think it makes sense to encode a 
timeout in that? Or is there a way I'm not seeing to retrieve additional 
info about a worker from the UPID alone?

We could at least put the map in pve-manager, but I'm not sure if that's 
any better.

> 
>> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
>> index d5b84d0..8a0cb9a 100644
>> --- a/src/PVE/RESTEnvironment.pm
>> +++ b/src/PVE/RESTEnvironment.pm
>> @@ -365,8 +365,16 @@ sub active_workers  {
>>       return $res;
>>   }
>>   
>> +my $timeout_map = {
>> +    # backup cancellation on slow target storages might take a while, avoid
>> +    # leaving the VM in locked state
>> +    "vzdump" => 60,
>> +};
>> +
>>   my $kill_process_group = sub {
>> -    my ($pid, $pstart) = @_;
>> +    my ($pid, $pstart, $timeout) = @_;
>> +
>> +    $timeout //= 5;
>>   
>>       # send kill to process group (negative pid)
>>       my $kpid = -$pid;
>> @@ -374,8 +382,7 @@ my $kill_process_group = sub {
>>       # always send signal to all pgrp members
>>       kill(15, $kpid); # send TERM signal
>>   
>> -    # give max 5 seconds to shut down
>> -    for (my $i = 0; $i < 5; $i++) {
>> +    for (my $i = 0; $i < $timeout; $i++) {
>>   	return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
>>   	sleep (1);
>>       }
>> @@ -394,7 +401,8 @@ sub check_worker {
>>       return 0 if !$running;
>>   
>>       if ($killit) {
>> -	&$kill_process_group($task->{pid});
>> +	my $type = $task->{type};
>> +	&$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
>>   	return 0;
>>       }
>>   
>>
> 
> 




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-27 11:11 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-14 15:39 [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs Stefan Reiter
2021-01-26 18:23 ` Thomas Lamprecht
2021-01-27 11:11   ` Stefan Reiter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal