* [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
@ 2021-01-14 15:39 Stefan Reiter
2021-01-26 18:23 ` Thomas Lamprecht
0 siblings, 1 reply; 3+ messages in thread
From: Stefan Reiter @ 2021-01-14 15:39 UTC (permalink / raw)
To: pve-devel
This attempts to solve the issue where on slow network storages,
aborting a backup job (which may wait for buffers to flush) could take
longer than 5 seconds, and would thus result in the task being killed by
SIGKILL, not removing the backup lock in the process.
Make the implementation future-proof by using a map from task type to a
timeout value. Default stays at 5, so tasks other than 'vzdump' are not
affected.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
---
src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
index d5b84d0..8a0cb9a 100644
--- a/src/PVE/RESTEnvironment.pm
+++ b/src/PVE/RESTEnvironment.pm
@@ -365,8 +365,16 @@ sub active_workers {
return $res;
}
+my $timeout_map = {
+ # backup cancellation on slow target storages might take a while, avoid
+ # leaving the VM in locked state
+ "vzdump" => 60,
+};
+
my $kill_process_group = sub {
- my ($pid, $pstart) = @_;
+ my ($pid, $pstart, $timeout) = @_;
+
+ $timeout //= 5;
# send kill to process group (negative pid)
my $kpid = -$pid;
@@ -374,8 +382,7 @@ my $kill_process_group = sub {
# always send signal to all pgrp members
kill(15, $kpid); # send TERM signal
- # give max 5 seconds to shut down
- for (my $i = 0; $i < 5; $i++) {
+ for (my $i = 0; $i < $timeout; $i++) {
return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
sleep (1);
}
@@ -394,7 +401,8 @@ sub check_worker {
return 0 if !$running;
if ($killit) {
- &$kill_process_group($task->{pid});
+ my $type = $task->{type};
+ &$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
return 0;
}
--
2.20.1
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
2021-01-14 15:39 [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs Stefan Reiter
@ 2021-01-26 18:23 ` Thomas Lamprecht
2021-01-27 11:11 ` Stefan Reiter
0 siblings, 1 reply; 3+ messages in thread
From: Thomas Lamprecht @ 2021-01-26 18:23 UTC (permalink / raw)
To: Proxmox VE development discussion, Stefan Reiter
On 14.01.21 16:39, Stefan Reiter wrote:
> This attempts to solve the issue where on slow network storages,
> aborting a backup job (which may wait for buffers to flush) could take
> longer than 5 seconds, and would thus result in the task being killed by
> SIGKILL, not removing the backup lock in the process.
>
> Make the implementation future-proof by using a map from task type to a
> timeout value. Default stays at 5, so tasks other than 'vzdump' are not
> affected.
>
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
> src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
Not to sure about that map there in pve-common, that module should stay rather
agnostic of user special treatment.
Did you thought about passing that explicitly on worker creation, or setting it
in the RPCEnv inside a worker?
> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
> index d5b84d0..8a0cb9a 100644
> --- a/src/PVE/RESTEnvironment.pm
> +++ b/src/PVE/RESTEnvironment.pm
> @@ -365,8 +365,16 @@ sub active_workers {
> return $res;
> }
>
> +my $timeout_map = {
> + # backup cancellation on slow target storages might take a while, avoid
> + # leaving the VM in locked state
> + "vzdump" => 60,
> +};
> +
> my $kill_process_group = sub {
> - my ($pid, $pstart) = @_;
> + my ($pid, $pstart, $timeout) = @_;
> +
> + $timeout //= 5;
>
> # send kill to process group (negative pid)
> my $kpid = -$pid;
> @@ -374,8 +382,7 @@ my $kill_process_group = sub {
> # always send signal to all pgrp members
> kill(15, $kpid); # send TERM signal
>
> - # give max 5 seconds to shut down
> - for (my $i = 0; $i < 5; $i++) {
> + for (my $i = 0; $i < $timeout; $i++) {
> return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
> sleep (1);
> }
> @@ -394,7 +401,8 @@ sub check_worker {
> return 0 if !$running;
>
> if ($killit) {
> - &$kill_process_group($task->{pid});
> + my $type = $task->{type};
> + &$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
> return 0;
> }
>
>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs
2021-01-26 18:23 ` Thomas Lamprecht
@ 2021-01-27 11:11 ` Stefan Reiter
0 siblings, 0 replies; 3+ messages in thread
From: Stefan Reiter @ 2021-01-27 11:11 UTC (permalink / raw)
To: Thomas Lamprecht, Proxmox VE development discussion
On 26/01/2021 19:23, Thomas Lamprecht wrote:
> On 14.01.21 16:39, Stefan Reiter wrote:
>> This attempts to solve the issue where on slow network storages,
>> aborting a backup job (which may wait for buffers to flush) could take
>> longer than 5 seconds, and would thus result in the task being killed by
>> SIGKILL, not removing the backup lock in the process.
>>
>> Make the implementation future-proof by using a map from task type to a
>> timeout value. Default stays at 5, so tasks other than 'vzdump' are not
>> affected.
>>
>> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
>> ---
>> src/PVE/RESTEnvironment.pm | 16 ++++++++++++----
>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>
>
> Not to sure about that map there in pve-common, that module should stay rather
> agnostic of user special treatment.
>
> Did you thought about passing that explicitly on worker creation, or setting it
> in the RPCEnv inside a worker?
I generally agree that it's a bit misplaced, but I don't see a way to
encode it in the worker - the only info we have in check_worker and
stop_task is the UPID, and I don't think it makes sense to encode a
timeout in that? Or is there a way I'm not seeing to retrieve additional
info about a worker from the UPID alone?
We could at least put the map in pve-manager, but I'm not sure if that's
any better.
>
>> diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
>> index d5b84d0..8a0cb9a 100644
>> --- a/src/PVE/RESTEnvironment.pm
>> +++ b/src/PVE/RESTEnvironment.pm
>> @@ -365,8 +365,16 @@ sub active_workers {
>> return $res;
>> }
>>
>> +my $timeout_map = {
>> + # backup cancellation on slow target storages might take a while, avoid
>> + # leaving the VM in locked state
>> + "vzdump" => 60,
>> +};
>> +
>> my $kill_process_group = sub {
>> - my ($pid, $pstart) = @_;
>> + my ($pid, $pstart, $timeout) = @_;
>> +
>> + $timeout //= 5;
>>
>> # send kill to process group (negative pid)
>> my $kpid = -$pid;
>> @@ -374,8 +382,7 @@ my $kill_process_group = sub {
>> # always send signal to all pgrp members
>> kill(15, $kpid); # send TERM signal
>>
>> - # give max 5 seconds to shut down
>> - for (my $i = 0; $i < 5; $i++) {
>> + for (my $i = 0; $i < $timeout; $i++) {
>> return if !PVE::ProcFSTools::check_process_running($pid, $pstart);
>> sleep (1);
>> }
>> @@ -394,7 +401,8 @@ sub check_worker {
>> return 0 if !$running;
>>
>> if ($killit) {
>> - &$kill_process_group($task->{pid});
>> + my $type = $task->{type};
>> + &$kill_process_group($task->{pid}, undef, $timeout_map->{$type});
>> return 0;
>> }
>>
>>
>
>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-01-27 11:11 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-14 15:39 [pve-devel] [PATCH common] allow longer timeout for cancelling 'vzdump' jobs Stefan Reiter
2021-01-26 18:23 ` Thomas Lamprecht
2021-01-27 11:11 ` Stefan Reiter
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox