public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout
@ 2023-01-19 12:39 Friedrich Weber
  2023-01-25  8:25 ` Wolfgang Bumiller
  2023-02-17 13:59 ` [pve-devel] applied: " Wolfgang Bumiller
  0 siblings, 2 replies; 4+ messages in thread
From: Friedrich Weber @ 2023-01-19 12:39 UTC (permalink / raw)
  To: pve-devel

When trying to shut down a hung container with `forceStop=0` (e.g. via
the Web UI), the shutdown task may run indefinitely while holding a
lock on the container config. The reason is that the shutdown
subroutine waits for the LXC command socket to close, even if the
`lxc-stop` command has failed due to a timeout. This prevents other
tasks (such as a stop task) from acquiring the lock. In order to stop
the container, the shutdown task has to be explicitly killed first,
which is inconvenient. This occurs e.g. when trying to shut down a hung
CentOS 7 container (with systemd <v232) in a cgroupv2 environment.

This fix imposes a timeout on the socket read operation if the
`lxc-stop` command has failed. Behavior in case `lxc-stop` succeeds is
unchanged. This reintroduces some code from b1bad293. The timeout
duration is the given shutdown timeout, meaning that the final task
duration in the scenario above is twice the shutdown timeout.

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---

I stumbled upon the hanging CentOS 7 container shutdown task while
looking into #4474. However, it is quite the edge case and only
slightly inconvenient, so I'm not sure whether it needs to be
addressed -- and if it needs to be addressed, I'm not sure whether the
attached fix is the way to go. :) So I'm submitting it as an RFC. Let
me know what you think.

 src/PVE/LXC.pm | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
index ce6d5a5..9b3cd64 100644
--- a/src/PVE/LXC.pm
+++ b/src/PVE/LXC.pm
@@ -2473,11 +2473,21 @@ sub vm_stop {
     }
 
     eval { run_command($cmd, timeout => $shutdown_timeout) };
+
+    my $result = 1;
+    my $wait = sub { $result = <$sock>; };
+
+    # Wait until the command socket is closed.
+    # In case the lxc-stop call failed, reading from the command socket may block forever,
+    # so read with another timeout to avoid freezing the shutdown task.
     if (my $err = $@) {
-	warn $@ if $@;
-    }
+	warn $err if $err;
 
-    my $result = <$sock>;
+	eval { PVE::Tools::run_with_timeout($shutdown_timeout, $wait); };
+	warn "read from command socket failed: $@" if $@;
+    } else {
+	$wait->();
+    }
 
     return if !defined $result; # monitor is gone and the ct has stopped.
     die "container did not stop\n";
-- 
2.30.2






* Re: [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout
  2023-01-19 12:39 [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout Friedrich Weber
@ 2023-01-25  8:25 ` Wolfgang Bumiller
  2023-01-25 12:19   ` Friedrich Weber
  2023-02-17 13:59 ` [pve-devel] applied: " Wolfgang Bumiller
  1 sibling, 1 reply; 4+ messages in thread
From: Wolfgang Bumiller @ 2023-01-25  8:25 UTC (permalink / raw)
  To: Friedrich Weber; +Cc: pve-devel

On Thu, Jan 19, 2023 at 01:39:02PM +0100, Friedrich Weber wrote:
> When trying to shut down a hung container with `forceStop=0` (e.g. via
> the Web UI), the shutdown task may run indefinitely while holding a
> lock on the container config. The reason is that the shutdown
> subroutine waits for the LXC command socket to close, even if the
> `lxc-stop` command has failed due to a timeout. This prevents other
> tasks (such as a stop task) from acquiring the lock. In order to stop
> the container, the shutdown task has to be explicitly killed first,
> which is inconvenient. This occurs e.g. when trying to shut down a hung
> CentOS 7 container (with systemd <v232) in a cgroupv2 environment.
> 
> This fix imposes a timeout on the socket read operation if the
> `lxc-stop` command has failed. Behavior in case `lxc-stop` succeeds is
> unchanged. This reintroduces some code from b1bad293. The timeout
> duration is the given shutdown timeout, meaning that the final task
> duration in the scenario above is twice the shutdown timeout.
> 
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
> 
> I stumbled upon the hanging CentOS 7 container shutdown task while
> looking into #4474. However, it is quite the edge case and only
> slightly inconvenient, so I'm not sure whether it needs to be
> addressed -- and if it needs to be addressed, I'm not sure whether the
> attached fix is the way to go. :) So I'm submitting it as an RFC. Let
> me know what you think.
> 
>  src/PVE/LXC.pm | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> index ce6d5a5..9b3cd64 100644
> --- a/src/PVE/LXC.pm
> +++ b/src/PVE/LXC.pm
> @@ -2473,11 +2473,21 @@ sub vm_stop {
>      }
>  
>      eval { run_command($cmd, timeout => $shutdown_timeout) };
> +
> +    my $result = 1;
> +    my $wait = sub { $result = <$sock>; };
> +
> +    # Wait until the command socket is closed.
> +    # In case the lxc-stop call failed, reading from the command socket may block forever,
> +    # so read with another timeout to avoid freezing the shutdown task.
>      if (my $err = $@) {
> -	warn $@ if $@;
> -    }
> +	warn $err if $err;
>  
> -    my $result = <$sock>;
> +	eval { PVE::Tools::run_with_timeout($shutdown_timeout, $wait); };

The general approach is fine, but `run_with_timeout` uses SIGALRM and
messes with signal handlers, which is rather inelegant for such a thing.
We should limit its use to cases where we have no other option (mainly
file-locking).

For this case we can just use IO::Poll like:

    use IO::Poll qw(POLLIN POLLHUP);

    my $poll = IO::Poll->new();
    $poll->mask($sock => POLLIN | POLLHUP); # watch for input & EOF
    $poll->poll($shutdown_timeout);

If the socket was closed, then `$poll->events($sock)` should contain the
`POLLHUP` bit.
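
A minimal, self-contained sketch of this poll-with-timeout idea, not the
applied patch: it uses a `socketpair` as a stand-in for the LXC command
socket, and `wait_for_close` is a hypothetical helper name. It reads the
result via `$poll->events($sock)`, which reports the events seen by the
last `poll` call, and it assumes Linux behavior, where a closed peer on a
UNIX stream socket raises `POLLHUP`.

```perl
use strict;
use warnings;
use IO::Poll qw(POLLIN POLLHUP);
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# Hypothetical helper (not part of PVE): returns 1 if the peer of $sock
# hung up within $timeout seconds, 0 if the wait timed out.
sub wait_for_close {
    my ($sock, $timeout) = @_;

    my $poll = IO::Poll->new();
    $poll->mask($sock => POLLIN | POLLHUP); # watch for input and hang-up
    $poll->poll($timeout);                  # blocks for at most $timeout seconds

    # events() holds what the last poll() call observed on this handle.
    return ($poll->events($sock) & POLLHUP) ? 1 : 0;
}

# Stand-in for the command socket: a connected pair of UNIX stream sockets.
socketpair(my $ours, my $theirs, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair failed: $!\n";

my $open_result = wait_for_close($ours, 0.2);   # peer still open: times out
close($theirs);
my $closed_result = wait_for_close($ours, 0.2); # peer closed: hang-up seen

print "peer open: $open_result, peer closed: $closed_result\n";
```

IO::Poll takes its timeout in seconds, so a value like
`$shutdown_timeout` could be passed through unchanged.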





* Re: [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout
  2023-01-25  8:25 ` Wolfgang Bumiller
@ 2023-01-25 12:19   ` Friedrich Weber
  0 siblings, 0 replies; 4+ messages in thread
From: Friedrich Weber @ 2023-01-25 12:19 UTC (permalink / raw)
  To: Wolfgang Bumiller; +Cc: pve-devel

On 25/01/2023 09:25, Wolfgang Bumiller wrote:
> The general approach is fine, but `run_with_timeout` uses SIGALRM and
> messes with signal handlers which is rather inelegant for such a thing,
> we should limit its use to when we have no other option (mainly
> file-locking).
>
> For this case we can just use IO::Poll like:
>
>      my $poll = IO::Poll->new();
>      $poll->mask($sock => POLLIN | POLLHUP); # watch for input & EOF
>      $poll->poll($shutdown_timeout);
>
> If the socket was closed, then `$poll->events($sock)` should contain the
> `POLLHUP` bit.

Thanks for the suggestion, looks much nicer! I'll send a new version of 
the patch.






* [pve-devel] applied: [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout
  2023-01-19 12:39 [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout Friedrich Weber
  2023-01-25  8:25 ` Wolfgang Bumiller
@ 2023-02-17 13:59 ` Wolfgang Bumiller
  1 sibling, 0 replies; 4+ messages in thread
From: Wolfgang Bumiller @ 2023-02-17 13:59 UTC (permalink / raw)
  To: Friedrich Weber; +Cc: pve-devel

applied

We may want to handle the timeout better in the future (since we now
use the same timeout duration twice), but some earlier calls are not
counted against it either. The hook scripts are one example, though I'm
not sure those *should* be included. In theory it could also take time
before *those* for lxc to `accept()` the connection, so the whole thing
is split into various phases.

There is still some room for improvement, but this is a rare case
anyway. Maybe we should simply measure the time elapsed across the
`run_command` call and subtract that from the timeout...
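
The subtraction idea above could look roughly like this. This is only a
sketch: `remaining_timeout` is a hypothetical helper, the `sleep` stands
in for the `run_command` call, and wall-clock `time` is used for
simplicity (a monotonic clock would be more robust against clock jumps).

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: seconds left of $timeout after the span from
# $start to $now has elapsed, clamped so it never goes negative.
sub remaining_timeout {
    my ($timeout, $start, $now) = @_;
    my $left = $timeout - ($now - $start);
    return $left > 0 ? $left : 0;
}

my $shutdown_timeout = 60;

my $start = time();
sleep(0.1); # stand-in for: run_command($cmd, timeout => $shutdown_timeout)
my $left = remaining_timeout($shutdown_timeout, $start, time());

# $left, rather than the full $shutdown_timeout, would then bound the
# wait on the command socket.
printf "remaining budget: %.1fs\n", $left;
```

With this, the whole task would stay within roughly one shutdown timeout
instead of two, at the cost of leaving no extra wait time at all when
`lxc-stop` itself used up the full budget.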





