Date: Fri, 17 Feb 2023 14:59:30 +0100
From: Wolfgang Bumiller
To: Friedrich Weber
Cc: pve-devel@lists.proxmox.com
Message-ID: <20230217135930.y3btcekqwolrw73g@fwblub>
In-Reply-To: <20230119123902.745440-1-f.weber@proxmox.com>
Subject: [pve-devel] applied: [RFC container] fix: shutdown: if lxc-stop fails, wait for socket closing with timeout

applied

We may want to somehow deal with the timeout better in the future (since
we use the same timeout duration twice now), but there are already some
calls before that aren't included in the time either (the hook scripts
for one - I'm not sure those *should* be included, but also in theory
before *those* it could take time for lxc to `accept()` the connection
as well, so the whole thing is split into various phases...)

Still - some room for improvement is left, but this is a rare case
anyway. Maybe we should simply take the time difference from before and
after the `run_command` and subtract that from the timeout...
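Something like this untested sketch, reusing $cmd, $sock, $wait and
$shutdown_timeout from the patched vm_stop below (the 1-second floor is
an arbitrary choice, not anything the patch mandates):

    # measure how long lxc-stop actually ran and only grant the socket
    # read whatever is left of the shutdown timeout, so the worst-case
    # task duration stays close to one $shutdown_timeout instead of two
    my $start_time = time();
    eval { run_command($cmd, timeout => $shutdown_timeout) };
    my $elapsed = time() - $start_time;

    if (my $err = $@) {
        warn $err if $err;

        my $remaining = $shutdown_timeout - $elapsed;
        $remaining = 1 if $remaining < 1; # always give the read a moment
        eval { PVE::Tools::run_with_timeout($remaining, $wait); };
        warn "read from command socket failed: $@" if $@;
    } else {
        $wait->();
    }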
On Thu, Jan 19, 2023 at 01:39:02PM +0100, Friedrich Weber wrote:
> When trying to shutdown a hung container with `forceStop=0` (e.g. via
> the Web UI), the shutdown task may run indefinitely while holding a
> lock on the container config. The reason is that the shutdown
> subroutine waits for the LXC command socket to close, even if the
> `lxc-stop` command has failed due to timeout. This prevents other
> tasks (such as a stop task) from acquiring the lock. In order to stop
> the container, the shutdown task has to be explicitly killed first,
> which is inconvenient. This occurs e.g. when trying to shutdown a hung
> CentOS 7 container (with systemd
>
> This fix imposes a timeout on the socket read operation if the
> `lxc-stop` command has failed. Behavior in case `lxc-stop` succeeds is
> unchanged. This reintroduces some code from b1bad293. The timeout
> duration is the given shutdown timeout, meaning that the final task
> duration in the scenario above is twice the shutdown timeout.
>
> Signed-off-by: Friedrich Weber
> ---
>
> I stumbled upon the hanging CentOS 7 container shutdown task while
> looking into #4474. However, it is quite the edge case and only
> slightly inconvenient, so I'm not sure whether it needs to be
> addressed -- and if it needs to be addressed, I'm not sure whether the
> attached fix is the way to go. :) So I'm submitting it as an RFC. Let
> me know what you think.
>
>  src/PVE/LXC.pm | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> index ce6d5a5..9b3cd64 100644
> --- a/src/PVE/LXC.pm
> +++ b/src/PVE/LXC.pm
> @@ -2473,11 +2473,21 @@ sub vm_stop {
>      }
>  
>      eval { run_command($cmd, timeout => $shutdown_timeout) };
> +
> +    my $result = 1;
> +    my $wait = sub { $result = <$sock>; };
> +
> +    # Wait until the command socket is closed.
> +    # In case the lxc-stop call failed, reading from the command socket may block forever,
> +    # so read with another timeout to avoid freezing the shutdown task.
>      if (my $err = $@) {
> -        warn $@ if $@;
> -    }
> +        warn $err if $err;
>  
> -    my $result = <$sock>;
> +        eval { PVE::Tools::run_with_timeout($shutdown_timeout, $wait); };
> +        warn "read from command socket failed: $@" if $@;
> +    } else {
> +        $wait->();
> +    }
>  
>      return if !defined $result; # monitor is gone and the ct has stopped.
>      die "container did not stop\n";
> -- 
> 2.30.2
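For anyone who wants to reproduce this, the same code path can also be
hit from the CLI (the vmid is just an example; `forceStop` already
defaults to 0):

    # without the fix this task can hang forever on an unresponsive CT;
    # with it, the task fails after at most twice the given timeout
    pct shutdown 100 --forceStop 0 --timeout 60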