From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.weber@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id D6F4495A95
 for <pve-devel@lists.proxmox.com>; Thu, 19 Jan 2023 13:39:39 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id B81BE2BFEB
 for <pve-devel@lists.proxmox.com>; Thu, 19 Jan 2023 13:39:39 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Thu, 19 Jan 2023 13:39:39 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id DDCC4446F2
 for <pve-devel@lists.proxmox.com>; Thu, 19 Jan 2023 13:39:38 +0100 (CET)
From: Friedrich Weber <f.weber@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Thu, 19 Jan 2023 13:39:02 +0100
Message-Id: <20230119123902.745440-1-f.weber@proxmox.com>
X-Mailer: git-send-email 2.30.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See
 http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more
 information. [lxc.pm]
Subject: [pve-devel] [RFC container] fix: shutdown: if lxc-stop fails,
 wait for socket closing with timeout
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Thu, 19 Jan 2023 12:39:39 -0000

When trying to shut down a hung container with `forceStop=0` (e.g. via
the Web UI), the shutdown task may run indefinitely while holding a
lock on the container config. The reason is that the shutdown
subroutine waits for the LXC command socket to close, even if the
`lxc-stop` command has failed due to a timeout. This prevents other
tasks (such as a stop task) from acquiring the lock. In order to stop
the container, the shutdown task has to be explicitly killed first,
which is inconvenient. This occurs e.g. when trying to shut down a hung
CentOS 7 container (with systemd < v232) in a cgroupv2 environment.

This fix imposes a timeout on the socket read operation if the
`lxc-stop` command has failed. Behavior in case `lxc-stop` succeeds is
unchanged. This reintroduces some code from b1bad293. The timeout
duration is the given shutdown timeout, meaning that the final task
duration in the scenario above is twice the shutdown timeout.

Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---

I stumbled upon the hanging CentOS 7 container shutdown task while
looking into #4474. However, it is quite the edge case and only
slightly inconvenient, so I'm not sure whether it needs to be
addressed -- and if it needs to be addressed, I'm not sure whether the
attached fix is the way to go. :) So I'm submitting it as an RFC. Let
me know what you think.
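
To illustrate the idea outside of PVE, here is a minimal standalone
sketch of a timeout-bounded read on a socket. It is not the patched
code: `with_timeout` is a hypothetical alarm-based stand-in for
`PVE::Tools::run_with_timeout`, `socket_closed_within` is a made-up
helper name, and a socketpair stands in for the LXC command socket.

```perl
use strict;
use warnings;
use Socket;

# Hypothetical stand-in for PVE::Tools::run_with_timeout (assumption:
# a plain alarm()-based helper, not the real implementation).
sub with_timeout {
    my ($timeout, $code) = @_;
    local $SIG{ALRM} = sub { die "got timeout\n" };
    my $res = eval {
        alarm($timeout);
        my $r = $code->();
        alarm(0);
        $r;
    };
    my $err = $@;
    alarm(0); # make sure no alarm is left pending
    die $err if $err;
    return $res;
}

# Returns 1 if the peer closed the socket (EOF) within $timeout seconds,
# 0 otherwise. Mirrors the "$result stays 1 unless EOF" logic of the patch.
sub socket_closed_within {
    my ($sock, $timeout) = @_;
    my $result = 1;
    eval { with_timeout($timeout, sub { $result = <$sock>; }) };
    warn "read from command socket failed: $@" if $@;
    return defined($result) ? 0 : 1;
}

# A socketpair stands in for the LXC command socket (assumption).
socketpair(my $sock, my $peer, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair failed: $!\n";

# Peer still open: the read blocks and is aborted by the timeout.
print "closed: ", socket_closed_within($sock, 1), "\n";

# Peer closed: the read sees EOF right away.
close($peer);
print "closed: ", socket_closed_within($sock, 1), "\n";
```

In the patch, the bounded read is only used if `lxc-stop` has failed;
if it succeeds, the read stays unbounded as before.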

 src/PVE/LXC.pm | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
index ce6d5a5..9b3cd64 100644
--- a/src/PVE/LXC.pm
+++ b/src/PVE/LXC.pm
@@ -2473,11 +2473,21 @@ sub vm_stop {
     }
 
     eval { run_command($cmd, timeout => $shutdown_timeout) };
+
+    my $result = 1;
+    my $wait = sub { $result = <$sock>; };
+
+    # Wait until the command socket is closed.
+    # In case the lxc-stop call failed, reading from the command socket may block forever,
+    # so read with another timeout to avoid freezing the shutdown task.
     if (my $err = $@) {
-	warn $@ if $@;
-    }
+	warn $err;
 
-    my $result = <$sock>;
+	eval { PVE::Tools::run_with_timeout($shutdown_timeout, $wait); };
+	warn "read from command socket failed: $@" if $@;
+    } else {
+	$wait->();
+    }
 
     return if !defined $result; # monitor is gone and the ct has stopped.
     die "container did not stop\n";
-- 
2.30.2