From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.ebner@proxmox.com>
From: Fabian Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Tue,  8 Sep 2020 13:58:43 +0200
Message-Id: <20200908115843.345-2-f.ebner@proxmox.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20200908115843.345-1-f.ebner@proxmox.com>
References: <20200908115843.345-1-f.ebner@proxmox.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [pve-devel] [PATCH v2 container 2/2] Improve feedback for startup
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>

Since the systemd service had to be switched to 'Type=simple' (see
commit 545d6f0a13ac2bf3a8d3f224c19c0e0def12116d), 'systemctl start' no
longer waits for the 'lxc-start' command to finish. As a result, every
container start was reported as a success and the 'post-start' hook
was triggered immediately after 'systemctl start' returned.

Use the monitor socket to track the container's state and detect
startup failures, and only run the 'post-start' hookscript once the
container is actually running. If something goes wrong with the
monitor socket, for example because lxc-monitord is not running, fall
back to the old behavior.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
---

Changes from v1:
    * use monitor socket directly instead of forking off an lxc-monitor process
    * use run_with_timeout helper
    * warn instead of die on unexpected message
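
For reviewers, the decision logic described in the commit message can be
sketched in isolation. This is a minimal, self-contained illustration and
not the code from the patch: the string state names and the helpers
wait_for_start, start_outcome and replay are hypothetical stand-ins (the
real constants and socket helpers live in PVE::LXC::Monitor), and the
10-second timeout is left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical state names, for illustration only. The real values are
# defined in PVE::LXC::Monitor and are not reproduced here.
use constant {
    STATE_STARTING => 'STARTING',
    STATE_RUNNING  => 'RUNNING',
    STATE_STOPPING => 'STOPPING',
    STATE_STOPPED  => 'STOPPED',
    STATE_ABORTING => 'ABORTING',
};

# Returns 1 once the container reports RUNNING, 0 on a failure state,
# and dies if the message source ends before a decision can be made.
sub wait_for_start {
    my ($vmid, $next_message) = @_;

    while (1) {
        my ($type, $name, $value) = $next_message->();
        die "monitor socket EOF\n" if !defined($type);

        # Skip messages for other containers and non-state messages.
        next if $name ne "$vmid" || $type ne 'STATE';

        return 1 if $value eq STATE_RUNNING;
        return 0 if $value eq STATE_ABORTING
            || $value eq STATE_STOPPING
            || $value eq STATE_STOPPED;
        # STATE_STARTING (or anything else): keep waiting.
    }
}

# Maps the result to one of three outcomes. A problem with the message
# source is not treated as a startup failure (the fallback behavior).
sub start_outcome {
    my ($vmid, $next_message) = @_;

    my $success = eval { wait_for_start($vmid, $next_message) };
    return 'unknown' if $@;
    return $success ? 'running' : 'failed';
}

# Hypothetical stand-in for the monitor socket: replays a fixed list of
# [type, name, value] messages and returns an empty list when exhausted.
sub replay {
    my @messages = @_;
    return sub {
        my $msg = shift @messages;
        return $msg ? @$msg : ();
    };
}

my $outcome = start_outcome(100, replay(
    ['STATE', '101', 'STOPPED'],   # other container, ignored
    ['STATE', '100', 'STARTING'],
    ['STATE', '100', 'RUNNING'],
));
print "container 100: $outcome\n";
```

Only the 'failed' outcome corresponds to the die in vm_start; 'unknown'
corresponds to the warning followed by "continuing anyway".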

 src/PVE/LXC.pm | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
index db5b8ca..370adda 100644
--- a/src/PVE/LXC.pm
+++ b/src/PVE/LXC.pm
@@ -32,6 +32,7 @@ use PVE::LXC::Config;
 use PVE::GuestHelpers qw(safe_string_ne safe_num_ne safe_boolean_ne);
 use PVE::LXC::Tools;
 use PVE::LXC::CGroup;
+use PVE::LXC::Monitor;
 
 use Time::HiRes qw (gettimeofday);
 my $have_sdn;
@@ -2191,10 +2192,47 @@ sub vm_start {
 
     PVE::Storage::activate_volumes($storage_cfg, $vollist);
 
+    my $monitor_socket = eval { PVE::LXC::Monitor::get_monitor_socket(); };
+    warn $@ if $@;
+
+    my $monitor_state_change = sub {
+	die "no monitor socket" if !defined($monitor_socket);
+
+	while (1) {
+	    my ($type, $name, $value) = PVE::LXC::Monitor::read_lxc_message($monitor_socket);
+
+	    die "monitor socket EOF" if !defined($type);
+
+	    next if $name ne "$vmid" || $type ne 'STATE';
+
+	    if ($value eq PVE::LXC::Monitor::STATE_STARTING) {
+		alarm(0); # don't time out after seeing the starting state
+	    } elsif ($value eq PVE::LXC::Monitor::STATE_ABORTING ||
+		     $value eq PVE::LXC::Monitor::STATE_STOPPING ||
+		     $value eq PVE::LXC::Monitor::STATE_STOPPED) {
+		return 0;
+	    } elsif ($value eq PVE::LXC::Monitor::STATE_RUNNING) {
+		return 1;
+	    } else {
+		warn "unexpected message from monitor socket - " .
+		     "type: '$type' - value: '$value'\n";
+	    }
+	}
+    };
+
     my $cmd = ['systemctl', 'start', "pve-container\@$vmid"];
 
     PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'pre-start', 1);
-    eval { PVE::Tools::run_command($cmd); };
+    eval {
+	PVE::Tools::run_command($cmd);
+
+	my $success = eval { PVE::Tools::run_with_timeout(10, $monitor_state_change); };
+	if (my $err = $@) {
+	    warn "problem with monitor socket: $err - continuing anyway\n";
+	} elsif (!$success) {
+	    die "startup for container '$vmid' failed\n";
+	}
+    };
     if (my $err = $@) {
 	unlink $skiplock_flag_fn;
 	die $err;
-- 
2.20.1