From: Daniel Tschlatscher
To: pve-devel@lists.proxmox.com
Date: Thu, 5 Jan 2023 11:08:34 +0100
Message-Id: <20230105100837.195520-4-d.tschlatscher@proxmox.com>
In-Reply-To: <20230105100837.195520-1-d.tschlatscher@proxmox.com>
References: <20230105100837.195520-1-d.tschlatscher@proxmox.com>
Subject: [pve-devel] [PATCH qemu-server v4 3/6] await and kill lingering KVM thread when VM start reaches timeout

In some cases, the VM API start method could return before the detached
KVM process had exited. This is especially problematic with HA, because
the HA manager would think the VM started successfully, later see that
it had exited, and start it again, in an endless loop.

A second case occurs when resuming a hibernated VM. Here, the QEMU
process attempts to load the whole vmstate into memory before exiting.
Depending on vmstate size, disk read speed, and similar factors, this
can take quite a while, and the VM cannot be started normally during
that time.

To get around this, this patch intercepts the start error, checks
whether a corresponding KVM process is still running, and waits up to
300 seconds for it to exit, killing it with SIGKILL once the timeout is
reached, before propagating the original error.

Signed-off-by: Daniel Tschlatscher
---
Changes from v3:
* Minor code clean up concerning the usage of "$pid" in ifs according
  to Fabian's suggestion

 PVE/QemuServer.pm | 38 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index 2a4bc75..549d666 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -5881,15 +5881,39 @@ sub vm_start_nolock {
                 $tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
             }
 
-            my $exitcode = run_command($cmd, %run_params);
-            if ($exitcode) {
-                if ($tpmpid) {
-                    warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
-                    kill 'TERM', $tpmpid;
+            eval {
+                my $exitcode = run_command($cmd, %run_params);
+
+                if ($exitcode) {
+                    if ($tpmpid) {
+                        log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
+                        kill 'TERM', $tpmpid;
+                    }
+                    die "QEMU exited with code $exitcode\n";
                 }
-                die "QEMU exited with code $exitcode\n";
+            };
+
+            if (my $err = $@) {
+                if (my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
+                    my $count = 0;
+                    my $timeout = 300;
+
+                    print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
+                    while (($count < $timeout) &&
+                        PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
+                        $count++;
+                        sleep(1);
+                    }
+
+                    if ($count >= $timeout) {
+                        log_warn "Reached timeout. Terminating now with SIGKILL\n";
+                        kill(9, $pid) if PVE::QemuServer::Helpers::vm_running_locally($vmid) eq $pid;
+                    }
+                }
+
+                die $err;
             }
-        };
+        }
     };
 
     if ($conf->{hugepages}) {
-- 
2.30.2
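
For readers who want to try the wait-then-kill pattern from the hunk above outside of qemu-server, here is a minimal standalone sketch. It is not the patch's code: `wait_or_kill` and its liveness callback are hypothetical names introduced for illustration, and the demo polls a forked child with `waitpid(..., WNOHANG)` where the patch uses `PVE::QemuServer::Helpers::vm_running_locally`. Unlike the patch, the sketch re-checks liveness after the loop instead of comparing the counter against the timeout.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper (not part of qemu-server): poll $is_running up to
# $timeout times, sleeping $interval seconds between polls. If the
# process is still running afterwards, send SIGKILL.
# Returns 'exited' or 'killed'.
sub wait_or_kill {
    my ($pid, $timeout, $is_running, $interval) = @_;
    $interval //= 1;

    my $count = 0;
    while ($count < $timeout && $is_running->($pid)) {
        $count++;
        sleep($interval);
    }

    # Re-check liveness rather than relying on the counter alone, so a
    # process that exited during the final sleep is not signalled.
    if ($is_running->($pid)) {
        kill('KILL', $pid);
        return 'killed';
    }
    return 'exited';
}

# Demo: a child process that would outlive the short timeout.
my $child = fork() // die "fork failed: $!\n";
if ($child == 0) {
    sleep(60);
    exit(0);
}

# Assumption for the demo: the process is a direct child, so waitpid
# with WNOHANG returns 0 for as long as it is still running.
my $alive = sub { waitpid($_[0], WNOHANG) == 0 };

my $result = wait_or_kill($child, 2, $alive);
waitpid($child, 0);    # reap the child if it was killed above
print "$result\n";
```

Passing the liveness check as a callback keeps the loop independent of how "still running" is determined, which is the part that differs most between this demo and a detached QEMU process.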