Date: Wed, 21 Dec 2022 12:14:41 +0100
From: Fabian Grünbichler
To: Proxmox VE development discussion
Message-Id: <1671620408.1e4qgx6uw7.astroid@yuna.none>
In-Reply-To: <20221216133655.510957-4-d.tschlatscher@proxmox.com>
Subject: Re: [pve-devel] [PATCH qemu-server v3 3/5] await and kill lingering KVM thread when VM start reaches timeout

On December 16, 2022 2:36 pm, Daniel Tschlatscher wrote:
> In some cases the VM API start method would return before the detached
> KVM process had exited. This is especially problematic with HA, because
> the HA manager would think the VM started successfully, later see that
> it exited, and start it again in an endless loop.
>
> Moreover, another case exists when resuming a hibernated VM. In this
> case, the qemu thread will attempt to load the whole vmstate into
> memory before exiting. Depending on vmstate size, disk read speed, and
> similar factors, this can take quite a while, and it is not possible to
> start the VM normally during this time.
>
> To get around this, this patch intercepts the error, checks whether a
> corresponding KVM thread is still running, and waits for/kills it
> before continuing.
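
a side note for context: the "is KVM still running" check used below is
PVE::QemuServer::Helpers::vm_running_locally. roughly, such a check can
look like the following untested sketch (the PID file path and the
/proc-based validation are assumptions here, the real helper differs in
its details):

    use strict;
    use warnings;

    # hypothetical stand-in for the real helper: returns the PID if a
    # matching kvm/qemu process is alive, "" otherwise (assumed PID file
    # location, the real implementation differs)
    sub vm_running_locally_sketch {
        my ($vmid) = @_;

        my $pidfile = "/var/run/qemu-server/$vmid.pid";
        open(my $fh, '<', $pidfile) or return ""; # no PID file -> not running
        chomp(my $pid = <$fh> // '');
        close($fh);
        return "" if $pid !~ /^\d+$/;

        # guard against PID reuse: check the process actually looks like qemu
        open(my $cmd_fh, '<', "/proc/$pid/cmdline") or return "";
        my $cmdline = do { local $/; <$cmd_fh> };
        close($cmd_fh);
        return "" if !$cmdline || $cmdline !~ /kvm|qemu/;

        return $pid;
    }

the important part is validating that the PID still belongs to the
expected process, since PIDs can be recycled - which is also what the
re-check suggested further down is about.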
>
> Signed-off-by: Daniel Tschlatscher
> ---
>
> Changes from v2:
> * Rebased to current master
> * Changed warn to use 'log_warn' instead
> * Reworded log message when waiting for lingering qemu process
>
>  PVE/QemuServer.pm | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 2adbe3a..f63dc3f 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -5884,15 +5884,41 @@ sub vm_start_nolock {
>          $tpmpid = start_swtpm($storecfg, $vmid, $tpm, $migratedfrom);
>      }
>
> -    my $exitcode = run_command($cmd, %run_params);
> -    if ($exitcode) {
> -        if ($tpmpid) {
> -            warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";
> -            kill 'TERM', $tpmpid;
> +    eval {
> +        my $exitcode = run_command($cmd, %run_params);
> +
> +        if ($exitcode) {
> +            if ($tpmpid) {
> +                log_warn "stopping swtpm instance (pid $tpmpid) due to QEMU startup error\n";

this warn -> log_warn change kind of slipped in, it's not really part of
this patch?

> +                kill 'TERM', $tpmpid;
> +            }
> +            die "QEMU exited with code $exitcode\n";
>          }
> -        die "QEMU exited with code $exitcode\n";
> +    };
> +
> +    if (my $err = $@) {
> +        my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid);
> +
> +        if ($pid ne "") {

can be combined:

    if (my $pid = ...) {
    }

(the empty string evaluates to false in perl ;)) - see the sketch at the
end of this mail.

> +            my $count = 0;
> +            my $timeout = 300;
> +
> +            print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
> +            while (($count < $timeout) &&
> +                PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
> +                $count++;
> +                sleep(1);
> +            }
> +

either here

> +            if ($count >= $timeout) {
> +                log_warn "Reached timeout. Terminating now with SIGKILL\n";

or here, recheck that the VM is still running and still has the same PID,
and log accordingly instead of KILLing if not (see the sketch at the end
of this mail). the same is also true in _do_vm_stop.

> +                kill(9, $pid);
> +            }
> +        }
> +
> +        die $err;
>      }
> -    };
> +    }
>      };
>
>      if ($conf->{hugepages}) {
> --
> 2.30.2
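
to illustrate both comments above in one place, a rough (untested!)
sketch of the error path with the combined condition and a PID re-check
before sending SIGKILL. this is a drop-in for the new `if (my $err = $@)`
hunk, not a complete function:

    if (my $err = $@) {
        # the helper returns the PID (truthy) or "" (falsy), so the
        # separate `ne ""` check can be folded into the condition
        if (my $pid = PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
            my $timeout = 300;
            my $count = 0;

            print "Waiting $timeout seconds for detached qemu process $pid to exit\n";
            while ($count < $timeout
                && PVE::QemuServer::Helpers::vm_running_locally($vmid)) {
                $count++;
                sleep(1);
            }

            if ($count >= $timeout) {
                # re-check that the process is still around *and* still has
                # the same PID before KILLing, and log accordingly if not
                my $current = PVE::QemuServer::Helpers::vm_running_locally($vmid);
                if ($current && $current eq $pid) {
                    log_warn "timeout reached, sending SIGKILL to PID $pid\n";
                    kill(9, $pid);
                } else {
                    log_warn "qemu process $pid exited after the timeout, not sending SIGKILL\n";
                }
            }
        }

        die $err;
    }

the same re-check pattern would apply in _do_vm_stop.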