From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9]) by lore.proxmox.com (Postfix) with ESMTPS id 002761FF136 for ; Mon, 23 Feb 2026 16:49:14 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 0AC291214A; Mon, 23 Feb 2026 16:50:06 +0100 (CET) Message-ID: <2c237643-99bd-4e9b-8838-efa9a97f9d75@proxmox.com> Date: Mon, 23 Feb 2026 16:49:59 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH qemu-server] cleanup: refactor to make cleanup flow more consistent To: Dominik Csapak , pve-devel@lists.proxmox.com References: <20260223105542.1525232-1-d.csapak@proxmox.com> Content-Language: en-US From: Fiona Ebner In-Reply-To: <20260223105542.1525232-1-d.csapak@proxmox.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Bm-Milter-Handled: 55990f41-d878-4baa-be0a-ee34c49e34d2 X-Bm-Transport-Timestamp: 1771861785275 X-SPAM-LEVEL: Spam detection results: 0 AWL -1.080 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment RCVD_IN_VALIDITY_CERTIFIED_BLOCKED 0.798 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_RPBL_BLOCKED 0.79 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. RCVD_IN_VALIDITY_SAFE_BLOCKED 0.547 ADMINISTRATOR NOTICE: The query to Validity was blocked. See https://knowledge.validity.com/hc/en-us/articles/20961730681243 for more information. SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record Message-ID-Hash: CACNL766IBJV2LU6FCCNADTNLDBNYNK5 X-Message-ID-Hash: CACNL766IBJV2LU6FCCNADTNLDBNYNK5 X-MailFrom: f.ebner@proxmox.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list List-Id: Proxmox VE development discussion List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Am 23.02.26 um 11:56 AM schrieb Dominik Csapak: > There are two ways a cleanup can be triggered: > > * When a guest is stopped/shutdown via the API, 'vm_stop' calls 'vm_stop_cleanup'. > * When the guest process disconnects from qmeventd, 'qm cleanup' is > called, which in turn also tries to call 'vm_stop_cleanup'. > > Both of these happen under a qemu config lock, so there is no direct > race condition that it will be called out of order, but it could happen > that the 'qm cleanup' call happened in addition so cleanup was called > twice. Which could be a problem when the shutdown was called with > 'keepActive' which 'qm cleanup' would simply know nothing of and ignore. > > Also the post-stop hook might not be triggered in case e.g. a stop-mode > backup was done, since that was only happening via qm cleanup and this > would detect the now again running guest and abort. > > To improve the situation we move the exec_hookscript call at the end > of vm_stop_cleanup. At this point we know the vm is stopped and we still > have the config lock. > > On _do_vm_stop (and in the one case for migration) a 'cleanup-flag' is > created that marks the vm is cleaned up by the api, so 'qm cleanup' > should not do it again. > > On vm start, this flag is cleared. > > There is still a tiny race possible: > > a guest is stopped from within (or crashes) and the vm is started again > via the api before 'qm cleanup' can run > > This should be a very rare case though, and all operation via the API > (reboot, shutdown+start, stop-mode backup, etc.) should work as intended. How difficult is it to trigger the race with an HA-managed VM? > Signed-off-by: Dominik Csapak > --- > I'm not sure how we could possibly eliminate the race i mentioned: > * we can't block on start because i don't think we can sensibly decide between: > - vm was crashing/powered off > - vm was never started > > We could maybe leave a 'started' flag somewhere too and clear the > cleanup flag also in 'qm cleanup', then we would start the vm > only when the cleanup flag is cleared > (or have the cleanup flags have multiple states, like 'started', > 'finished') > > Though this has some regression potential i think... > > src/PVE/CLI/qm.pm | 6 +++--- > src/PVE/QemuServer.pm | 6 ++++++ > src/PVE/QemuServer/Helpers.pm | 23 +++++++++++++++++++++++ > src/test/MigrationTest/QmMock.pm | 9 +++++++++ > 4 files changed, 41 insertions(+), 3 deletions(-) > > diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm > index bdae9641..f31d2ee0 100755 > --- a/src/PVE/CLI/qm.pm > +++ b/src/PVE/CLI/qm.pm > @@ -1120,13 +1120,13 @@ __PACKAGE__->register_method({ > } > } > > - if (!$clean || $guest) { > + if (PVE::QemuServer::Helpers::cleanup_flag_exists($vmid)) { > + warn "Cleanup already done for $vmid, skipping...\n"; > + } else { > # vm was shutdown from inside the guest or crashed, doing api cleanup > PVE::QemuServer::vm_stop_cleanup($storecfg, $vmid, $conf, 0, 0, 1); > } > > - PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-stop'); > - > $restart = eval { PVE::QemuServer::clear_reboot_request($vmid) }; > warn $@ if $@; > }, > diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm > index 545758dc..1253b601 100644 > --- a/src/PVE/QemuServer.pm > +++ b/src/PVE/QemuServer.pm > @@ -5439,6 +5439,8 @@ sub vm_start { > return PVE::QemuConfig->lock_config( > $vmid, > sub { > + PVE::QemuServer::Helpers::clear_cleanup_flag($vmid); > + > my $conf = PVE::QemuConfig->load_config($vmid, $migrate_opts->{migratedfrom}); > > die "you can't start a vm if it's a template\n" > @@ -6158,6 +6160,7 @@ sub vm_stop_cleanup { > die $err if !$noerr; > warn $err; > } > + PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-stop'); > } > > # call only in locked context > @@ -6167,6 +6170,8 @@ sub _do_vm_stop { > my $pid = check_running($vmid, $nocheck); > return if !$pid; > > + PVE::QemuServer::Helpers::create_cleanup_flag($vmid); We don't do the cleanup if $nohcheck=1, so we may not create the flag then. > + > my $conf; > if (!$nocheck) { > $conf = PVE::QemuConfig->load_config($vmid); > @@ -6261,6 +6266,7 @@ sub vm_stop { > $force = 1 if !defined($force) && !$shutdown; > > if ($migratedfrom) { > + PVE::QemuServer::Helpers::create_cleanup_flag($vmid); > my $pid = check_running($vmid, $nocheck, $migratedfrom); > kill 15, $pid if $pid; > my $conf = PVE::QemuConfig->load_config($vmid, $migratedfrom); > diff --git a/src/PVE/QemuServer/Helpers.pm b/src/PVE/QemuServer/Helpers.pm > index 35c00754..79277dd6 100644 > --- a/src/PVE/QemuServer/Helpers.pm > +++ b/src/PVE/QemuServer/Helpers.pm > @@ -6,8 +6,10 @@ use warnings; > use File::stat; > use IO::File; > use JSON; > +use POSIX qw(); > > use PVE::Cluster; > +use PVE::File; > use PVE::INotify; > use PVE::ProcFSTools; > use PVE::Tools qw(get_host_arch); > @@ -363,4 +365,25 @@ sub get_iscsi_initiator_name { > return $initiator; > } > > +my sub get_cleanup_path { > + my ($vmid) = @_; > + return "/var/run/qemu-server/$vmid.cleanup"; > +} > + > +sub create_cleanup_flag { > + my ($vmid) = @_; > + PVE::File::file_set_contents(get_cleanup_path($vmid), time()); > +} > + > +sub clear_cleanup_flag { > + my ($vmid) = @_; > + my $path = get_cleanup_path($vmid); > + unlink $path or $! == POSIX::ENOENT or die "removing cleanup flag for $vmid failed: $!\n"; > +} > + > +sub cleanup_flag_exists { > + my ($vmid) = @_; > + return -f get_cleanup_path($vmid); > +} > + Maybe those helpers are better placed in the RunState module rather than the generic Helper module? > 1; > diff --git a/src/test/MigrationTest/QmMock.pm b/src/test/MigrationTest/QmMock.pm > index 78be47d3..7d514644 100644 > --- a/src/test/MigrationTest/QmMock.pm > +++ b/src/test/MigrationTest/QmMock.pm > @@ -75,6 +75,15 @@ $qemu_server_helpers_module->mock( > vm_running_locally => sub { > return $kvm_exectued; > }, > + create_cleanup_flag => sub { > + return undef; > + }, > + clear_cleanup_flag => sub { > + return undef; > + }, > + cleanup_flag_exists => sub { > + return ''; # what -f returns if a file does not exist > + }, > ); > > our $qemu_server_machine_module = Test::MockModule->new("PVE::QemuServer::Machine");