From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dominik Csapak <d.csapak@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [RFC PATCH qemu-server] cleanup: refactor to make cleanup flow more consistent
Date: Mon, 23 Feb 2026 11:49:44 +0100
Message-ID: <20260223105542.1525232-1-d.csapak@proxmox.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Proxmox VE development discussion

There are two ways a cleanup can be triggered:

* When a guest is stopped/shutdown via the API, 'vm_stop' calls
  'vm_stop_cleanup'.
* When the guest process disconnects from qmeventd, 'qm cleanup' is called,
  which in turn also tries to call 'vm_stop_cleanup'.

Both of these happen under a qemu config lock, so there is no direct race
where they run concurrently, but it could happen that the 'qm cleanup' call
ran in addition to the API path, so cleanup was done twice. That is a
problem when the shutdown was called with 'keepActive', which 'qm cleanup'
knows nothing about and would ignore.

Also, the post-stop hook might not be triggered when e.g. a stop-mode backup
was done, since the hook only ran via 'qm cleanup', which would detect the
now again running guest and abort.

To improve the situation, we move the exec_hookscript call to the end of
vm_stop_cleanup. At this point we know the vm is stopped and we still hold
the config lock.

In _do_vm_stop (and in the one case for migration), a 'cleanup flag' is
created that marks the vm as cleaned up by the API, so 'qm cleanup' should
not do it again. On vm start, this flag is cleared.
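[Editor's illustration, not part of the patch: the flag-file protocol described above reduces to three file operations. The sketch below models them in shell; the vmid and the /tmp path are made up for the example, the real flag lives under /var/run/qemu-server.]

```shell
#!/bin/sh
# Sketch of the cleanup-flag protocol (illustrative paths, not the real ones).
vmid=100
flag="/tmp/$vmid.cleanup"

# _do_vm_stop (and the migration case in vm_stop): mark that the API side
# performs the cleanup, storing the current timestamp as the flag contents
date +%s > "$flag"

# qm cleanup: if the flag exists, the API already cleaned up, so skip
if [ -e "$flag" ]; then
    echo "Cleanup already done for $vmid, skipping..."
else
    echo "running vm_stop_cleanup for $vmid"
fi

# vm_start: clear the flag, so a later crash is again cleaned up via qmeventd
rm -f "$flag"
```

The remaining race the cover letter discusses follows directly from this model: if vm_start's `rm -f` runs before `qm cleanup` gets to its existence check, the check sees no flag and cleanup runs against a freshly started guest.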
There is still a tiny race possible: a guest is stopped from within (or
crashes) and the vm is started again via the API before 'qm cleanup' can run.
This should be a very rare case though, and all operations via the API
(reboot, shutdown+start, stop-mode backup, etc.) should work as intended.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
---
I'm not sure how we could possibly eliminate the race I mentioned:

* we can't block on start, because I don't think we can sensibly decide
  between:
  - vm was crashing/powered off
  - vm was never started

We could maybe leave a 'started' flag somewhere too and clear the cleanup
flag also in 'qm cleanup'; then we would start the vm only when the cleanup
flag is cleared (or have the cleanup flag have multiple states, like
'started', 'finished'). Though this has some regression potential, I think...

 src/PVE/CLI/qm.pm                |  6 +++---
 src/PVE/QemuServer.pm            |  6 ++++++
 src/PVE/QemuServer/Helpers.pm    | 23 +++++++++++++++++++++++
 src/test/MigrationTest/QmMock.pm |  9 +++++++++
 4 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/src/PVE/CLI/qm.pm b/src/PVE/CLI/qm.pm
index bdae9641..f31d2ee0 100755
--- a/src/PVE/CLI/qm.pm
+++ b/src/PVE/CLI/qm.pm
@@ -1120,13 +1120,13 @@ __PACKAGE__->register_method({
                 }
             }
 
-            if (!$clean || $guest) {
+            if (PVE::QemuServer::Helpers::cleanup_flag_exists($vmid)) {
+                warn "Cleanup already done for $vmid, skipping...\n";
+            } else {
                 # vm was shutdown from inside the guest or crashed, doing api cleanup
                 PVE::QemuServer::vm_stop_cleanup($storecfg, $vmid, $conf, 0, 0, 1);
             }
 
-            PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-stop');
-
             $restart = eval { PVE::QemuServer::clear_reboot_request($vmid) };
             warn $@ if $@;
         },
diff --git a/src/PVE/QemuServer.pm b/src/PVE/QemuServer.pm
index 545758dc..1253b601 100644
--- a/src/PVE/QemuServer.pm
+++ b/src/PVE/QemuServer.pm
@@ -5439,6 +5439,8 @@ sub vm_start {
     return PVE::QemuConfig->lock_config(
         $vmid,
         sub {
+            PVE::QemuServer::Helpers::clear_cleanup_flag($vmid);
+
             my $conf =
                 PVE::QemuConfig->load_config($vmid, $migrate_opts->{migratedfrom});
             die "you can't start a vm if it's a template\n"
@@ -6158,6 +6160,7 @@ sub vm_stop_cleanup {
         die $err if !$noerr;
         warn $err;
     }
+    PVE::GuestHelpers::exec_hookscript($conf, $vmid, 'post-stop');
 }
 
 # call only in locked context
@@ -6167,6 +6170,8 @@ sub _do_vm_stop {
     my $pid = check_running($vmid, $nocheck);
     return if !$pid;
 
+    PVE::QemuServer::Helpers::create_cleanup_flag($vmid);
+
     my $conf;
     if (!$nocheck) {
         $conf = PVE::QemuConfig->load_config($vmid);
@@ -6261,6 +6266,7 @@ sub vm_stop {
     $force = 1 if !defined($force) && !$shutdown;
 
     if ($migratedfrom) {
+        PVE::QemuServer::Helpers::create_cleanup_flag($vmid);
         my $pid = check_running($vmid, $nocheck, $migratedfrom);
         kill 15, $pid if $pid;
         my $conf = PVE::QemuConfig->load_config($vmid, $migratedfrom);
diff --git a/src/PVE/QemuServer/Helpers.pm b/src/PVE/QemuServer/Helpers.pm
index 35c00754..79277dd6 100644
--- a/src/PVE/QemuServer/Helpers.pm
+++ b/src/PVE/QemuServer/Helpers.pm
@@ -6,8 +6,10 @@ use warnings;
 use File::stat;
 use IO::File;
 use JSON;
+use POSIX qw();
 
 use PVE::Cluster;
+use PVE::File;
 use PVE::INotify;
 use PVE::ProcFSTools;
 use PVE::Tools qw(get_host_arch);
@@ -363,4 +365,25 @@ sub get_iscsi_initiator_name {
     return $initiator;
 }
 
+my sub get_cleanup_path {
+    my ($vmid) = @_;
+    return "/var/run/qemu-server/$vmid.cleanup";
+}
+
+sub create_cleanup_flag {
+    my ($vmid) = @_;
+    PVE::File::file_set_contents(get_cleanup_path($vmid), time());
+}
+
+sub clear_cleanup_flag {
+    my ($vmid) = @_;
+    my $path = get_cleanup_path($vmid);
+    unlink $path or $! == POSIX::ENOENT or die "removing cleanup flag for $vmid failed: $!\n";
+}
+
+sub cleanup_flag_exists {
+    my ($vmid) = @_;
+    return -f get_cleanup_path($vmid);
+}
+
 1;
diff --git a/src/test/MigrationTest/QmMock.pm b/src/test/MigrationTest/QmMock.pm
index 78be47d3..7d514644 100644
--- a/src/test/MigrationTest/QmMock.pm
+++ b/src/test/MigrationTest/QmMock.pm
@@ -75,6 +75,15 @@ $qemu_server_helpers_module->mock(
     vm_running_locally => sub {
         return $kvm_exectued;
     },
+    create_cleanup_flag => sub {
+        return undef;
+    },
+    clear_cleanup_flag => sub {
+        return undef;
+    },
+    cleanup_flag_exists => sub {
+        return ''; # what -f returns if a file does not exist
+    },
 );
 
 our $qemu_server_machine_module = Test::MockModule->new("PVE::QemuServer::Machine");
-- 
2.47.3