From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH qemu-server v2 3/5] agent: implement fsfreeze helper to better handle lost commands
Date: Tue,  9 Sep 2025 15:26:00 +0200
Message-ID: <20250909132613.96402-4-f.ebner@proxmox.com>
In-Reply-To: <20250909132613.96402-1-f.ebner@proxmox.com>

As reported via enterprise support, it can happen that a guest agent
command is read, but the guest agent never sends an answer because the
service in the guest is stopped or killed, for example, if a guest
reboot happens before the command can be successfully executed. This is
usually not problematic, but the fsfreeze-freeze command has a timeout
of 1 hour, so the guest agent socket would be blocked for that amount
of time, waiting on a command that is no longer being executed.

Use a lower timeout of 10 minutes for the initial fsfreeze-freeze
command, and issue an fsfreeze-status command afterwards, which will
return immediately if the fsfreeze-freeze command already finished, and
which will be queued if not. The status command serves as a proxy to
determine whether the fsfreeze-freeze command is still running and to
check whether it was successful. Using too low a timeout would mean
queuing many fsfreeze-status commands while the guest agent might still
be busy actually doing the freeze. In total, fsfreeze-freeze is still
allowed to take 1 hour, but the time the socket is blocked after a
"lost command" is at most 10 minutes.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Changes in v2:
* Slightly improve log messages.
* Use POD for documentation.
* Mention in the commit message why an even lower timeout is not used.
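
For illustration only, the flow described in the commit message can be
sketched in standalone Perl. This is a minimal sketch, not the code from
the patch: freeze_with_status_poll() and the $send callback are
hypothetical stand-ins for guest_fsfreeze() and mon_cmd(), and the guest
agent is simulated so that the freeze answer is lost and the first two
status polls time out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for guest_fsfreeze(). $send is a code ref playing the
# role of mon_cmd(); it is an assumption for this sketch, not a qemu-server API.
sub freeze_with_status_poll {
    my ($send, $timeout, $max_polls) = @_;

    # Initial freeze with the short timeout. If $send dies (e.g. on timeout),
    # $result stays undef and we fall through to polling the status.
    my $result = eval { $send->('guest-fsfreeze-freeze', $timeout) };
    return 'frozen' if defined($result);

    # Each status poll may itself block for up to $timeout seconds.
    for my $i (1 .. $max_polls) {
        my $status = $send->('guest-fsfreeze-status', $timeout);
        if (ref($status) eq 'HASH' && $status->{'error-is-timeout'}) {
            next; # no answer yet: agent still busy or the command was lost
        }
        die "unexpected status '$status'\n" if $status ne 'frozen';
        return $status;
    }
    die "timeout after " . ($timeout * ($max_polls + 1) / 60) . " minutes\n";
}

# Simulated agent: the freeze answer is lost, the first two status polls time
# out, and the third poll reports 'frozen'.
my @calls;
my $polls = 0;
my $agent = sub {
    my ($cmd, $timeout) = @_;
    push @calls, $cmd;
    die "got timeout\n" if $cmd eq 'guest-fsfreeze-freeze';
    return { error => "got timeout\n", 'error-is-timeout' => 1 } if ++$polls < 3;
    return 'frozen';
};

my $status = freeze_with_status_poll($agent, 10 * 60, 5);
print "status: $status after " . scalar(@calls) . " commands\n";
# prints "status: frozen after 4 commands"
```

With a 10 minute timeout and 5 status polls, the overall budget in this
sketch is (5 + 1) * 10 = 60 minutes, while any single command blocks the
socket for at most 10 minutes.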

 src/PVE/QMPClient.pm           |  4 ++
 src/PVE/QemuConfig.pm          |  4 +-
 src/PVE/QemuServer/Agent.pm    | 68 ++++++++++++++++++++++++++++++++++
 src/PVE/QemuServer/BlockJob.pm |  2 +-
 src/PVE/VZDump/QemuServer.pm   |  4 +-
 5 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/src/PVE/QMPClient.pm b/src/PVE/QMPClient.pm
index 68ce0edb..1935a336 100644
--- a/src/PVE/QMPClient.pm
+++ b/src/PVE/QMPClient.pm
@@ -110,6 +110,8 @@ sub cmd {
         } elsif ($cmd->{execute} =~ m/^(eject|change)/) {
             $timeout = 60; # note: cdrom mount command is slow
         } elsif ($cmd->{execute} eq 'guest-fsfreeze-freeze') {
+            # consider using the guest_fsfreeze() helper in Agent.pm
+            #
             # freeze syncs all guest FS, if we kill it it stays in an unfreezable
             # locked state with high probability, so use an generous timeout
             $timeout = 60 * 60; # 1 hour
@@ -158,6 +160,7 @@ sub cmd {
     if (defined($queue_info->{error})) {
         die "VM $vmid qmp command '$cmd->{execute}' failed - $queue_info->{error}" if !$noerr;
         $result = { error => $queue_info->{error} };
+        $result->{'error-is-timeout'} = 1 if $queue_info->{'error-is-timeout'};
     }
 
     return $result;
@@ -484,6 +487,7 @@ sub mux_timeout {
 
     if (my $queue_info = &$lookup_queue_info($self, $fh)) {
         $queue_info->{error} = "got timeout\n";
+        $queue_info->{'error-is-timeout'} = 1;
         $self->{mux}->inbuffer($fh, ''); # clear to avoid warnings
     }
 
diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
index d0844c4c..97b2e8a5 100644
--- a/src/PVE/QemuConfig.pm
+++ b/src/PVE/QemuConfig.pm
@@ -312,8 +312,8 @@ sub __snapshot_freeze {
         eval { mon_cmd($vmid, "guest-fsfreeze-thaw"); };
         warn "guest-fsfreeze-thaw problems - $@" if $@;
     } else {
-        eval { mon_cmd($vmid, "guest-fsfreeze-freeze"); };
-        warn "guest-fsfreeze-freeze problems - $@" if $@;
+        eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
+        warn $@ if $@;
     }
 }
 
diff --git a/src/PVE/QemuServer/Agent.pm b/src/PVE/QemuServer/Agent.pm
index ee48e83e..9ec9c1de 100644
--- a/src/PVE/QemuServer/Agent.pm
+++ b/src/PVE/QemuServer/Agent.pm
@@ -131,4 +131,72 @@ sub qemu_exec_status {
     return $res;
 }
 
+=head3 guest_fsfreeze
+
+    guest_fsfreeze($vmid);
+
+Freeze the file systems of the guest C<$vmid>. Check that the guest agent is enabled and running
+before calling this function. Dies if the file systems cannot be frozen.
+
+With C<mon_cmd()>, it can happen that a guest agent command is read, but the guest agent never
+sends an answer because the service in the guest is stopped or killed, for example, if a guest
+reboot happens before the command can be successfully executed. This is usually not problematic,
+but the fsfreeze-freeze command should use a timeout of 1 hour, so the guest agent socket would be
+blocked for that amount of time, waiting on a command that is no longer being executed.
+
+This function uses a lower timeout for the initial fsfreeze-freeze command, and issues an
+fsfreeze-status command afterwards, which will return immediately if the fsfreeze-freeze command
+already finished, and which will be queued if not. This is used as a proxy to determine whether the
+fsfreeze-freeze command is still running and to check whether it was successful. Using too low a
+timeout would mean stuffing/queuing many fsfreeze-status commands while the guest agent might still
+be busy actually doing the freeze. In total, fsfreeze-freeze is still allowed to take 1 hour, but
+the time the socket is blocked after a lost command is at most 10 minutes.
+
+=cut
+
+sub guest_fsfreeze {
+    my ($vmid) = @_;
+
+    my $timeout = 10 * 60;
+
+    my $result = eval {
+        PVE::QemuServer::Monitor::mon_cmd($vmid, 'guest-fsfreeze-freeze', timeout => $timeout);
+    };
+    if ($result && ref($result) eq 'HASH' && $result->{error}) {
+        my $error = $result->{error}->{desc} // 'unknown';
+        die "unable to freeze guest fs - $error\n";
+    } elsif (defined($result)) {
+        return; # command successful
+    }
+
+    my $status;
+    eval {
+        my ($i, $last_iteration) = (0, 5);
+        while ($i < $last_iteration && !defined($status)) {
+            print "still waiting on guest fs freeze - timeout in "
+                . ($timeout * ($last_iteration - $i) / 60)
+                . " minutes\n";
+            $i++;
+
+            $status = PVE::QemuServer::Monitor::mon_cmd(
+                $vmid, 'guest-fsfreeze-status',
+                timeout => $timeout,
+                noerr => 1,
+            );
+
+            if ($status && ref($status) eq 'HASH' && $status->{'error-is-timeout'}) {
+                $status = undef;
+            } else {
+                check_agent_error($status, 'unknown error');
+            }
+        }
+        if (!defined($status)) {
+            die "timeout after " . ($timeout * ($last_iteration + 1) / 60) . " minutes\n";
+        }
+    };
+    die "querying status after freezing guest fs failed - $@" if $@;
+
+    die "unable to freeze guest fs - unexpected status '$status'\n" if $status ne 'frozen';
+}
+
 1;
diff --git a/src/PVE/QemuServer/BlockJob.pm b/src/PVE/QemuServer/BlockJob.pm
index 633c0b34..506010e1 100644
--- a/src/PVE/QemuServer/BlockJob.pm
+++ b/src/PVE/QemuServer/BlockJob.pm
@@ -165,7 +165,7 @@ sub qemu_drive_mirror_monitor {
                     my $agent_running = $qga && qga_check_running($vmid);
                     if ($agent_running) {
                         print "freeze filesystem\n";
-                        eval { mon_cmd($vmid, "guest-fsfreeze-freeze"); };
+                        eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
                         warn $@ if $@;
                     } else {
                         print "suspend vm\n";
diff --git a/src/PVE/VZDump/QemuServer.pm b/src/PVE/VZDump/QemuServer.pm
index 5b94c369..23ac74f7 100644
--- a/src/PVE/VZDump/QemuServer.pm
+++ b/src/PVE/VZDump/QemuServer.pm
@@ -1103,10 +1103,10 @@ sub qga_fs_freeze {
     }
 
     $self->loginfo("issuing guest-agent 'fs-freeze' command");
-    eval { mon_cmd($vmid, "guest-fsfreeze-freeze") };
+    eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
     $self->logerr($@) if $@;
 
-    return 1; # even on mon command error, ensure we always thaw again
+    return 1; # even on error, ensure we always thaw again
 }
 
 # only call if fs_freeze return 1
-- 
2.47.2




