From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH qemu-server v2 3/5] agent: implement fsfreeze helper to better handle lost commands
Date: Tue,  9 Sep 2025 15:26:00 +0200
Message-ID: <20250909132613.96402-4-f.ebner@proxmox.com>
In-Reply-To: <20250909132613.96402-1-f.ebner@proxmox.com>

As reported in enterprise support, it can happen that a guest agent
command is read, but the guest agent never sends an answer because
the service in the guest is stopped or killed, for example, if the
guest reboots before the command can be executed successfully. This
is usually not problematic, but the fsfreeze-freeze command has a
timeout of 1 hour, so the guest agent socket would be blocked for that
amount of time, waiting on a command that is no longer being executed.

Use a lower timeout of 10 minutes for the initial fsfreeze-freeze
command, and issue an fsfreeze-status command afterwards, which will
return immediately if the fsfreeze-freeze command already finished,
and which will be queued if not. This is used as a proxy to determine
whether the fsfreeze-freeze command is still running and to check
whether it was successful. Using too low a timeout would mean queuing
many fsfreeze-status commands while the guest agent might still be
busy actually doing the freeze. In total, fsfreeze-freeze is still
allowed to take 1 hour (the initial 10 minutes plus up to five status
queries of 10 minutes each), but the time the socket is blocked after
a "lost command" is at most 10 minutes.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Changes in v2:
* Slightly improve log messages.
* Use POD for documentation.
* Mention in the commit message why an even lower timeout is not used.
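
As an illustration only (not part of the patch), the freeze-then-poll scheme
described in the commit message can be sketched in standalone Perl. The names
freeze_with_polling(), total_budget_minutes() and the $send callback are
hypothetical; $send stands in for mon_cmd() and is assumed to return undef, or
a hash with 'error-is-timeout' set, when no answer arrives in time.

```perl
use strict;
use warnings;

# Worst-case wait in minutes: one freeze attempt plus $max_polls status polls.
sub total_budget_minutes {
    my ($timeout, $max_polls) = @_;
    return $timeout * ($max_polls + 1) / 60;
}

# Hypothetical sketch: $send is a code reference taking (command, timeout).
sub freeze_with_polling {
    my ($send, $timeout, $max_polls) = @_;

    # Initial attempt; undef models "no answer within $timeout seconds".
    my $result = eval { $send->('guest-fsfreeze-freeze', $timeout) };
    return 'frozen' if defined($result);

    # Each status poll either times out (freeze presumably still running)
    # or returns the current freeze status.
    for (1 .. $max_polls) {
        my $status = $send->('guest-fsfreeze-status', $timeout);
        next if !defined($status);
        next if ref($status) eq 'HASH' && $status->{'error-is-timeout'};
        die "agent reported an error\n" if ref($status);
        die "unexpected status '$status'\n" if $status ne 'frozen';
        return $status;
    }
    die "timeout after " . total_budget_minutes($timeout, $max_polls) . " minutes\n";
}

# Mock agent: the freeze and the first two status polls time out, then the
# status query answers 'frozen'.
my @answers = (undef, { 'error-is-timeout' => 1 }, { 'error-is-timeout' => 1 }, 'frozen');
my @sent;
my $mock = sub {
    my ($cmd, $timeout) = @_;
    push @sent, $cmd;
    return shift @answers;
};

my $status = freeze_with_polling($mock, 10 * 60, 5);
print "status: $status after ", scalar(@sent), " commands\n";
print "worst case: ", total_budget_minutes(10 * 60, 5), " minutes\n";
```

With a 10 minute timeout and five polls, the worst case adds up to the same
1 hour budget as before, while a single lost command blocks the socket for at
most one timeout period.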

 src/PVE/QMPClient.pm           |  4 ++
 src/PVE/QemuConfig.pm          |  4 +-
 src/PVE/QemuServer/Agent.pm    | 68 ++++++++++++++++++++++++++++++++++
 src/PVE/QemuServer/BlockJob.pm |  2 +-
 src/PVE/VZDump/QemuServer.pm   |  4 +-
 5 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/src/PVE/QMPClient.pm b/src/PVE/QMPClient.pm
index 68ce0edb..1935a336 100644
--- a/src/PVE/QMPClient.pm
+++ b/src/PVE/QMPClient.pm
@@ -110,6 +110,8 @@ sub cmd {
         } elsif ($cmd->{execute} =~ m/^(eject|change)/) {
             $timeout = 60; # note: cdrom mount command is slow
         } elsif ($cmd->{execute} eq 'guest-fsfreeze-freeze') {
+            # consider using the guest_fsfreeze() helper in Agent.pm
+            #
             # freeze syncs all guest FS, if we kill it it stays in an unfreezable
             # locked state with high probability, so use an generous timeout
             $timeout = 60 * 60; # 1 hour
@@ -158,6 +160,7 @@ sub cmd {
     if (defined($queue_info->{error})) {
         die "VM $vmid qmp command '$cmd->{execute}' failed - $queue_info->{error}" if !$noerr;
         $result = { error => $queue_info->{error} };
+        $result->{'error-is-timeout'} = 1 if $queue_info->{'error-is-timeout'};
     }
 
     return $result;
@@ -484,6 +487,7 @@ sub mux_timeout {
 
     if (my $queue_info = &$lookup_queue_info($self, $fh)) {
         $queue_info->{error} = "got timeout\n";
+        $queue_info->{'error-is-timeout'} = 1;
         $self->{mux}->inbuffer($fh, ''); # clear to avoid warnings
     }
 
diff --git a/src/PVE/QemuConfig.pm b/src/PVE/QemuConfig.pm
index d0844c4c..97b2e8a5 100644
--- a/src/PVE/QemuConfig.pm
+++ b/src/PVE/QemuConfig.pm
@@ -312,8 +312,8 @@ sub __snapshot_freeze {
         eval { mon_cmd($vmid, "guest-fsfreeze-thaw"); };
         warn "guest-fsfreeze-thaw problems - $@" if $@;
     } else {
-        eval { mon_cmd($vmid, "guest-fsfreeze-freeze"); };
-        warn "guest-fsfreeze-freeze problems - $@" if $@;
+        eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
+        warn $@ if $@;
     }
 }
 
diff --git a/src/PVE/QemuServer/Agent.pm b/src/PVE/QemuServer/Agent.pm
index ee48e83e..9ec9c1de 100644
--- a/src/PVE/QemuServer/Agent.pm
+++ b/src/PVE/QemuServer/Agent.pm
@@ -131,4 +131,72 @@ sub qemu_exec_status {
     return $res;
 }
 
+=head3 guest_fsfreeze
+
+    guest_fsfreeze($vmid);
+
+Freeze the file systems of the guest C<$vmid>. Check that the guest agent is enabled and running
+before calling this function. Dies if the file systems cannot be frozen.
+
+With C<mon_cmd()>, it can happen that a guest agent command is read, but the guest agent never
+sends an answer because the service in the guest is stopped or killed, for example, if the guest
+reboots before the command can be executed successfully. This is usually not problematic, but the
+fsfreeze-freeze command should use a timeout of 1 hour, so the guest agent socket would be blocked
+for that amount of time, waiting on a command that is no longer being executed.
+
+This function uses a lower timeout for the initial fsfreeze-freeze command, and issues an
+fsfreeze-status command afterwards, which will return immediately if the fsfreeze-freeze command
+already finished, and which will be queued if not. This is used as a proxy to determine whether the
+fsfreeze-freeze command is still running and to check whether it was successful. Using too low a
+timeout would mean queuing many fsfreeze-status commands while the guest agent might still
+be busy actually doing the freeze. In total, fsfreeze-freeze is still allowed to take 1 hour, but
+the time the socket is blocked after a lost command is at most 10 minutes.
+
+=cut
+
+sub guest_fsfreeze {
+    my ($vmid) = @_;
+
+    my $timeout = 10 * 60;
+
+    my $result = eval {
+        PVE::QemuServer::Monitor::mon_cmd($vmid, 'guest-fsfreeze-freeze', timeout => $timeout);
+    };
+    if ($result && ref($result) eq 'HASH' && $result->{error}) {
+        my $error = $result->{error}->{desc} // 'unknown';
+        die "unable to freeze guest fs - $error\n";
+    } elsif (defined($result)) {
+        return; # command successful
+    }
+
+    my $status;
+    eval {
+        my ($i, $last_iteration) = (0, 5);
+        while ($i < $last_iteration && !defined($status)) {
+            print "still waiting on guest fs freeze - timeout in "
+                . ($timeout * ($last_iteration - $i) / 60)
+                . " minutes\n";
+            $i++;
+
+            $status = PVE::QemuServer::Monitor::mon_cmd(
+                $vmid, 'guest-fsfreeze-status',
+                timeout => $timeout,
+                noerr => 1,
+            );
+
+            if ($status && ref($status) eq 'HASH' && $status->{'error-is-timeout'}) {
+                $status = undef;
+            } else {
+                check_agent_error($status, 'unknown error');
+            }
+        }
+        if (!defined($status)) {
+            die "timeout after " . ($timeout * ($last_iteration + 1) / 60) . " minutes\n";
+        }
+    };
+    die "querying status after freezing guest fs failed - $@" if $@;
+
+    die "unable to freeze guest fs - unexpected status '$status'\n" if $status ne 'frozen';
+}
+
 1;
diff --git a/src/PVE/QemuServer/BlockJob.pm b/src/PVE/QemuServer/BlockJob.pm
index 633c0b34..506010e1 100644
--- a/src/PVE/QemuServer/BlockJob.pm
+++ b/src/PVE/QemuServer/BlockJob.pm
@@ -165,7 +165,7 @@ sub qemu_drive_mirror_monitor {
                     my $agent_running = $qga && qga_check_running($vmid);
                     if ($agent_running) {
                         print "freeze filesystem\n";
-                        eval { mon_cmd($vmid, "guest-fsfreeze-freeze"); };
+                        eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
                         warn $@ if $@;
                     } else {
                         print "suspend vm\n";
diff --git a/src/PVE/VZDump/QemuServer.pm b/src/PVE/VZDump/QemuServer.pm
index 5b94c369..23ac74f7 100644
--- a/src/PVE/VZDump/QemuServer.pm
+++ b/src/PVE/VZDump/QemuServer.pm
@@ -1103,10 +1103,10 @@ sub qga_fs_freeze {
     }
 
     $self->loginfo("issuing guest-agent 'fs-freeze' command");
-    eval { mon_cmd($vmid, "guest-fsfreeze-freeze") };
+    eval { PVE::QemuServer::Agent::guest_fsfreeze($vmid); };
     $self->logerr($@) if $@;
 
-    return 1; # even on mon command error, ensure we always thaw again
+    return 1; # even on error, ensure we always thaw again
 }
 
 # only call if fs_freeze return 1
-- 
2.47.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Thread overview: 6+ messages
2025-09-09 13:25 [pve-devel] [PATCH-SERIES qemu-server v2 0/5] guest agent: better handle lost freeze command Fiona Ebner
2025-09-09 13:25 ` [pve-devel] [PATCH qemu-server v2 1/5] api: agent: improve module imports Fiona Ebner
2025-09-09 13:25 ` [pve-devel] [PATCH qemu-server v2 2/5] qmp client: remove erroneous comment Fiona Ebner
2025-09-09 13:26 ` Fiona Ebner [this message]
2025-09-09 13:26 ` [pve-devel] [PATCH qemu-server v2 4/5] agent: prefer usage of get_qga_key() helper Fiona Ebner
2025-09-09 13:26 ` [pve-devel] [PATCH qemu-server v2 5/5] agent: move guest agent format and parsing to agent module Fiona Ebner