public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH qemu-server 0/2] migration: fix sporadic nbd-server-stop timeout
From: Alexandre Derumier @ 2023-09-29  8:28 UTC
  To: pve-devel

Hi,

We hit sporadic nbdstop errors when migrating VMs with RBD storage and writeback cache between two different clusters:
(This is without my other targetcpu patch.)


2023-09-28 16:20:39 ERROR: error - tunnel command '{"cmd":"nbdstop"}' failed - failed to handle 'nbdstop' command - VM 140 qmp command 'nbd-server-stop' failed - got timeout
2023-09-28 16:20:39 ERROR: migration finished with problems (duration 00:01:42)


I'm not sure, but it may be related to writeback: it never happened with a freshly started VM, but VMs that have been running for some time can trigger it.
(Perhaps NBD needs to flush pending data from the cache?)


Currently, the tunnel command has a 30s timeout, but the QMP command only has 5s.
Also, the tunnel v2 code path doesn't wrap the nbdstop command in an eval, so the migration aborts and leaves both the source and target VMs locked.
Unlocking the target VM and resuming it manually works, so the timeout really does seem to be too low.
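
To make the relationship between the two timeouts concrete, here is a small self-contained Perl model. It is a sketch under stated assumptions, not qemu-server code: `simulate_qmp` and `simulate_nbdstop` are hypothetical names, and the 8s drain duration is a made-up input. Only the 30s tunnel timeout and the 5s/25s QMP timeouts come from this thread. The model assumes the target answers over the tunnel as soon as its QMP call finishes or times out, which is presumably why the new QMP timeout (25s) stays below the tunnel's 30s.

```perl
use strict;
use warnings;

# Hypothetical model, not the qemu-server API. Only the timeout values come
# from this thread: 30s for the tunnel, 5s (old) or 25s (new) for QMP.
my $TUNNEL_TIMEOUT = 30;

# Target side: a QMP command that needs $needed seconds and is abandoned
# after $qmp_timeout seconds.
sub simulate_qmp {
    my ($needed, $qmp_timeout) = @_;
    die "qmp command 'nbd-server-stop' failed - got timeout\n"
        if $needed > $qmp_timeout;
    return 'ok';
}

# Source side: the tunnel waits at most $TUNNEL_TIMEOUT seconds for an answer.
# The target answers once its QMP call finishes or times out, whichever is first.
sub simulate_nbdstop {
    my ($needed, $qmp_timeout) = @_;
    my $answer_after = $needed < $qmp_timeout ? $needed : $qmp_timeout;
    die "tunnel command timed out\n" if $answer_after > $TUNNEL_TIMEOUT;
    return simulate_qmp($needed, $qmp_timeout);
}

# A hypothetical 8s drain fails with the old 5s default but fits in 25s.
for my $qmp_timeout (5, 25) {
    my $res = eval { simulate_nbdstop(8, $qmp_timeout) };
    print "qmp_timeout=${qmp_timeout}s: ", (defined $res ? "$res\n" : "error: $@");
}
```

In this model, any QMP timeout above the tunnel's 30s would let a slow drain surface only as a generic tunnel timeout, hiding the target's own error message.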


Alexandre Derumier (2):
  nbd_stop: increase timeout to 25s
  migration: add missing eval on nbdstop with tunnel v2.

 PVE/QemuMigrate.pm | 8 +++++++-
 PVE/QemuServer.pm  | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

-- 
2.39.2





* [pve-devel] [PATCH qemu-server 1/2] nbd_stop: increase timeout to 25s
From: Alexandre Derumier @ 2023-09-29  8:28 UTC
  To: pve-devel

---
 PVE/QemuServer.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index 1b1ccf4..0259c0f 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -8267,7 +8267,7 @@ sub generate_smbios1_uuid {
 sub nbd_stop {
     my ($vmid) = @_;
 
-    mon_cmd($vmid, 'nbd-server-stop');
+    mon_cmd($vmid, 'nbd-server-stop', timeout => 25);
 }
 
 sub create_reboot_request {

* [pve-devel] [PATCH qemu-server 2/2] migration: add missing eval on nbdstop with tunnel v2.
From: Alexandre Derumier @ 2023-09-29  8:28 UTC
  To: pve-devel

This is already done for tunnel v1.

Avoid aborting the migration (and keeping both the source and target VMs locked) if an nbdstop error occurs:

2023-09-28 16:20:39 ERROR: error - tunnel command '{"cmd":"nbdstop"}' failed - failed to handle 'nbdstop' command - VM 140 qmp command 'nbd-server-stop' failed - got timeout
2023-09-28 16:20:39 ERROR: migration finished with problems (duration 00:01:42)
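
The fix follows the usual Perl pattern of wrapping the call in `eval` and inspecting `$@`. The sketch below reproduces that pattern with stand-ins: `MockMigrate`, `write_tunnel_nbdstop` and `stop_nbd_v2` are hypothetical names, not `PVE::QemuMigrate` or `PVE::Tunnel`; only the `log('err', ...)` plus `{errors} = 1` handling mirrors the patch. It shows that a failing nbdstop is recorded while the remaining cleanup still runs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the migration object: only the members the patched
# code touches are modeled (a log method and the errors flag).
package MockMigrate;

sub new {
    my ($class) = @_;
    return bless({ errors => 0, messages => [] }, $class);
}

sub log {
    my ($self, $level, $msg) = @_;
    push @{$self->{messages}}, "$level: $msg";
}

package main;

# Hypothetical tunnel call that fails the way the target did in the report.
sub write_tunnel_nbdstop {
    die "tunnel command '{\"cmd\":\"nbdstop\"}' failed - got timeout\n";
}

# Same shape as the patched v2 branch: catch the error, log it, mark the
# migration as having problems, then carry on with the remaining cleanup
# instead of dying out of the cleanup phase.
sub stop_nbd_v2 {
    my ($self) = @_;
    eval { write_tunnel_nbdstop() };
    if (my $err = $@) {
        $self->log('err', $err);
        $self->{errors} = 1;
    }
    # Placeholder for the steps that follow (e.g. resuming/unlocking the VM).
    $self->log('info', 'continuing cleanup');
}

my $migration = MockMigrate->new;
stop_nbd_v2($migration);
print "$_\n" for @{$migration->{messages}};
```

Setting the errors flag rather than dying lets the migration report "finished with problems" while the later cleanup steps still get their chance to run.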
---
 PVE/QemuMigrate.pm | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm
index f41c61f..81880e5 100644
--- a/PVE/QemuMigrate.pm
+++ b/PVE/QemuMigrate.pm
@@ -1475,7 +1475,13 @@ sub phase3_cleanup {
 	    $self->log('info', "stopping NBD storage migration server on target.");
 	    # stop nbd server on remote vm - requirement for resume since 2.9
 	    if ($tunnel && $tunnel->{version} && $tunnel->{version} >= 2) {
-		PVE::Tunnel::write_tunnel($tunnel, 30, 'nbdstop');
+		eval {
+		    PVE::Tunnel::write_tunnel($tunnel, 30, 'nbdstop');
+		};
+		if (my $err = $@) {
+		    $self->log('err', $err);
+		    $self->{errors} = 1;
+		}
 	    } else {
 		my $cmd = [@{$self->{rem_ssh}}, 'qm', 'nbdstop', $vmid];
 

* Re: [pve-devel] [PATCH qemu-server 0/2] migration: fix sporadic nbd-server-stop timeout
From: Fiona Ebner @ 2023-09-29 11:57 UTC
  To: Proxmox VE development discussion, Alexandre Derumier

On 29.09.23 at 10:28, Alexandre Derumier wrote:
> I'm not sure, but it may be related to writeback: it never happened with a freshly started VM, but VMs that have been running for some time can trigger it.
> (Perhaps NBD needs to flush pending data from the cache?)
> 

It does drain the export's BlockBackend, i.e. it waits for all pending IO
before detaching/closing the export. But we already cancelled the mirror
job, which should itself wait for any in-flight IO, so it's a bit
surprising. Maybe there's some cache interaction happening at an
inconvenient time, no idea ¯\_(ツ)_/¯

The other thing it does is close the connection to the client, so
there is at least that IO interaction, and a higher timeout makes sense.

> 
> Alexandre Derumier (2):
>   nbd_stop: increase timeout to 25s
>   migration: add missing eval on nbdstop with tunnel v2.
> 

Patches look good to me, so consider them

Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>

but they are missing your Signed-off-by trailer. Can you please add that?





* [pve-devel] applied: [PATCH qemu-server 0/2] migration: fix sporadic nbd-server-stop timeout
From: Thomas Lamprecht @ 2023-11-06 18:48 UTC
  To: Proxmox VE development discussion, Alexandre Derumier

On 29/09/2023 at 10:28, Alexandre Derumier wrote:
> We hit sporadic nbdstop errors when


applied both patches, thanks!





Service provided by Proxmox Server Solutions GmbH