* [pve-devel] [RFC qemu-server/common] fix #4501: improve port reservation for QEMU TCP migration
@ 2023-11-14 14:02 Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together Fiona Ebner
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Fiona Ebner @ 2023-11-14 14:02 UTC (permalink / raw)
To: pve-devel
Each patch is a different approach for improving the situation and
each subset could be applied. Personally, I like common 2/2, because
it removes the competition for early ports and IMHO the only one
worth considering a full fix, but it is a bit complex.
Another approach (not in the RFC, also could be considered a full fix)
would be to opt-in for a higher expire time for migration ports, add a
mechanism to remove the reservation and have vm_start_nolock() remove
the reservation after it made sure that QEMU got the port.
qemu-server:
Fiona Ebner (1):
partially fix #4501: migration: start vm: move port reservation and
usage closer together
PVE/QemuServer.pm | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
common:
Fiona Ebner (2):
partially fix #4501: next unused port: bump port reservation
expiretime
fix #4501: next unused port: work around issue with too short
expiretime
src/PVE/Tools.pm | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together
2023-11-14 14:02 [pve-devel] [RFC qemu-server/common] fix #4501: improve port reservation for QEMU TCP migration Fiona Ebner
@ 2023-11-14 14:02 ` Fiona Ebner
2023-11-15 8:55 ` Fabian Grünbichler
2023-11-14 14:02 ` [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime Fiona Ebner
2 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-14 14:02 UTC (permalink / raw)
To: pve-devel
Currently, volume activation, PCI reservation and resetting systemd
scope happen in between and the 5 second expiretime used for port
reservation might not be enough.
Still not ideal, because entering systemd scope and maybe starting
swtpm still happen after reservation before the QEMU binary can be
invoked and actually use the port, but the reservation needs to happen
outside of the fork, because the result is used there too.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
PVE/QemuServer.pm | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
index c465fb6f..aeaea8eb 100644
--- a/PVE/QemuServer.pm
+++ b/PVE/QemuServer.pm
@@ -5697,6 +5697,9 @@ sub vm_start_nolock {
return $migration_ip;
};
+ # helper to move port reservation and usage closer together to avoid expiry (bug #4501)
+ my $append_tcp_migration_cmdline;
+
if ($statefile) {
if ($statefile eq 'tcp') {
my $migrate = $res->{migrate} = { proto => 'tcp' };
@@ -5717,12 +5720,13 @@ sub vm_start_nolock {
$migrate->{addr} = "[$migrate->{addr}]" if Net::IP::ip_is_ipv6($migrate->{addr});
}
- my $pfamily = PVE::Tools::get_host_address_family($nodename);
- $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
- $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
- push @$cmd, '-incoming', $migrate->{uri};
- push @$cmd, '-S';
-
+ $append_tcp_migration_cmdline = sub {
+ my $pfamily = PVE::Tools::get_host_address_family($nodename);
+ $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
+ $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
+ push @$cmd, '-incoming', $migrate->{uri};
+ push @$cmd, '-S';
+ };
} elsif ($statefile eq 'unix') {
# should be default for secure migrations as a ssh TCP forward
# tunnel is not deterministic reliable ready and fails regurarly
@@ -5840,6 +5844,10 @@ sub vm_start_nolock {
$systemd_properties{timeout} = 10 if $statefile; # setting up the scope shoul be quick
my $run_qemu = sub {
+ # sets the port+uri for $res->{migrate} which is printed below and part of the result, so
+ # needs to happen outside of the fork.
+ $append_tcp_migration_cmdline->() if $append_tcp_migration_cmdline;
+
PVE::Tools::run_fork sub {
PVE::Systemd::enter_systemd_scope($vmid, "Proxmox VE VM $vmid", %systemd_properties);
--
2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime
2023-11-14 14:02 [pve-devel] [RFC qemu-server/common] fix #4501: improve port reservation for QEMU TCP migration Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together Fiona Ebner
@ 2023-11-14 14:02 ` Fiona Ebner
2023-11-15 8:51 ` Fabian Grünbichler
2023-11-14 14:02 ` [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime Fiona Ebner
2 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-14 14:02 UTC (permalink / raw)
To: pve-devel
For QEMU migration via TCP, there's a bit of time between port
reservation and usage, because currently, the port needs to be
reserved before invoking a fork, where the systemd scope needs to be
set up and swtpm might need to be started before the QEMU binary can
be invoked and actually use the port.
Not bumping too much, because mass migration with many small VMs might
need to re-use the ports rather quickly (there's only 50 ports).
The other two usages of the function are for VNC and SPICE, with
100 and 999 ports respectively. And for those, ports are usually used
for longer than 30 seconds anyways, so the higher expire time should
be fine.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
src/PVE/Tools.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/PVE/Tools.pm b/src/PVE/Tools.pm
index b3af2c6..4d018e9 100644
--- a/src/PVE/Tools.pm
+++ b/src/PVE/Tools.pm
@@ -918,7 +918,7 @@ sub next_unused_port {
my $code = sub {
- my $expiretime = 5;
+ my $expiretime = 30;
my $ctime = time();
my $ports = {};
--
2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime
2023-11-14 14:02 [pve-devel] [RFC qemu-server/common] fix #4501: improve port reservation for QEMU TCP migration Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime Fiona Ebner
@ 2023-11-14 14:02 ` Fiona Ebner
2023-11-14 14:13 ` Fiona Ebner
2 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-14 14:02 UTC (permalink / raw)
To: pve-devel
For QEMU migration via TCP, there's a bit of time between port
reservation and usage, because currently, the port needs to be
reserved before doing a fork, where the systemd scope needs to be set
up and swtpm might need to be started before the QEMU binary can be
invoked and actually use the port.
To improve the situation, get the latest port recorded in the
reservation file and start trying from the next port, wrapping around
when hitting the end. Drastically reduces the chances to run into a
conflict, because after a given port reservation, all other ports are
tried first before returning to that port.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
src/PVE/Tools.pm | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/src/PVE/Tools.pm b/src/PVE/Tools.pm
index 4d018e9..820229d 100644
--- a/src/PVE/Tools.pm
+++ b/src/PVE/Tools.pm
@@ -923,6 +923,11 @@ sub next_unused_port {
my $ports = {};
+ # Avoid that bulk actions compete for the first few ports by detecting the latest
+ # (previously) used port and start checking from there when trying to get a reservation.
+ my $latest_timestamp = 0;
+ my $latest_port = $range_end - 1;
+
if (my $fh = IO::File->new ($filename, "r")) {
while (my $line = <$fh>) {
if ($line =~ m/^(\d+)\s(\d+)$/) {
@@ -930,6 +935,14 @@ sub next_unused_port {
if (($timestamp + $expiretime) > $ctime) {
$ports->{$port} = $timestamp; # not expired
}
+ if (
+ $port >= $range_start
+ && $port < $range_end
+ && $timestamp > $latest_timestamp
+ ) {
+ $latest_timestamp = $timestamp;
+ $latest_port = $port;
+ }
}
}
}
@@ -942,7 +955,11 @@ sub next_unused_port {
GetAddrInfoFlags => 0);
$sockargs{LocalAddr} = $address if defined($address);
- for (my $p = $range_start; $p < $range_end; $p++) {
+ my $range = $range_end - $range_start;
+ for (my $offset = 1; $offset <= $range; $offset++) {
+ my $p = $latest_port + $offset;
+ $p -= $range if $p >= $range_end; # wrap around
+
next if $ports->{$p}; # reserved
$sockargs{LocalPort} = $p;
--
2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime
2023-11-14 14:02 ` [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime Fiona Ebner
@ 2023-11-14 14:13 ` Fiona Ebner
2023-11-15 8:51 ` Fabian Grünbichler
0 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-14 14:13 UTC (permalink / raw)
To: pve-devel
Am 14.11.23 um 15:02 schrieb Fiona Ebner:
> For QEMU migration via TCP, there's a bit of time between port
> reservation and usage, because currently, the port needs to be
> reserved before doing a fork, where the systemd scope needs to be set
> up and swtpm might need to be started before the QEMU binary can be
> invoked and actually use the port.
>
> To improve the situation, get the latest port recorded in the
> reservation file and start trying from the next port, wrapping around
> when hitting the end. Drastically reduces the chances to run into a
> conflict, because after a given port reservation, all other ports are
> tried first before returning to that port.
Sorry, this is not true. It can be that in the meantime, a port for a
different range is reserved and that will remove the reservation for the
port in the migration range if expired. So we'd need to change the code
to remove only reservations from the current range to not lose track of
the latest previously used migration port.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime
2023-11-14 14:02 ` [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime Fiona Ebner
@ 2023-11-15 8:51 ` Fabian Grünbichler
0 siblings, 0 replies; 13+ messages in thread
From: Fabian Grünbichler @ 2023-11-15 8:51 UTC (permalink / raw)
To: Proxmox VE development discussion
On November 14, 2023 3:02 pm, Fiona Ebner wrote:
> For QEMU migration via TCP, there's a bit of time between port
> reservation and usage, because currently, the port needs to be
> reserved before invoking a fork, where the systemd scope needs to be
> set up and swtpm might need to be started before the QEMU binary can
> be invoked and actually use the port.
>
> Not bumping too much, because mass migration with many small VMs might
> need to re-use the ports rather quickly (there's only 50 ports).
>
> The other two usages of the function are for VNC and SPICE, with
> 100 and 999 ports respectively. And for those, ports are usually used
> for longer than 30 seconds anyways, so the higher expire time should
> be fine.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
since it's an RFC:
Acked-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> src/PVE/Tools.pm | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/PVE/Tools.pm b/src/PVE/Tools.pm
> index b3af2c6..4d018e9 100644
> --- a/src/PVE/Tools.pm
> +++ b/src/PVE/Tools.pm
> @@ -918,7 +918,7 @@ sub next_unused_port {
>
> my $code = sub {
>
> - my $expiretime = 5;
> + my $expiretime = 30;
> my $ctime = time();
>
> my $ports = {};
> --
> 2.39.2
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime
2023-11-14 14:13 ` Fiona Ebner
@ 2023-11-15 8:51 ` Fabian Grünbichler
2023-11-15 10:16 ` Fiona Ebner
0 siblings, 1 reply; 13+ messages in thread
From: Fabian Grünbichler @ 2023-11-15 8:51 UTC (permalink / raw)
To: Proxmox VE development discussion
On November 14, 2023 3:13 pm, Fiona Ebner wrote:
> Am 14.11.23 um 15:02 schrieb Fiona Ebner:
>> For QEMU migration via TCP, there's a bit of time between port
>> reservation and usage, because currently, the port needs to be
>> reserved before doing a fork, where the systemd scope needs to be set
>> up and swtpm might need to be started before the QEMU binary can be
>> invoked and actually use the port.
>>
>> To improve the situation, get the latest port recorded in the
>> reservation file and start trying from the next port, wrapping around
>> when hitting the end. Drastically reduces the chances to run into a
>> conflict, because after a given port reservation, all other ports are
>> tried first before returning to that port.
>
> Sorry, this is not true. It can be that in the meantime, a port for a
> different range is reserved and that will remove the reservation for the
> port in the migration range if expired. So we'd need to change the code
> to remove only reservations from the current range to not lose track of
> the latest previously used migration port.
the whole thing would also still be racy anyway across processes, right?
not sure it's worth the additional effort compared to the other patches
then.. if those are not enough (i.e., we still get real-world reports)
then the "increase expiry further + explicit release" part could still
be implemented as follow-up..
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together
2023-11-14 14:02 ` [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together Fiona Ebner
@ 2023-11-15 8:55 ` Fabian Grünbichler
2023-11-15 10:12 ` Wolfgang Bumiller
0 siblings, 1 reply; 13+ messages in thread
From: Fabian Grünbichler @ 2023-11-15 8:55 UTC (permalink / raw)
To: Proxmox VE development discussion
On November 14, 2023 3:02 pm, Fiona Ebner wrote:
> Currently, volume activation, PCI reservation and resetting systemd
> scope happen in between and the 5 second expiretime used for port
> reservation might not be enough.
>
> Still not ideal, because entering systemd scope and maybe starting
> swtpm still happen after reservation before the QEMU binary can be
> invoked and actually use the port, but the reservation needs to happen
> outside of the fork, because the result is used there too.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Acked-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
we could move the whole statefile handling further down, but then some
additional side-effects need to be taken care of/refactored, this seems
like a minimal-invasive version for the uncommon (insecure) case.
> ---
> PVE/QemuServer.pm | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index c465fb6f..aeaea8eb 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -5697,6 +5697,9 @@ sub vm_start_nolock {
> return $migration_ip;
> };
>
> + # helper to move port reservation and usage closer together to avoid expiry (bug #4501)
> + my $append_tcp_migration_cmdline;
> +
> if ($statefile) {
> if ($statefile eq 'tcp') {
> my $migrate = $res->{migrate} = { proto => 'tcp' };
> @@ -5717,12 +5720,13 @@ sub vm_start_nolock {
> $migrate->{addr} = "[$migrate->{addr}]" if Net::IP::ip_is_ipv6($migrate->{addr});
> }
>
> - my $pfamily = PVE::Tools::get_host_address_family($nodename);
> - $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
> - $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
> - push @$cmd, '-incoming', $migrate->{uri};
> - push @$cmd, '-S';
> -
nit: I'd maybe add another comment here, maybe something like
# delay migration port reservation to prevent expiry before binding
> + $append_tcp_migration_cmdline = sub {
> + my $pfamily = PVE::Tools::get_host_address_family($nodename);
> + $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
> + $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
> + push @$cmd, '-incoming', $migrate->{uri};
> + push @$cmd, '-S';
> + };
> } elsif ($statefile eq 'unix') {
> # should be default for secure migrations as a ssh TCP forward
> # tunnel is not deterministic reliable ready and fails regurarly
> @@ -5840,6 +5844,10 @@ sub vm_start_nolock {
> $systemd_properties{timeout} = 10 if $statefile; # setting up the scope shoul be quick
>
> my $run_qemu = sub {
> + # sets the port+uri for $res->{migrate} which is printed below and part of the result, so
> + # needs to happen outside of the fork.
> + $append_tcp_migration_cmdline->() if $append_tcp_migration_cmdline;
> +
> PVE::Tools::run_fork sub {
> PVE::Systemd::enter_systemd_scope($vmid, "Proxmox VE VM $vmid", %systemd_properties);
>
> --
> 2.39.2
>
>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel@lists.proxmox.com
> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
>
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together
2023-11-15 8:55 ` Fabian Grünbichler
@ 2023-11-15 10:12 ` Wolfgang Bumiller
2023-11-15 10:22 ` Fiona Ebner
0 siblings, 1 reply; 13+ messages in thread
From: Wolfgang Bumiller @ 2023-11-15 10:12 UTC (permalink / raw)
To: Fabian Grünbichler; +Cc: Proxmox VE development discussion
On Wed, Nov 15, 2023 at 09:55:22AM +0100, Fabian Grünbichler wrote:
> On November 14, 2023 3:02 pm, Fiona Ebner wrote:
> > Currently, volume activation, PCI reservation and resetting systemd
> > scope happen in between and the 5 second expiretime used for port
> > reservation might not be enough.
> >
> > Still not ideal, because entering systemd scope and maybe starting
> > swtpm still happen after reservation before the QEMU binary can be
> > invoked and actually use the port, but the reservation needs to happen
> > outside of the fork, because the result is used there too.
> >
> > Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>
> Acked-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>
> we could move the whole statefile handling further down, but then some
> additional side-effects need to be taken care of/refactored, this seems
> like a minimal-invasive version for the uncommon (insecure) case.
>
> > ---
> > PVE/QemuServer.pm | 20 ++++++++++++++------
> > 1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> > index c465fb6f..aeaea8eb 100644
> > --- a/PVE/QemuServer.pm
> > +++ b/PVE/QemuServer.pm
> > @@ -5697,6 +5697,9 @@ sub vm_start_nolock {
> > return $migration_ip;
> > };
> >
> > + # helper to move port reservation and usage closer together to avoid expiry (bug #4501)
> > + my $append_tcp_migration_cmdline;
> > +
> > if ($statefile) {
> > if ($statefile eq 'tcp') {
> > my $migrate = $res->{migrate} = { proto => 'tcp' };
> > @@ -5717,12 +5720,13 @@ sub vm_start_nolock {
> > $migrate->{addr} = "[$migrate->{addr}]" if Net::IP::ip_is_ipv6($migrate->{addr});
> > }
> >
> > - my $pfamily = PVE::Tools::get_host_address_family($nodename);
> > - $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
> > - $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
> > - push @$cmd, '-incoming', $migrate->{uri};
> > - push @$cmd, '-S';
> > -
>
> nit: I'd maybe add another comment here, maybe something like
>
> # delay migration port reservation to prevent expiry before binding
What about adding an option to `next_migrate_port()` to actually return
the open socket to keep the reservation?
Also, did we consider passing the file descriptor through to qemu via
`-incoming fd:$number`?
>
> > + $append_tcp_migration_cmdline = sub {
> > + my $pfamily = PVE::Tools::get_host_address_family($nodename);
> > + $migrate->{port} = PVE::Tools::next_migrate_port($pfamily);
> > + $migrate->{uri} = "tcp:$migrate->{addr}:$migrate->{port}";
> > + push @$cmd, '-incoming', $migrate->{uri};
> > + push @$cmd, '-S';
> > + };
> > } elsif ($statefile eq 'unix') {
> > # should be default for secure migrations as a ssh TCP forward
> > # tunnel is not deterministic reliable ready and fails regurarly
> > @@ -5840,6 +5844,10 @@ sub vm_start_nolock {
> > $systemd_properties{timeout} = 10 if $statefile; # setting up the scope shoul be quick
> >
> > my $run_qemu = sub {
> > + # sets the port+uri for $res->{migrate} which is printed below and part of the result, so
> > + # needs to happen outside of the fork.
> > + $append_tcp_migration_cmdline->() if $append_tcp_migration_cmdline;
> > +
> > PVE::Tools::run_fork sub {
> > PVE::Systemd::enter_systemd_scope($vmid, "Proxmox VE VM $vmid", %systemd_properties);
> >
> > --
> > 2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime
2023-11-15 8:51 ` Fabian Grünbichler
@ 2023-11-15 10:16 ` Fiona Ebner
2023-11-15 10:27 ` Fabian Grünbichler
0 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-15 10:16 UTC (permalink / raw)
To: Proxmox VE development discussion, Fabian Grünbichler
Am 15.11.23 um 09:51 schrieb Fabian Grünbichler:
> On November 14, 2023 3:13 pm, Fiona Ebner wrote:
>> Am 14.11.23 um 15:02 schrieb Fiona Ebner:
>>> For QEMU migration via TCP, there's a bit of time between port
>>> reservation and usage, because currently, the port needs to be
>>> reserved before doing a fork, where the systemd scope needs to be set
>>> up and swtpm might need to be started before the QEMU binary can be
>>> invoked and actually use the port.
>>>
>>> To improve the situation, get the latest port recorded in the
>>> reservation file and start trying from the next port, wrapping around
>>> when hitting the end. Drastically reduces the chances to run into a
>>> conflict, because after a given port reservation, all other ports are
>>> tried first before returning to that port.
>>
>> Sorry, this is not true. It can be that in the meantime, a port for a
>> different range is reserved and that will remove the reservation for the
>> port in the migration range if expired. So we'd need to change the code
>> to remove only reservations from the current range to not lose track of
>> the latest previously used migration port.
>
> the whole thing would also still be racy anyway across processes, right?
> not sure it's worth the additional effort compared to the other patches
> then.. if those are not enough (i.e., we still get real-world reports)
> then the "increase expiry further + explicit release" part could still
> be implemented as follow-up..
>
No, it's not racy. The reserved ports are recorded in a file while
taking a lock, so each process will see what the others have last used.
My question is if the explicit release isn't much more effort than the
round-robin-style approach here, because it puts the burden on the
callers and you need a good way to actually check if the port is now
used successfully (without creating new races!) and a new helper for
removing the reservation. (That said, with round-robin we would need to
remember which range a port was for if we ever want to support
overlapping ranges).
As long as you have competition for early ports, you just need one
instance where the time between reservation and usage is longer than the
expiretime and you're very likely to hit the issue (except another
earlier port is free again). With round-robin, you need such an instance
and have all(!) other ports reserved/used in the meantime.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together
2023-11-15 10:12 ` Wolfgang Bumiller
@ 2023-11-15 10:22 ` Fiona Ebner
2023-11-15 11:21 ` Wolfgang Bumiller
0 siblings, 1 reply; 13+ messages in thread
From: Fiona Ebner @ 2023-11-15 10:22 UTC (permalink / raw)
To: Proxmox VE development discussion, Wolfgang Bumiller,
Fabian Grünbichler
Am 15.11.23 um 11:12 schrieb Wolfgang Bumiller:
>
> What about adding an option to `next_migrate_port()` to actually return
> the open socket to keep the reservation?
>
> Also, did we consider passing the file descriptor through to qemu via
> `-incoming fd:$number`?
>
Sounds promising :) We do invoke QEMU after forking. Is there any
pitfall with that and passing the fd? Or is it enough if we simply don't
touch it or close it in the parent?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime
2023-11-15 10:16 ` Fiona Ebner
@ 2023-11-15 10:27 ` Fabian Grünbichler
0 siblings, 0 replies; 13+ messages in thread
From: Fabian Grünbichler @ 2023-11-15 10:27 UTC (permalink / raw)
To: Fiona Ebner, Proxmox VE development discussion
On November 15, 2023 11:16 am, Fiona Ebner wrote:
> Am 15.11.23 um 09:51 schrieb Fabian Grünbichler:
>> On November 14, 2023 3:13 pm, Fiona Ebner wrote:
>>> Am 14.11.23 um 15:02 schrieb Fiona Ebner:
>>>> For QEMU migration via TCP, there's a bit of time between port
>>>> reservation and usage, because currently, the port needs to be
>>>> reserved before doing a fork, where the systemd scope needs to be set
>>>> up and swtpm might need to be started before the QEMU binary can be
>>>> invoked and actually use the port.
>>>>
>>>> To improve the situation, get the latest port recorded in the
>>>> reservation file and start trying from the next port, wrapping around
>>>> when hitting the end. Drastically reduces the chances to run into a
>>>> conflict, because after a given port reservation, all other ports are
>>>> tried first before returning to that port.
>>>
>>> Sorry, this is not true. It can be that in the meantime, a port for a
>>> different range is reserved and that will remove the reservation for the
>>> port in the migration range if expired. So we'd need to change the code
>>> to remove only reservations from the current range to not lose track of
>>> the latest previously used migration port.
>>
>> the whole thing would also still be racy anyway across processes, right?
>> not sure it's worth the additional effort compared to the other patches
>> then.. if those are not enough (i.e., we still get real-world reports)
>> then the "increase expiry further + explicit release" part could still
>> be implemented as follow-up..
>>
>
> No, it's not racy. The reserved ports are recorded in a file while
> taking a lock, so each process will see what the others have last used.
you are of course, right, sorry for the noise - I misread the diff and
thought the new variables were local state, instead of just helper
variables inside the sub..
> My question is if the explicit release isn't much more effort than the
> round-robin-style approach here, because it puts the burden on the
> callers and you need a good way to actually check if the port is now
> used successfully (without creating new races!) and a new helper for
> removing the reservation. (That said, with round-robin we would need to
> remember which range a port was for if we ever want to support
> overlapping ranges).
yes, we'd need to convert callers to become
reserve();
do_thing_that_binds();
clear_reservation();
possibly with the clearing repeated in the error handling code path.
clearing the reservation could also just mean setting the expiry a few
seconds into the future, for example, to cover any "binding might happen
with a slight delay in a forked process" type of situations.
> As long as you have competition for early ports, you just need one
> instance where the time between reservation and usage is longer than the
> expiretime and you're very likely to hit the issue (except another
> earlier port is free again). With round-robin, you need such an instance
> and have all(!) other ports reserved/used in the meantime.
true. the only way to really fix it would be to make the reservation
actually already do the binding, and pass around the open socket like
Wolfgang suggests. if that works for Qemu, we could at least make that
behaviour opt-in and convert this particular (and most likely to be
problematic) usage?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together
2023-11-15 10:22 ` Fiona Ebner
@ 2023-11-15 11:21 ` Wolfgang Bumiller
0 siblings, 0 replies; 13+ messages in thread
From: Wolfgang Bumiller @ 2023-11-15 11:21 UTC (permalink / raw)
To: Fiona Ebner; +Cc: Proxmox VE development discussion, Fabian Grünbichler
On Wed, Nov 15, 2023 at 11:22:46AM +0100, Fiona Ebner wrote:
> Am 15.11.23 um 11:12 schrieb Wolfgang Bumiller:
> >
> > What about adding an option to `next_migrate_port()` to actually return
> > the open socket to keep the reservation?
> >
> > Also, did we consider passing the file descriptor through to qemu via
> > `-incoming fd:$number`?
> >
>
> Sounds promising :) We do invoke QEMU after forking. Is there any
> pitfall with that and passing the fd? Or is it enough if we simply don't
> touch it or close it in the parent?
We just have to explicitly remove the CLOEXEC flag from the fd before
the exec() happens.
Since we use `run_command` for the exec, I've been wondering if maybe
`run_command` itself should get an `fds => [ numbers ]` list it should
drop the CLOEXEC on before opening the subprocess and then restoring the
original flags afterwards.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-11-15 11:21 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-14 14:02 [pve-devel] [RFC qemu-server/common] fix #4501: improve port reservation for QEMU TCP migration Fiona Ebner
2023-11-14 14:02 ` [pve-devel] [RFC qemu-server 1/1] partially fix #4501: migration: start vm: move port reservation and usage closer together Fiona Ebner
2023-11-15 8:55 ` Fabian Grünbichler
2023-11-15 10:12 ` Wolfgang Bumiller
2023-11-15 10:22 ` Fiona Ebner
2023-11-15 11:21 ` Wolfgang Bumiller
2023-11-14 14:02 ` [pve-devel] [RFC common 1/2] partially fix #4501: next unused port: bump port reservation expiretime Fiona Ebner
2023-11-15 8:51 ` Fabian Grünbichler
2023-11-14 14:02 ` [pve-devel] [RFC common 2/2] fix #4501: next unused port: work around issue with too short expiretime Fiona Ebner
2023-11-14 14:13 ` Fiona Ebner
2023-11-15 8:51 ` Fabian Grünbichler
2023-11-15 10:16 ` Fiona Ebner
2023-11-15 10:27 ` Fabian Grünbichler
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox