* [PATCH v2 http-server 0/1] fix pveproxy OOM in websocket and spice proxy handlers
@ 2026-04-13 12:56 Kefu Chai
From: Kefu Chai @ 2026-04-13 12:56 UTC (permalink / raw)
To: pve-devel
pveproxy can be OOM-killed when it forwards a WebSocket or SPICE
connection and the backend can't keep up with the client [1]. This is
easy to trigger with PDM cross-cluster migration to LVM-thin storage,
since LVM-thin's zero-on-alloc behaviour makes writes to
newly-allocated extents much slower than steady-state, but the same
bug applies to any remote migration, to the VM/node console sessions
that go through the same proxy path, and to SPICE sessions that carry
bulk data such as USB passthrough.
The wbuf_max we already set on the backend handle does not actually
help. AnyEvent only checks it inside the `if (!$self->{_ww})` guard in
_drain_wbuf:

    if (!$self->{_ww} && length $self->{wbuf}) {
        ...
        if (defined $self->{wbuf_max} && $self->{wbuf_max} < ...) {
            $self->_error(Errno::ENOSPC, 1);
        }
    }

So once the first EAGAIN installs the write watcher (_ww), all
subsequent push_write calls return immediately without ever reaching
the check, and wbuf grows without bound. And as Fabian pointed out,
even if the check did fire, it would just tear the connection down
with ENOSPC rather than apply any backpressure, so we don't really
want to rely on it anyway.
The fix is to extract a small apply_read_backpressure() helper
modelled on the inline pattern that response_stream() already uses,
and to call it from all three proxy handlers: response_stream()
(refactor only, no behaviour change), websocket_proxy(), and
handle_spice_proxy_request() (which previously had only the
ineffective wbuf_max and a TODO comment for this exact problem). The
two proxy handlers also get an on_eof flush so that data buffered
while the pause was active is not silently dropped when the backend
closes mid-transfer.
A side effect is that while reads are paused, any in-band control
messages multiplexed on the same channel are also delayed. For
WebSocket that means ping/pong frames; for SPICE it is whatever
protocol-level keepalives the client uses. In normal operation this is
imperceptible: 640 KB drains in single-digit milliseconds even on
first-time LVM-thin allocations, well within any realistic ping
timeout. Only if the backend stalls completely does the pause last
long enough for the client to give up, and in that case a single
connection times out gracefully, which is still strictly better than
the previous behaviour of OOM-killing pveproxy and taking down every
session on the node.
Tested with a synthetic AnyEvent script that pushes a fast writer
through a proxy into a slow reader. Without backpressure the proxy
write buffer grows to ~1.4 GB in 5 seconds (2254x the limit); with the
fix it stays bounded just over the 640 KB limit, as expected.
Changes since v1 [2] (thanks @Fabian for the review):
- broaden the scope, not LVM-thin specific
- handle_spice_proxy_request() now gets the same fix; it had the bug
  too, and SPICE can carry bulk data via USB passthrough
- pull the helper out into apply_read_backpressure() so that
  response_stream() uses it too instead of carrying its own copy of
  the same pattern
- on_eof in both proxy handlers now flushes whatever is left in rbuf
  through the reader before tearing down, so the tail of a bulk
  transfer isn't dropped when the backend closes while reads are
  paused
- drop an old TODO in response_stream() about whether wbuf_max should
  be used as the threshold; the wbuf_max investigation in this patch
  settles it
[1] https://bugzilla.proxmox.com/show_bug.cgi?id=7483
[2] https://lore.proxmox.com/pve-devel/20260412111209.3960421-1-k.chai@proxmox.com/
Kefu Chai (1):
fix #7483: apiserver: add backpressure to proxy handlers
src/PVE/APIServer/AnyEvent.pm | 154 ++++++++++++++++++++++++++--------
1 file changed, 121 insertions(+), 33 deletions(-)
--
2.47.3
* [PATCH v2 http-server] fix #7483: apiserver: add backpressure to proxy handlers
From: Kefu Chai @ 2026-04-13 12:56 UTC (permalink / raw)
To: pve-devel
When pveproxy forwards a WebSocket or SPICE connection to a backend,
incoming data can arrive faster than the backend consumes it. With no
flow control, pveproxy buffers the excess in memory and can be
OOM-killed. This is easy to trigger with PDM cross-cluster migration
to LVM-thin storage because of LVM-thin's zero-on-alloc behaviour:
writing to newly-allocated extents is much slower than steady-state,
so the receiving side cannot keep up with the network transfer rate.
In principle this affects any remote migration, the VM/node console
sessions served through the same proxy path, and SPICE sessions that
carry bulk data such as USB passthrough.
The existing wbuf_max on the backend handle does not help. AnyEvent
only checks it inside the `if (!$self->{_ww})` guard in _drain_wbuf,
so once the first EAGAIN installs a write watcher, subsequent
push_write calls return immediately without ever reaching the check
and wbuf grows without bound. Even if the check did fire, it would
raise ENOSPC and tear the connection down rather than apply
backpressure.
Add a shared apply_read_backpressure() helper that pauses reads on
the source handle when the destination wbuf exceeds a limit, and
resumes via an on_drain callback once it empties. Use it from:
* response_stream(), replacing the existing inline backpressure block
  (no behaviour change)
* websocket_proxy(), which had no backpressure at all
* handle_spice_proxy_request(), which had only an ineffective wbuf_max
  and a TODO comment for this exact problem
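To make the helper's control flow concrete, here is a rough Python
transliteration of the pattern (illustrative only: the Handle class
models just the on_read/on_drain registration the helper touches, and
none of these names come from AnyEvent itself):

```python
class Handle:
    # models only the two AnyEvent::Handle methods the helper relies on
    def __init__(self):
        self.on_read_cb = None
        self.on_drain_cb = None

    def on_read(self, cb=None):   # calling with no argument unregisters,
        self.on_read_cb = cb      # like $hdl->on_read() in AnyEvent

    def on_drain(self, cb):
        self.on_drain_cb = cb

    def drained(self):            # the event loop would fire this when
        if self.on_drain_cb:      # the write buffer empties
            self.on_drain_cb(self)

def apply_read_backpressure(read_hdl, write_hdl, on_read_cb):
    read_hdl.on_read()                        # pause: drop the reader
    prev_on_drain = write_hdl.on_drain_cb
    def resume(wrhdl):
        read_hdl.on_read(on_read_cb)          # resume reading on drain
        if prev_on_drain:                     # chain any prior handler
            wrhdl.on_drain(prev_on_drain)
            prev_on_drain(wrhdl)
    write_hdl.on_drain(resume)

src, dst = Handle(), Handle()
reader_cb = lambda hdl: None                  # placeholder reader callback
src.on_read(reader_cb)
apply_read_backpressure(src, dst, reader_cb)  # src stops reading ...
dst.drained()                                 # ... until dst drains
print(src.on_read_cb is reader_cb)            # → True
```

The chaining of a pre-existing on_drain handler mirrors what the Perl
helper does so that response_stream()'s own on_drain logic is not
clobbered when the helper is shared.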
In addition, fix on_eof in websocket_proxy() and
handle_spice_proxy_request() to drain any data still in rbuf through
the reader before tearing down. Without this the tail of a bulk
transfer could be silently dropped if the backend closes while reads
are paused.
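The flush amounts to running the reader over the residual buffer
before disconnecting. A minimal Python sketch of the idea (FakeHandle,
reader, and the 4 KiB chunk size are all made up for illustration and
do not appear in the real code):

```python
class FakeHandle:
    def __init__(self, rbuf):
        self.rbuf = rbuf          # bytes still sitting in the read buffer
        self.forwarded = b""      # what the reader has pushed onward

def reader(hdl, no_backpressure=False):
    # stands in for $hdlreader/$proxyhdlreader: forward up to 4 KiB per call
    chunk, hdl.rbuf = hdl.rbuf[:4096], hdl.rbuf[4096:]
    hdl.forwarded += chunk

closed = []

def on_eof(hdl):
    # flush residual data through the reader before tearing down, in
    # "no backpressure" mode so the flush itself cannot pause again
    while hdl.rbuf:
        reader(hdl, no_backpressure=True)
    closed.append(True)           # stands in for client_do_disconnect()

h = FakeHandle(b"x" * 10_000)
on_eof(h)
print(len(h.forwarded), closed)   # → 10000 [True]
```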
A side effect of pausing reads is that any in-band control messages
multiplexed on the same channel are also delayed. For WebSocket this
means ping/pong frames in the paused direction sit behind the queued
bulk data; for SPICE it means whatever protocol-level keepalives the
client uses. In normal operation this is imperceptible, since 640 KB
drains in single-digit milliseconds even on first-time LVM-thin
allocations, well within any realistic ping timeout. Only if the
backend stalls completely does the pause last long enough for the
client to give up, and in that case a single connection times out
gracefully, which is strictly better than the previous behaviour of
OOM-killing pveproxy and taking down every session on the node.
Also drop a stale TODO in response_stream() that asked whether
$reqhdl->{wbuf_max} should be used as the backpressure threshold;
the wbuf_max investigation above answers it: no, wbuf_max and the
backpressure threshold serve different purposes and should stay
independent.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
src/PVE/APIServer/AnyEvent.pm | 148 ++++++++++++++++++++++++++--------
1 file changed, 113 insertions(+), 35 deletions(-)
diff --git a/src/PVE/APIServer/AnyEvent.pm b/src/PVE/APIServer/AnyEvent.pm
index 915d678..666c3b0 100644
--- a/src/PVE/APIServer/AnyEvent.pm
+++ b/src/PVE/APIServer/AnyEvent.pm
@@ -206,6 +206,26 @@ sub finish_response {
}
}
+# pause reads on $read_hdl until $write_hdl's write buffer drains, then
+# re-register $on_read_cb via an on_drain callback. The caller decides
+# when to apply backpressure based on its own threshold and must
+# liveness-check at the top of $on_read_cb. Chains with any existing
+# on_drain handler on $write_hdl.
+sub apply_read_backpressure {
+    my ($read_hdl, $write_hdl, $on_read_cb) = @_;
+
+    $read_hdl->on_read();
+    my $prev_on_drain = $write_hdl->{on_drain};
+    $write_hdl->on_drain(sub {
+        my ($wrhdl) = @_;
+        $read_hdl->on_read($on_read_cb);
+        if ($prev_on_drain) {
+            $wrhdl->on_drain($prev_on_drain);
+            $prev_on_drain->($wrhdl);
+        }
+    });
+}
+
sub response_stream {
my ($self, $reqstate, $stream_fh) = @_;
@@ -222,8 +242,6 @@ sub response_stream {
my $wbuf_len = length($reqhdl->{wbuf});
my $rbuf_len = length($hdl->{rbuf});
- # TODO: Take into account $reqhdl->{wbuf_max} ? Right now
- # that's unbounded, so just assume $buf_size
my $to_read = $buf_size - $wbuf_len;
$to_read = $rbuf_len if $rbuf_len < $to_read;
if ($to_read > 0) {
@@ -241,20 +259,9 @@ sub response_stream {
# apply backpressure so we don't accept any more data into buffer if the client isn't
# downloading fast enough. Note: read_size can double upon read, and we also need to account
- # for one more read after# start_read, so multiply by 4
+ # for one more read after start_read, so multiply by 4
if ($rbuf_len + $hdl->{read_size} * 4 > $buf_size) {
- # stop reading until write buffer is empty
- $hdl->on_read();
- my $prev_on_drain = $reqhdl->{on_drain};
- $reqhdl->on_drain(sub {
- my ($wrhdl) = @_;
- # on_drain called because write buffer is empty, continue reading
- $hdl->on_read($on_read);
- if ($prev_on_drain) {
- $wrhdl->on_drain($prev_on_drain);
- $prev_on_drain->($wrhdl);
- }
- });
+ apply_read_backpressure($hdl, $reqhdl, $on_read);
}
};
@@ -581,14 +588,25 @@ sub websocket_proxy {
$self->dprint("CONNECTed to '$remhost:$remport'");
+ my $wbuf_limit = $max_payload_size * 5;
+
+ # forward-declare so the on_eof closures below can reference
+ # the reader callbacks that are defined further down.
+ my $proxyhdlreader;
+ my $hdlreader;
+
$reqstate->{proxyhdl} = AnyEvent::Handle->new(
fh => $fh,
rbuf_max => $max_payload_size,
- wbuf_max => $max_payload_size * 5,
timeout => 5,
on_eof => sub {
my ($hdl) = @_;
eval {
+ # flush any frames buffered by the backpressure
+ # pause before tearing down; without this, the
+ # tail of a bulk transfer can be silently dropped
+ # if the backend closes while reads are paused.
+ $proxyhdlreader->($hdl, 1) while length($hdl->{rbuf});
$self->log_aborted_request($reqstate);
$self->client_do_disconnect($reqstate);
};
@@ -604,21 +622,26 @@ sub websocket_proxy {
},
);
- my $proxyhdlreader = sub {
- my ($hdl) = @_;
+ $proxyhdlreader = sub {
+ my ($hdl, $no_backpressure) = @_;
+
+ my $clienthdl = $reqstate->{hdl};
+ return if !$clienthdl;
my $len = length($hdl->{rbuf});
my $data =
substr($hdl->{rbuf}, 0, $len > $max_payload_size ? $max_payload_size : $len,
'');
- my $string = $encode->(\$data);
+ $clienthdl->push_write($encode->(\$data));
- $reqstate->{hdl}->push_write($string) if $reqstate->{hdl};
+ if (!$no_backpressure && length($clienthdl->{wbuf}) > $wbuf_limit) {
+ apply_read_backpressure($hdl, $clienthdl, $proxyhdlreader);
+ }
};
- my $hdlreader = sub {
- my ($hdl) = @_;
+ $hdlreader = sub {
+ my ($hdl, $no_backpressure) = @_;
while (my $len = length($hdl->{rbuf})) {
return if $len < 2;
@@ -672,7 +695,17 @@ sub websocket_proxy {
}
if ($opcode == 1 || $opcode == 2) {
- $reqstate->{proxyhdl}->push_write($payload) if $reqstate->{proxyhdl};
+ my $proxyhdl = $reqstate->{proxyhdl};
+ if ($proxyhdl) {
+ $proxyhdl->push_write($payload);
+ if (
+ !$no_backpressure
+ && length($proxyhdl->{wbuf}) > $wbuf_limit
+ ) {
+ apply_read_backpressure($hdl, $proxyhdl, $hdlreader);
+ return;
+ }
+ }
} elsif ($opcode == 8) {
my $statuscode = unpack("n", $payload);
$self->dprint("websocket received close. status code: '$statuscode'");
@@ -700,7 +733,19 @@ sub websocket_proxy {
$reqstate->{proxyhdl}->on_read($proxyhdlreader);
$reqstate->{hdl}->on_read($hdlreader);
- # todo: use stop_read/start_read if write buffer grows to much
+ # override the client handle's on_eof for the websocket
+ # session so frames buffered by the backpressure pause are
+ # drained before tearing down. The handle's default on_eof
+ # (set at accept time) does not know about $hdlreader.
+ $reqstate->{hdl}->on_eof(sub {
+ my ($hdl) = @_;
+ eval {
+ $hdlreader->($hdl, 1) if length($hdl->{rbuf}) >= 2;
+ $self->log_aborted_request($reqstate);
+ $self->client_do_disconnect($reqstate);
+ };
+ if (my $err = $@) { syslog('err', $err); }
+ });
# FIXME: remove protocol in PVE/PMG 8.x
#
@@ -1098,7 +1143,8 @@ sub handle_spice_proxy_request {
}
$reqstate->{hdl}->timeout(0);
- $reqstate->{hdl}->wbuf_max(64 * 10 * 1024);
+
+ my $wbuf_limit = 64 * 10 * 1024;
my $remhost = $remip ? $remip : "localhost";
my $remport = $remip ? 3128 : $spiceport;
@@ -1108,14 +1154,22 @@ sub handle_spice_proxy_request {
or die "connect to '$remhost:$remport' failed: $!";
$self->dprint("CONNECTed to '$remhost:$remport'");
+
+ # forward-declare so the on_eof closure below can reference
+ # the reader callback that is defined further down.
+ my $proxyhdlreader;
+
$reqstate->{proxyhdl} = AnyEvent::Handle->new(
fh => $fh,
rbuf_max => 64 * 1024,
- wbuf_max => 64 * 10 * 1024,
timeout => 5,
on_eof => sub {
my ($hdl) = @_;
eval {
+ # flush anything buffered by the backpressure
+ # pause before tearing down, so bulk data (e.g.,
+ # SPICE USB passthrough) is not silently dropped.
+ $proxyhdlreader->($hdl, 1) while length($hdl->{rbuf});
$self->log_aborted_request($reqstate);
$self->client_do_disconnect($reqstate);
};
@@ -1131,24 +1185,37 @@ sub handle_spice_proxy_request {
},
);
- my $proxyhdlreader = sub {
- my ($hdl) = @_;
+ $proxyhdlreader = sub {
+ my ($hdl, $no_backpressure) = @_;
+
+ my $clienthdl = $reqstate->{hdl};
+ return if !$clienthdl;
my $len = length($hdl->{rbuf});
my $data = substr($hdl->{rbuf}, 0, $len, '');
- #print "READ1 $len\n";
- $reqstate->{hdl}->push_write($data) if $reqstate->{hdl};
+ $clienthdl->push_write($data);
+
+ if (!$no_backpressure && length($clienthdl->{wbuf}) > $wbuf_limit) {
+ apply_read_backpressure($hdl, $clienthdl, $proxyhdlreader);
+ }
};
- my $hdlreader = sub {
- my ($hdl) = @_;
+ my $hdlreader;
+ $hdlreader = sub {
+ my ($hdl, $no_backpressure) = @_;
+
+ my $proxyhdl = $reqstate->{proxyhdl};
+ return if !$proxyhdl;
my $len = length($hdl->{rbuf});
my $data = substr($hdl->{rbuf}, 0, $len, '');
- #print "READ0 $len\n";
- $reqstate->{proxyhdl}->push_write($data) if $reqstate->{proxyhdl};
+ $proxyhdl->push_write($data);
+
+ if (!$no_backpressure && length($proxyhdl->{wbuf}) > $wbuf_limit) {
+ apply_read_backpressure($hdl, $proxyhdl, $hdlreader);
+ }
};
my $proto = $reqstate->{proto} ? $reqstate->{proto}->{str} : 'HTTP/1.0';
@@ -1158,7 +1225,18 @@ sub handle_spice_proxy_request {
$reqstate->{proxyhdl}->on_read($proxyhdlreader);
$reqstate->{hdl}->on_read($hdlreader);
- # todo: use stop_read/start_read if write buffer grows to much
+ # override the client handle's on_eof for the spice
+ # session so bytes buffered by the backpressure pause
+ # are drained before tearing down.
+ $reqstate->{hdl}->on_eof(sub {
+ my ($hdl) = @_;
+ eval {
+ $hdlreader->($hdl, 1) while length($hdl->{rbuf});
+ $self->log_aborted_request($reqstate);
+ $self->client_do_disconnect($reqstate);
+ };
+ if (my $err = $@) { syslog('err', $err); }
+ });
# a response must be followed by an empty line
my $res = "$proto 200 OK\015\012\015\012";
--
2.47.3