public inbox for pve-devel@lists.proxmox.com
* [pve-devel] is somebody working on nftables ? (I had scalability problem with big host)
@ 2022-03-11  8:42 DERUMIER, Alexandre
  2022-03-11 10:06 ` Wolfgang Bumiller
  0 siblings, 1 reply; 3+ messages in thread
From: DERUMIER, Alexandre @ 2022-03-11  8:42 UTC (permalink / raw)
  To: pve-devel; +Cc: w.bumiller

Hi,
I would like to know if somebody is already working on nftables?

Recently, I ran into scalability problems on big hosts with a lot of VM
interfaces.

This was a host with 500 VMs with 3 interfaces per VM (so 1500 tap
interfaces + 1500 fwbr + 1500)

The problems:

- ebtables-restore-legacy is not able to import a big ruleset (it seems
to work with ebtables-restore-nft).
https://bugzilla.proxmox.com/show_bug.cgi?id=3909

- pve-firewall rule generation takes 100% CPU for 5s (on a new 3 GHz
EPYC server); iptables-save/restore is slow too (but works). With
pve-firewall running at a 10s interval, the pve-firewall process is
running at 100% CPU almost all the time.

- with the current setup of 1 fwbr per interface, when a broadcast (like
ARP) goes to the main bridge, the packet is duplicated/forwarded to each
fwbr. The ARP forwarding only uses 1 ksoftirqd with a slow CPU path (I
checked with "perf record"). With a lot of fwbr, I had ksoftirqd at 100%
with packet loss (200 original ARP requests/s * 500 fwbr = >100,000 ARP
requests/s to handle)
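As a back-of-the-envelope sketch of that broadcast amplification (the
rates are just the numbers from above; the helper function is only
illustrative):

```python
# Broadcast amplification with one fwbr per NIC: every broadcast frame
# entering the main bridge is flooded to each fwbr, so a single
# ksoftirqd thread ends up processing rate * fwbr_count frames.

def amplified_broadcast_rate(broadcast_pps: int, fwbr_count: int) -> int:
    """Frames per second the kernel must handle after bridge flooding."""
    return broadcast_pps * fwbr_count

print(amplified_broadcast_rate(broadcast_pps=200, fwbr_count=500))  # 100000
```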



I have looked at nftables; I think that everything is ready in the
kernel now (the last missing part was bridge conntrack, available since
kernel 5.3).


Here is a basic example, with conntrack at the bridge level and the
vmap feature to match the interface.



#!/usr/sbin/nft -f

flush ruleset

table inet filter {
        chain input {
                type filter hook input priority 0;
                policy accept;
                log flags all prefix "host in"
        }
        chain forward {
                type filter hook forward priority 0;
                policy accept;
                log flags all prefix "host forward (routing)"
        }
        
        chain output {
               type filter hook output priority 0;
               policy accept;
               log flags all prefix "host output"
        }
}

table bridge filter {
        chain forward {
                type filter hook forward priority 0; policy accept;
                ct state established,related accept
                log flags all prefix "bridge forward"
                iifname vmap { tap100i0 : jump tap100i0-out , tap105i0 : jump tap105i0-out }
                oifname vmap { tap100i0 : jump tap100i0-in , tap105i0 : jump tap105i0-in }
        }

        chain tap100i0-in {
                log flags all prefix "tap100i0-in"
                ether type arp accept
                drop
        }

        chain tap100i0-out {
                log flags all prefix "tap100i0-out"
                ether type arp accept
                return
        }

        chain tap105i0-in {
                log flags all prefix "tap105i0-in"
                ether type arp accept
        }

        chain tap105i0-out {
                log flags all prefix "tap105i0-out"
                ether type arp accept
                return
        }
}
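A possible refinement (untested sketch; map names are made up): the
anonymous vmaps could become named verdict maps, so that VM start/stop
only adds or deletes single map elements instead of regenerating the
whole ruleset:

```
# Untested sketch: same matching as above, but with named verdict maps
# so per-tap entries can be added/removed without a full ruleset reload.
table bridge filter {
        map vm-out { type ifname : verdict; }
        map vm-in  { type ifname : verdict; }

        chain forward {
                type filter hook forward priority 0; policy accept;
                ct state established,related accept
                iifname vmap @vm-out
                oifname vmap @vm-in
        }
}

# at VM start (the per-tap chains must already exist):
#   nft add element bridge filter vm-out '{ "tap100i0" : jump tap100i0-out }'
#   nft add element bridge filter vm-in  '{ "tap100i0" : jump tap100i0-in }'
```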



Also, I think we could avoid the use of fwbr in some cases.

AFAIK, the fwbr is only needed for host->vm, because without fwbr, we
only see the packet in the host output chain (or forward chain for
routed setups), without the destination tap interface (only the
destination bridge and destination IP)

ex: routed setup: 10.3.94.11 -----> 10.3.94.1 (vmbr0) ---
(vmbr1) 192.168.0.1 ----- vm (192.168.0.10)

 kernel: [28341.361776] forward hostIN=eth0 OUT=vmbr1
MACSRC=f2:42:cf:23:12:88 MACDST=24:8a:07:9a:2a:f2 MACPROTO=0800
SRC=10.3.94.11 DST=192.168.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=63
ID=39355 DF PROTO=ICMP TYPE=8 CODE=0 ID=48423 SEQ=1 


With the fwbr, we can match the packet twice: in the host
output/forward, and in the bridge forward.

I'm not able to reproduce this with nftables :(


I see 1 possible clean workaround:

 - Don't set the IP address on the bridge directly, but instead on a
veth pair. This way, we see the veth source && tap destination in
bridge forward.

(Some users had a problem at Hetzner with the fwbr bridge sending
packets with its own MAC; this should avoid that bug.)

But for users that means a manual network config change, or maybe some
ifupdown2 tricks or automatic config rewriting.
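On the ifupdown2 side, it could look roughly like this (untested
sketch: interface names and addresses are made up, and it assumes
ifupdown2's veth support via link-type/veth-peer-name — check your
version):

```
# Untested sketch: host IP on a veth pair instead of on the bridge.
# veth_host carries the IP; its peer veth_br is a plain bridge port.
auto veth_host
iface veth_host inet static
        address 192.168.0.1/24
        link-type veth
        veth-peer-name veth_br

auto vmbr1
iface vmbr1 inet manual
        bridge-ports veth_br
        bridge-stp off
        bridge-fd 0
```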
  

ex: routed setup

bridge forward IN=veth_host OUT=tap100i0 MACSRC=9a:cd:90:f8:f5:3e
MACDST=04:05:df:12:85:55 MACPROTO=0800 SRC=10.3.94.11 DST=192.168.0.10
LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=10333 DF PROTO=ICMP TYPE=8 CODE=0
ID=46306 SEQ=1 



I don't know if it's possible to get the fwbr trick working, but in
this case:
 - keep a fwbr bridge, but only 1 per vmbr where an IP is set up (or
for openvswitch too). More transparent to implement at VM start/stop
(but we still need to match the packet twice).

For the other cases (pure bridging), I think we don't need fwbr at all.
This should avoid extra CPU cycles and make network throughput faster
too.


Any opinion about this? Has somebody already done tests with
nftables?






* Re: [pve-devel] is somebody working on nftables ? (I had scalability problem with big host)
  2022-03-11  8:42 [pve-devel] is somebody working on nftables ? (I had scalability problem with big host) DERUMIER, Alexandre
@ 2022-03-11 10:06 ` Wolfgang Bumiller
  2022-03-13  6:44   ` DERUMIER, Alexandre
  0 siblings, 1 reply; 3+ messages in thread
From: Wolfgang Bumiller @ 2022-03-11 10:06 UTC (permalink / raw)
  To: DERUMIER, Alexandre; +Cc: pve-devel

On Fri, Mar 11, 2022 at 08:42:29AM +0000, DERUMIER, Alexandre wrote:
> Hi,
> I would like to know if somebody is already working on nftables?

Not actively in the pve code. I only have a few different variants of
possible nft rulesets around but there's always been something missing,
even with bridge conntrack.

Particularly, I always felt like there's supposed to be an easy way to
get rid of the fw-bridges (as you noted below, those are a real pain on
big setups... though I haven't even had such large setups on my mind
tbh...)

> 
> Recently, I ran into scalability problems on big hosts with a lot of
> VM interfaces.
> 
> This was a host with 500 VMs with 3 interfaces per VM (so 1500 tap
> interfaces + 1500 fwbr + 1500)
> The problems:
> 
> - ebtables-restore-legacy is not able to import a big ruleset (it
> seems to work with ebtables-restore-nft).
> https://bugzilla.proxmox.com/show_bug.cgi?id=3909
> 
> - pve-firewall rule generation takes 100% CPU for 5s (on a new 3 GHz
> EPYC server); iptables-save/restore is slow too (but works). With
> pve-firewall running at a 10s interval, the pve-firewall process is
> running at 100% CPU almost all the time.

That seems awful :-)
I'm sure there are ways to improve this even now (have better checks for
whether it's even necessary to update the ruleset in the first place...)

Other than that, if we design the nftables rules carefully enough, an
nftables version would in theory be able to apply only partial changes.
But we'd also need an efficient way to figure out what to change.

More ideas are monitoring the /etc/pve/firewall contents and perhaps
teach pmxcfs to support change notifications via `poll` on the directory
and/or its files... (even high level fuse has support for this AFAICT,
would need to play around with it)
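For the "only update when necessary" part, a minimal sketch (helper
names are hypothetical, not the actual pve-firewall code): fingerprint
the generated ruleset and skip the expensive save/restore when nothing
changed:

```python
# Sketch: skip the expensive iptables/nft reload when the generated
# ruleset is identical to the last one applied. Names are hypothetical.
import hashlib

_last_applied = None  # digest of the ruleset we applied last

def ruleset_digest(ruleset: str) -> str:
    """Stable fingerprint of the generated ruleset text."""
    return hashlib.sha256(ruleset.encode()).hexdigest()

def needs_apply(ruleset: str) -> bool:
    """True if the ruleset differs from the last applied one."""
    global _last_applied
    digest = ruleset_digest(ruleset)
    if digest == _last_applied:
        return False
    _last_applied = digest
    return True
```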

> 
> - with the current setup of 1 fwbr per interface, when a broadcast
> (like ARP) goes to the main bridge, the packet is duplicated/forwarded
> to each fwbr. The ARP forwarding only uses 1 ksoftirqd with a slow CPU
> path (I checked with "perf record"). With a lot of fwbr, I had
> ksoftirqd at 100% with packet loss (200 original ARP requests/s * 500
> fwbr = >100,000 ARP requests/s to handle)

AFAICT we need them mostly so that for all packet paths we can tell
which VM the packet is going to/coming from.
Particularly with nftables a bunch of in/out interface information is
simply missing otherwise. Or used to, I guess it's time to check again.

> I have looked at nftables; I think that everything is ready in the
> kernel now (the last missing part was bridge conntrack, available
> since kernel 5.3).
> 
> 
> Here is a basic example, with conntrack at the bridge level and the
> vmap feature to match the interface.
> 
> 
> 
> #!/usr/sbin/nft -f
> 
> flush ruleset
> 
> table inet filter {
>         chain input {
>                 type filter hook input priority 0;
>                 policy accept;
>                 log flags all prefix "host in"
>         }
>         chain forward {
>                 type filter hook forward priority 0;
>                 policy accept;
>                 log flags all prefix "host forward (routing)"
>         }
>         
>         chain output {
>                type filter hook output priority 0;
>                policy accept;
>                log flags all prefix "host output"
>         }
> }
> 
> table bridge filter {
>         chain forward {
>                 type filter hook forward priority 0; policy accept;
>                 ct state established,related accept
>                 log flags all prefix "bridge forward"
>                 iifname vmap { tap100i0 : jump tap100i0-out , tap105i0 : jump tap105i0-out }
>                 oifname vmap { tap100i0 : jump tap100i0-in , tap105i0 : jump tap105i0-in }

One issue I see though is that this simply won't be executed for
*routed* VMs, so we'd either need to maintain the same ruleset in the
filter forward + the bridge forward chains, or require routed vs bridged
setups to be marked somehow...
Unfortunately we can't share chains across tables :|

Although perhaps some rules could be moved to the 'netdev' chains...
(at least we probably want to do MAC filtering in there, so it happens
early and independent from the rest of the flow).
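That netdev idea could look something like this (untested sketch; the
device name and MAC address are made up): an ingress chain bound to the
tap device drops spoofed source MACs before any bridge or routing
processing:

```
# Untested sketch: per-tap MAC filtering at the netdev ingress hook,
# independent of whether the packet later takes a bridged or routed path.
table netdev macfilter {
        chain tap100i0-ingress {
                type filter hook ingress device "tap100i0" priority -500;
                policy drop;
                ether saddr 24:8a:07:9a:2a:f2 accept
        }
}
```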

> Also, I think we could avoid the use of fwbr in some cases.
> 
> AFAIK, the fwbr is only needed for host->vm, because without fwbr, we
> only see the packet in the host output chain (or forward chain for
> routed setups), without the destination tap interface (only the
> destination bridge and destination IP)
> 
> ex: routed setup :10.3.94.11----->10.3.94.1(vmbr0)---
> (vmbr1)192.168.0.1-----vm(192.168.0.10)
> 
>  kernel: [28341.361776] forward hostIN=eth0 OUT=vmbr1
> MACSRC=f2:42:cf:23:12:88 MACDST=24:8a:07:9a:2a:f2 MACPROTO=0800
> SRC=10.3.94.11 DST=192.168.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=63
> ID=39355 DF PROTO=ICMP TYPE=8 CODE=0 ID=48423 SEQ=1 
> 
> 
> With the fwbr, we can match the packet twice: in the host
> output/forward, and in the bridge forward.
> 
> I'm not able to reproduce this with nftables :(
> 
> 
> I see 1 possible clean workaround:
> 
>  - Don't set the IP address on the bridge directly, but instead on a
> veth pair. This way, we see the veth source && tap destination in
> bridge forward.
> 
> (Some users had a problem at Hetzner with the fwbr bridge sending
> packets with its own MAC; this should avoid that bug.)
> 
> But for users that means a manual network config change, or maybe some
> ifupdown2 tricks or automatic config rewriting.
>   

Yeah that's a problem.

> 
> ex: routed setup
> 
> bridge forward IN=veth_host OUT=tap100i0 MACSRC=9a:cd:90:f8:f5:3e
> MACDST=04:05:df:12:85:55 MACPROTO=0800 SRC=10.3.94.11 DST=192.168.0.10
> LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=10333 DF PROTO=ICMP TYPE=8 CODE=0
> ID=46306 SEQ=1 
> 
> 
> 
> I don't know if it's possible to get the fwbr trick working, but in
> this case:
>  - keep a fwbr bridge, but only 1 per vmbr where an IP is set up (or
> for openvswitch too). More transparent to implement at VM start/stop
> (but we still need to match the packet twice).

That may actually work for both bridged & routed setups even, and
traffic would go through regardless of whether the vm communicates with
the host or with the outside. And the host input chain would still get
triggered normally as well.
However, I've hit issues with masquerading there for some reason (where
simply the presence of a 'bridge filter forward' chain caused
masquerading to just stop working despite the masquerading rule being
hit); I'll need to investigate this further.

> 
> For the other cases (pure bridging), I think we don't need fwbr at
> all. This should avoid extra CPU cycles and make network throughput
> faster too.
> 
> 
> Any opinion about this? Has somebody already done tests with
> nftables?





* Re: [pve-devel] is somebody working on nftables ? (I had scalability problem with big host)
  2022-03-11 10:06 ` Wolfgang Bumiller
@ 2022-03-13  6:44   ` DERUMIER, Alexandre
  0 siblings, 0 replies; 3+ messages in thread
From: DERUMIER, Alexandre @ 2022-03-13  6:44 UTC (permalink / raw)
  To: w.bumiller; +Cc: pve-devel

Thanks Wolfgang !

I'll do more tests on my side too with different setups:
routed, bridged, masquerade, ...

I'll also try to see if we can use a conntrack zone for each VM;
it could make it easier to migrate specific conntrack entries when
live-migrating the VM.
(It also seems possible to check the direction with ct properties.)
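The per-VM conntrack-zone idea might look roughly like this (untested
sketch; the zone numbers and hook priority are assumptions — the zone
has to be assigned before conntrack picks the packet up):

```
# Untested sketch: one conntrack zone per VM, keyed on the tap interface,
# so a VM's conntrack entries can be dumped/migrated independently.
table bridge ctzones {
        chain prerouting {
                type filter hook prerouting priority -300; policy accept;
                ct zone set iifname map { tap100i0 : 100, tap105i0 : 105 }
        }
}
```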

I'm doing a 4-day Proxmox training next week, so I'll work on nftables
again in 2 weeks.

I'll keep you in touch.

Thanks again !



On Friday, 11 March 2022 at 11:06 +0100, Wolfgang Bumiller wrote:
> [...]



