From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <772dd636-0658-603e-af6a-b34fea151006@oderland.se>
Date: Thu, 14 Oct 2021 07:40:47 +0200
From: Josef Johansson
To: VELARTIS Philipp Dürhammer, 'pve-devel@lists.proxmox.com'
Subject: Re: [pve-devel] BUG in vlan aware bridge
List-Id: Proxmox VE development discussion

This is one of the commits touching frag_max_size from include/net/ip.h... I'd say someone should look over this and fix it :)

commit 93fdd47e52f3f869a437319db9da1ea409acc07e
Author: Herbert Xu
Date:   Sun Oct 5 12:00:22 2014 +0800

    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING

    As we may defragment the packet in IPv4 PRE_ROUTING and refragment
    it after POST_ROUTING we should save the value of frag_max_size.
    This is still very wrong as the bridge is supposed to leave the
    packets intact, meaning that the right thing to do is to use the
    original frag_list for fragmentation.

    Unfortunately we don't currently guarantee that the frag_list is
    left untouched throughout netfilter so until this changes this is
    the best we can do.

    There is also a spot in FORWARD where it appears that we can
    forward a packet without going through fragmentation, mark it
    so that we can fix it later.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index a615264cf01a..4063898cf8aa 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -404,6 +404,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
                               ETH_HLEN-ETH_ALEN);
             /* tell br_dev_xmit to continue with forwarding */
             nf_bridge->mask |= BRNF_BRIDGED_DNAT;
+            /* FIXME Need to refragment */
             ret = neigh->output(neigh, skb);
         }
         neigh_release(neigh);
@@ -459,6 +460,10 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
     struct nf_bridge_info *nf_bridge = skb->nf_bridge;
     struct rtable *rt;
     int err;
+    int frag_max_size;
+
+    frag_max_size = IPCB(skb)->frag_max_size;
+    BR_INPUT_SKB_CB(skb)->frag_max_size = frag_max_size;

     if (nf_bridge->mask & BRNF_PKT_TYPE) {
         skb->pkt_type = PACKET_OTHERHOST;
@@ -863,13 +868,19 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
     int ret;
+    int frag_max_size;

+    /* This is wrong! We should preserve the original fragment
+     * boundaries by preserving frag_list rather than refragmenting.
+     */
     if (skb->protocol == htons(ETH_P_IP) &&
         skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
         !skb_is_gso(skb)) {
+        frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
         if (br_parse_ip_options(skb))
             /* Drop invalid packet */
             return NF_DROP;
+        IPCB(skb)->frag_max_size = frag_max_size;
         ret = ip_fragment(skb, br_dev_queue_push_xmit);
     } else
         ret = br_dev_queue_push_xmit(skb);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b6c04cbcfdc5..2398369c6dda 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -305,10 +305,14 @@ struct net_bridge

 struct br_input_skb_cb {
     struct net_device *brdev;
+
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
     int igmp;
     int mrouters_only;
 #endif
+
+    u16 frag_max_size;
+
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
     bool vlan_filtered;
 #endif
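If anyone wants to poke at this themselves, here is a rough sketch of how to reproduce it and the two workarounds mentioned further down in this thread. The interface names (tap100i0, vmbr0, bond0) and the target address 192.0.2.1 are only placeholders for a typical PVE setup, so adjust them to your environment, and keep in mind that turning off bridge-nf-call-iptables also takes iptables (and with it the PVE firewall) out of the bridged path:

# Reproduce: send ICMP echoes larger than the MTU so they have to be
# fragmented (all devices at MTU 1500 in Philipp's report).
ping -s 2000 192.0.2.1

# Watch where the fragments disappear: two fragments on the tap, one
# reassembled packet on the vlan-aware bridge, nothing on the bond.
tcpdump -ni tap100i0 icmp
tcpdump -ni vmbr0 icmp
tcpdump -ni bond0 icmp

# Workaround (Philipp): keep netfilter out of the bridge path entirely.
echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables

# Partial workaround (Stoyan): also run VLAN-tagged traffic through
# netfilter so it gets refragmented on the way out.
echo 1 > /proc/sys/net/bridge/bridge-nf-filter-vlan-tagged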
Best regards
Josef Johansson

On 10/14/21 07:14, Josef Johansson wrote:
> Hi,
>
> I did some more digging searching for 'bridge-nf-call-iptables
> fragmentation'
>
> Found these forum posts:
>
> https://forum.proxmox.com/threads/net-bridge-bridge-nf-call-iptables-and-friends.64766/
>
> https://forum.proxmox.com/threads/linux-bridge-reassemble-fragmented-packets.96432/
>
> And this patch, which seems like they at least TRIED to get it fixed ;)
>
> https://lists.linuxfoundation.org/pipermail/bridge/2019-August/012185.html
>
> Best regards
> Josef Johansson
>
> On 10/13/21 16:32, VELARTIS Philipp Dürhammer wrote:
>> If you stop the pve firewall service and echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables (you stop the netfilter hook),
>> then it works for me also with tagged tap devices and a vlan aware bridge. I think it is a kernel bug.
>> What I don't understand is why not more people are reporting it...
>>
>>
>> -----Original Message-----
>> From: Josef Johansson
>> Sent: Wednesday, 13 October 2021 16:19
>> To: VELARTIS Philipp Dürhammer; 'pve-devel@lists.proxmox.com'
>> Subject: Re: AW: [pve-devel] BUG in vlan aware bridge
>>
>> Hi,
>>
>> I can confirm that s > 12000 does not work on either:
>>
>> size, tap(untagged, mtu 1500)->vlan-aware bridge(mtu 9000)->bond(mtu 9000), tap(tagged, mtu 1500)->vlan-aware bridge(mtu 9000)->bond(mtu 9000)
>>
>> s > 12000, doesn't work, doesn't work
>>
>> s > 8000, works, doesn't work
>>
>>
>> The traffic (one defragmented packet) is just dropped between bridge and tap. I tried my NOTRACK rules and they didn't have any effect.
>>
>>
>> We either have a bug in my Mellanox cards here or in the kernel. I don't think this is a normal case.
>>
>> Best regards
>> Josef Johansson
>>
>> On 10/13/21 15:53, VELARTIS Philipp Dürhammer wrote:
>>> And what happens if you use a packet size > 9000? This should still
>>> work... (because it gets fragmented)
>>>
>>> -----Original Message-----
>>> From: pve-devel On Behalf Of Josef Johansson
>>> Sent: Wednesday, 13 October 2021 13:37
>>> To: pve-devel@lists.proxmox.com
>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>
>>> Hi,
>>>
>>> AFAIK it's netfilter that is doing the defragmenting so that it can firewall.
>>>
>>> If you specify
>>>
>>> iptables -t raw -I PREROUTING -s 77.244.240.131 -j NOTRACK
>>>
>>> iptables -t raw -I PREROUTING -s 37.16.72.52 -j NOTRACK
>>>
>>> you should be able to make it ignore your packets.
>>>
>>>
>>> As a datapoint, I could ping fine from an MTU 1500 host, over MTU 9000 vlan-aware bridges with firewalls, to another MTU 1500 host.
>>>
>>> As you would assume, the packet is defragmented over MTU 9000 links and fragmented again over MTU 1500 devices.
>>>
>>> Best regards
>>> Josef Johansson
>>>
>>> On 10/13/21 11:22, VELARTIS Philipp Dürhammer wrote:
>>>> Hi,
>>>>
>>>>
>>>> Yes, I think it has nothing to do with the bonds but with the vlan aware bridge interface.
>>>>
>>>> I see this with ping -s 1500
>>>>
>>>> On the tap interface:
>>>> 11:19:35.141414 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 39999, offset 0, flags [+], proto ICMP (1), length 1500)
>>>> 37.16.72.52 > 77.244.240.131: ICMP echo request, id 2182, seq 4, length 1480
>>>> 11:19:35.141430 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype IPv4 (0x0800), length 562: (tos 0x0, ttl 64, id 39999, offset 1480, flags [none], proto ICMP (1), length 548)
>>>> 37.16.72.52 > 77.244.240.131: ip-proto-1
>>>>
>>>> On vmbr0:
>>>> 11:19:35.141442 62:47:e0:fe:f9:31 > 54:e0:32:27:6e:50, ethertype 802.1Q (0x8100), length 2046: vlan 350, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 39999, offset 0, flags [none], proto ICMP (1), length 2028)
>>>> 37.16.72.52 > 77.244.240.131: ICMP echo request, id 2182, seq 4, length 2008
>>>>
>>>> On bond0 it's gone....
>>>>
>>>> But who is in charge of fragmenting the packets normally? The bridge itself? Netfilter?
>>>>
>>>> -----Original Message-----
>>>> From: pve-devel On Behalf Of Stoyan Marinov
>>>> Sent: Wednesday, 13 October 2021 00:46
>>>> To: Proxmox VE development discussion
>>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>>
>>>> OK, I have just verified it has nothing to do with bonds. I get the same behavior with a vlan aware bridge, bridge-nf-call-iptables=1 and a regular eth0 being part of the bridge. Packets arrive fragmented on the tap, are reassembled by netfilter and then re-injected into the bridge assembled (full size).
>>>>
>>>> I did have limited success by setting net.bridge.bridge-nf-filter-vlan-tagged to 1. Now packets seem to get fragmented on the way out and back in, but there are still issues:
>>>>
>>>> 1. I'm testing with ping -s 2000 (1500 mtu everywhere) to an external box. I do see reply packets arrive on the vm nic, but ping doesn't see them. Haven't analyzed much further.
>>>> 2. While watching with tcpdump (inside the vm) I notice "ip reassembly time exceeded" messages being generated from the vm.
>>>>
>>>> I'll try to investigate a bit further tomorrow.
>>>>
>>>>> On 12 Oct 2021, at 11:26 PM, Stoyan Marinov wrote:
>>>>>
>>>>> That's an interesting observation. Now that I think about it, it could be caused by bonding and not the underlying device. When I tested this (about a year ago) I was using bonding on the mlx adapters and not using bonding on the intel ones.
>>>>>
>>>>>> On 12 Oct 2021, at 3:36 PM, VELARTIS Philipp Dürhammer wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we use HP servers with Intel cards or the standard HP NIC (I think also Intel).
>>>>>>
>>>>>> Also I see that I made a mistake:
>>>>>>
>>>>>> Setup working:
>>>>>> tapX (UNtagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> is correct.
>>>>>> (before I had also written tagged)
>>>>>>
>>>>>> it should be:
>>>>>>
>>>>>> Setup not working:
>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> Setup working:
>>>>>> tapX (untagged) <- -> vmbr0 <- - > bond0
>>>>>>
>>>>>> Setup also working:
>>>>>> tapX < - - > vmbr0v350 < -- > bond0.350 < -- > bond0
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: pve-devel On Behalf Of Stoyan Marinov
>>>>>> Sent: Tuesday, 12 October 2021 13:16
>>>>>> To: Proxmox VE development discussion
>>>>>> Subject: Re: [pve-devel] BUG in vlan aware bridge
>>>>>>
>>>>>> I'm having the very same issue with Mellanox ethernet adapters. I don't see this behavior with Intel NICs. What network cards do you have?
>>>>>>
>>>>>>> On 12 Oct 2021, at 1:48 PM, VELARTIS Philipp Dürhammer wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have been playing around for days because we have strange packet losses.
>>>>>>> Finally I can report the following (Linux 5.11.22-4-pve, Proxmox 7, all devices MTU 1500):
>>>>>>>
>>>>>>> Packets with sizes > 1500 without VLAN work well, but the moment they are tagged they are dropped by the bond device.
>>>>>>> Netfilter (set to 1) always reassembles the packets when they arrive at a bridge. But they don't get fragmented again if they are VLAN tagged, so the bond device drops them. If the bridge is NOT VLAN aware they also get fragmented and it works well.
>>>>>>>
>>>>>>> Setup not working:
>>>>>>>
>>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>>
>>>>>>> Setup working:
>>>>>>>
>>>>>>> tapX (tagged) <- -> vmbr0 <- - > bond0
>>>>>>>
>>>>>>> Setup also working:
>>>>>>>
>>>>>>> tapX < - - > vmbr0v350 < -- > bond0.350 < -- > bond0
>>>>>>>
>>>>>>> Have you got any idea where to search? I don't understand who is in charge of fragmenting packets again if they get reassembled by netfilter (and why it is not working with vlan aware bridges).
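(For completeness, a minimal sketch of the conntrack bypass Josef suggests above; the two addresses are just the test hosts from this thread, swap in your own.)

# Skip connection tracking for the two test hosts, so netfilter never
# reassembles their fragments on the bridge.
iptables -t raw -I PREROUTING -s 77.244.240.131 -j NOTRACK
iptables -t raw -I PREROUTING -s 37.16.72.52 -j NOTRACK

# Verify whether bridged traffic is being handed to iptables at all.
cat /proc/sys/net/bridge/bridge-nf-call-iptables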