all lists on lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix
@ 2025-09-11 10:05 Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

Stefan Hanreich discovered this nftables bug which breaks the atomicity when
updating certain sets. This means that when updating a set, packets sometimes
slip through even though the existing and the incoming rules deny the packet.
A full reproducer is available here: [0].
More information in following commit messages.

The upstream series has not been applied yet, but is available here:
https://lore.kernel.org/netfilter-devel/20250910080227.11174-1-fw@strlen.de/

Nftables changed quite a bit since 6.14 so the backport was a bit tricky -- a
few Tested-by's would be nice :). If anyone needs help to reproduce this or
wants a pre-build kernel with the fix feel free to reach out!

Thanks to Stefan Hanreich for identifying the bug and providing a minimal
reproducer, and to Florian Westphal for the quick fix.

[0]:
Initial network setup:

ip netns add east
ip netns add west

ip link add east type veth peer name west

ip link set east netns east
ip link set west netns west

ip netns exec east ip a a 192.0.2.20/24 dev east

ip netns exec west ip link add br0 type bridge
ip netns exec west ip a a 192.0.2.10/24 dev br0
ip netns exec west ip link set west master br0

ip netns exec east ip link set up east
ip netns exec west ip link set up west
ip netns exec west ip link set up br0


Initial nft ruleset in network namespace 'west':

table bridge west {
  set east-ip-nomatch {
    type ipv4_addr
    flags interval;
    elements = { 0.0.0.0-192.0.2.19, 192.0.2.21-255.255.255.255 }
  }

  chain block-spoofed {
    type filter hook prerouting priority filter; policy accept;
    ip saddr @east-ip-nomatch drop
  }
}


This should block all traffic on the bridge br0, which does not have
192.0.2.20 as source IP address, but when continuously flushing /
re-creating the east-ip-nomatch set via the following commands:

$ while true; do ip netns exec west nft -j -f update_set.json; done;

# update_set.json
{
  "nftables": [
    {
      "add": {
        "set": {
          "family": "bridge",
          "table": "west",
          "name": "east-ip-nomatch",
          "type": "ipv4_addr",
          "flags": [
            "interval"
          ]
        }
      }
    },
    {
      "flush": {
        "set": {
          "family": "bridge",
          "table": "west",
          "name": "east-ip-nomatch"
        }
      }
    },
    {
      "add": {
        "element": {
          "family": "bridge",
          "table": "west",
          "name": "east-ip-nomatch",
          "elem": [
            {
              "range": ["0.0.0.0", "192.0.2.19"]
            },
            {
              "range": ["192.0.2.21", "255.255.255.255"]
            }
          ]
        }
      }
    }
  ]
}


And then continously sending ICMP packets from east to west via e.g. scapy:

$ ip netns exec east python3 -c 'from scapy.all import send, Ether, IP,
ICMP; send(IP(src="192.0.2.30", dst="192.0.2.10")/ICMP(id=2222, seq=42),
count=1000000, inter=0.001)'



Some of them pass through, as is visible via tcpdump (sometimes its
required to terminate the process for the packets to be visible, since
the buffers do not get flushed immediately):

$ ip netns exec west tcpdump -envi br0 icmp

tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length
262144 bytes
17:11:10.008758 06:a4:e8:d4:db:20 > 8a:88:57:79:f6:97, ethertype IPv4
(0x0800), length 42: (tos 0x0, ttl 64, id 1, offset 0, flags [none],
proto ICMP (1), l
ength 28)
    192.0.2.30 > 192.0.2.10: ICMP echo request, id 2222, seq 42, length 8

pve-kernel:

Gabriel Goller (5):
  kernel: backport: netfilter: nft_set_pipapo: don't check genbit from
    packetpath lookups
  kernel: backport: netfilter: nft_set_rbtree: continue traversal if
    element is inactive
  kernel: backport: netfilter: nf_tables: place base_seq in struct net
  kernel: backport: netfilter: nf_tables: make nft_set_do_lookup
    available unconditionally
  kernel: backport: netfilter: nf_tables: restart set lookup on base_seq
    change

 ...t_pipapo-don-t-check-genbit-from-pac.patch | 160 +++++++++
 ...t_rbtree-continue-traversal-if-eleme.patch |  88 +++++
 ..._tables-place-base_seq-in-struct-net.patch | 310 ++++++++++++++++++
 ...les-make-nft_set_do_lookup-available.patch |  86 +++++
 ...les-restart-set-lookup-on-base_seq-c.patch | 148 +++++++++
 5 files changed, 792 insertions(+)
 create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
 create mode 100644 patches/kernel/0015-netfilter-nft_set_rbtree-continue-traversal-if-eleme.patch
 create mode 100644 patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
 create mode 100644 patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
 create mode 100644 patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch


Summary over all repositories:
  5 files changed, 792 insertions(+), 0 deletions(-)

-- 
Generated by git-murpp 0.8.0


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
  2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

The pipapo set is a specific nftables set which holds ip-ranges. When
updating the set, the new rules are not visible (to the live ruleset)
until the next generation, so we don't need to check the genbit on
lookup.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ...t_pipapo-don-t-check-genbit-from-pac.patch | 160 ++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch

diff --git a/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
new file mode 100644
index 000000000000..be34c0efa7e9
--- /dev/null
+++ b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
@@ -0,0 +1,160 @@
+From da2f3da272763d4382e4c82d2a6d1358aed10123 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:09 +0200
+Subject: [PATCH 1/5] netfilter: nft_set_pipapo: don't check genbit from
+ packetpath lookups
+
+The pipapo set type is special in that it has two copies of its
+datastructure: one live copy containing only valid elements and one
+on-demand clone used during transaction where adds/deletes happen.
+
+This clone is not visible to the datapath.
+
+This is unlike all other set types in nftables, those all link new
+elements into their live hlist/tree.
+
+For those sets, the lookup functions must skip the new elements while the
+transaction is ongoing to ensure consistency.
+
+As the clone is shallow, removal does have an effect on the packet path:
+once the transaction enters the commit phase the 'gencursor' bit that
+determines which elements are active and which elements should be ignored
+(because they are no longer valid) is flipped.
+
+This causes the datapath lookup to ignore these elements if they are found
+during lookup.
+
+This opens up a small race window where pipapo has an inconsistent view of
+the dataset from when the transaction-cpu flipped the genbit until the
+transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
+
+cpu0					cpu1
+  has added new elements to clone
+  has marked elements as being
+  inactive in new generation
+					perform lookup in the set
+  enters commit phase:
+
+I) increments the genbit
+					A) observes new genbit
+  removes elements from the clone so
+  they won't be found anymore
+					B) lookup in datastructure
+					   can't see new elements yet,
+					   but old elements are ignored
+					   -> Only matches elements that
+					   were not changed in the
+					   transaction
+II) calls nft_pipapo_commit(), clone
+    and live pointers are swapped.
+					C New nft_lookup happening now
+				       	  will find matching elements.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P.  Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.
+
+At the same time, r1-r3 is not visible yet, because it can only be found
+in the clone.
+
+The situation persists for all lookups until after cpu0 hits II).
+
+The fix is easy: Don't check the genbit from pipapo lookup functions.
+This is possible because unlike the other set types, the new elements are
+not reachable from the live copy of the dataset.
+
+The clone/live pointer swap is enough to avoid matching on old elements
+while at the same time all new elements are exposed in one go.
+
+After this change, step B above returns a match in r1-r2.
+This is fine: r1-r2 only becomes truly invalid the moment they get freed.
+This happens after a synchronize_rcu() call and rcu read lock is held
+via netfilter hook traversal (nf_hook_slow()).
+
+Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_pipapo.c      | 21 +++++++++++++++++++--
+ net/netfilter/nft_set_pipapo_avx2.c |  4 +---
+ 2 files changed, 20 insertions(+), 5 deletions(-)
+
+diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
+index 0529e4ef7520..a386c3761cac 100644
+--- a/net/netfilter/nft_set_pipapo.c
++++ b/net/netfilter/nft_set_pipapo.c
+@@ -405,6 +405,24 @@ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules,
+  *
+  * For more details, see DOC: Theory of Operation.
+  *
++ * Unlike other set types, this uses NFT_GENMASK_ANY instead of
++ * nft_genmask_cur().
++ *
++ * This is because new (future) elements are not reachable from
++ * priv->match, they get added to priv->clone instead.
++ * When the commit phase flips the generation bitmask, the
++ * 'now old' entries are skipped but without the 'now current'
++ * elements becoming visible. Using nft_genmask_cur() thus creates
++ * inconsistent state: matching old entries get skipped but thew
++ * newly matching entries are unreachable.
++ *
++ * GENMASK will still find the 'now old' entries which ensures consistent
++ * priv->match view.
++ *
++ * nft_pipapo_commit swaps ->clone and ->match shortly after the
++ * genbit flip.  As ->clone doesn't contain the old entries in the first
++ * place, lookup will only find the now-current ones.
++ *
+  * Return: true on match, false otherwise.
+  */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+@@ -413,7 +431,6 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ 	struct nft_pipapo *priv = nft_set_priv(set);
+ 	struct nft_pipapo_scratch *scratch;
+ 	unsigned long *res_map, *fill_map;
+-	u8 genmask = nft_genmask_cur(net);
+ 	const struct nft_pipapo_match *m;
+ 	const struct nft_pipapo_field *f;
+ 	const u8 *rp = (const u8 *)key;
+@@ -471,7 +488,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ 		if (last) {
+ 			*ext = &f->mt[b].e->ext;
+ 			if (unlikely(nft_set_elem_expired(*ext) ||
+-				     !nft_set_elem_active(*ext, genmask)))
++				     !nft_set_elem_active(*ext, NFT_GENMASK_ANY)))
+ 				goto next_match;
+ 
+ 			/* Last field: we're just returning the key without
+diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
+index be7c16c79f71..675c0a9634b6 100644
+--- a/net/netfilter/nft_set_pipapo_avx2.c
++++ b/net/netfilter/nft_set_pipapo_avx2.c
+@@ -1151,7 +1151,6 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ {
+ 	struct nft_pipapo *priv = nft_set_priv(set);
+ 	struct nft_pipapo_scratch *scratch;
+-	u8 genmask = nft_genmask_cur(net);
+ 	const struct nft_pipapo_match *m;
+ 	const struct nft_pipapo_field *f;
+ 	const u8 *rp = (const u8 *)key;
+@@ -1245,8 +1244,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ 
+ 		if (last) {
+ 			*ext = &f->mt[ret].e->ext;
+-			if (unlikely(nft_set_elem_expired(*ext) ||
+-				     !nft_set_elem_active(*ext, genmask))) {
++			if (unlikely(nft_set_elem_expired(*ext))) {
+ 				ret = 0;
+ 				goto next_match;
+ 			}
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive
  2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

If a match is found in a rbtree, set the interval at the very end to
avoid the element being inactive when finishing the traversal.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ...t_rbtree-continue-traversal-if-eleme.patch | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 patches/kernel/0015-netfilter-nft_set_rbtree-continue-traversal-if-eleme.patch

diff --git a/patches/kernel/0015-netfilter-nft_set_rbtree-continue-traversal-if-eleme.patch b/patches/kernel/0015-netfilter-nft_set_rbtree-continue-traversal-if-eleme.patch
new file mode 100644
index 000000000000..9e4d4d687003
--- /dev/null
+++ b/patches/kernel/0015-netfilter-nft_set_rbtree-continue-traversal-if-eleme.patch
@@ -0,0 +1,88 @@
+From 2af0ed300431a3c5675cd6a7219424430fa9651b Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:56 +0200
+Subject: [PATCH 2/5] netfilter: nft_set_rbtree: continue traversal if element
+ is inactive
+
+When the rbtree lookup function finds a match in the rbtree, it sets the
+range start interval to a potentially inactive element.
+
+Then, after tree lookup, if the matching element is inactive, it returns
+NULL and suppresses a matching result.
+
+This is wrong and leads to false negative matches when a transaction has
+already entered the commit phase.
+
+cpu0					cpu1
+  has added new elements to clone
+  has marked elements as being
+  inactive in new generation
+					perform lookup in the set
+  enters commit phase:
+I) increments the genbit
+					A) observes new genbit
+					B) finds matching range
+					C) returns no match: found
+					range invalid in new generation
+II) removes old elements from the tree
+					C New nft_lookup happening now
+				       	  will find matching element,
+					  because it is no longer
+					  obscured by old, inactive one.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P.  Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.  It does NOT test for further matches.
+
+The situation persists for all lookups until after cpu0 hits II) after
+which r1-r3 range start node is tested for the first time.
+
+Move the "interval start is valid" check ahead so that tree traversal
+continues if the starting interval is not valid in this generation.
+
+Thanks to Stefan Hanreich for providing an initial reproducer for this
+bug.
+
+Reported-by: Stefan Hanreich <s.hanreich@proxmox.com>
+Fixes: c1eda3c6394f ("netfilter: nft_rbtree: ignore inactive matching element with no descendants")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_rbtree.c | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
+index 2e8ef16ff191..c4eb94258e24 100644
+--- a/net/netfilter/nft_set_rbtree.c
++++ b/net/netfilter/nft_set_rbtree.c
+@@ -77,7 +77,9 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
+ 			    nft_rbtree_interval_end(rbe) &&
+ 			    nft_rbtree_interval_start(interval))
+ 				continue;
+-			interval = rbe;
++			if (nft_set_elem_active(&rbe->ext, genmask) &&
++			    !nft_rbtree_elem_expired(rbe))
++				interval = rbe;
+ 		} else if (d > 0)
+ 			parent = rcu_dereference_raw(parent->rb_right);
+ 		else {
+@@ -103,8 +105,6 @@ static bool __nft_rbtree_lookup(const struct net *net, const struct nft_set *set
+ 	}
+ 
+ 	if (set->flags & NFT_SET_INTERVAL && interval != NULL &&
+-	    nft_set_elem_active(&interval->ext, genmask) &&
+-	    !nft_rbtree_elem_expired(interval) &&
+ 	    nft_rbtree_interval_start(interval)) {
+ 		*ext = &interval->ext;
+ 		return true;
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net
  2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
  4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

Move base_seq into the net structure. Both get incremented a lot nearly
at the same time in the commit.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ..._tables-place-base_seq-in-struct-net.patch | 310 ++++++++++++++++++
 1 file changed, 310 insertions(+)
 create mode 100644 patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch

diff --git a/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch b/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
new file mode 100644
index 000000000000..63a18e44e169
--- /dev/null
+++ b/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
@@ -0,0 +1,310 @@
+From 7d566006c0aa2461aa263a94e0edc73637750bab Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:09:43 +0200
+Subject: [PATCH 3/5] netfilter: nf_tables: place base_seq in struct net
+
+This will soon be read from packet path around same time as the gencursor.
+
+Both gencursor and base_seq get incremented almost at the same time, so
+it makes sense to place them in the same structure.
+
+This doesn't increase struct net size on 64bit due to padding.
+
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ include/net/netfilter/nf_tables.h |  1 -
+ include/net/netns/nftables.h      |  1 +
+ net/netfilter/nf_tables_api.c     | 64 ++++++++++++++++---------------
+ 3 files changed, 34 insertions(+), 32 deletions(-)
+
+diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
+index 803d5f1601f9..56c6698ed5bf 100644
+--- a/include/net/netfilter/nf_tables.h
++++ b/include/net/netfilter/nf_tables.h
+@@ -1913,7 +1913,6 @@ struct nftables_pernet {
+ 	struct mutex		commit_mutex;
+ 	u64			table_handle;
+ 	u64			tstamp;
+-	unsigned int		base_seq;
+ 	unsigned int		gc_seq;
+ 	u8			validate_state;
+ 	struct work_struct	destroy_work;
+diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
+index cc8060c017d5..99dd166c5d07 100644
+--- a/include/net/netns/nftables.h
++++ b/include/net/netns/nftables.h
+@@ -3,6 +3,7 @@
+ #define _NETNS_NFTABLES_H_
+ 
+ struct netns_nftables {
++	unsigned int		base_seq;
+ 	u8			gencursor;
+ };
+ 
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index a133e1c175ce..f9e7f056ea5b 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -1096,11 +1096,14 @@ nf_tables_chain_type_lookup(struct net *net, const struct nlattr *nla,
+ 	return ERR_PTR(-ENOENT);
+ }
+ 
+-static __be16 nft_base_seq(const struct net *net)
++static unsigned int nft_base_seq(const struct net *net)
+ {
+-	struct nftables_pernet *nft_net = nft_pernet(net);
++	return READ_ONCE(net->nft.base_seq);
++}
+ 
+-	return htons(nft_net->base_seq & 0xffff);
++static __be16 nft_base_seq_be16(const struct net *net)
++{
++	return htons(nft_base_seq(net) & 0xffff);
+ }
+ 
+ static const struct nla_policy nft_table_policy[NFTA_TABLE_MAX + 1] = {
+@@ -1120,7 +1123,7 @@ static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+-			   NFNETLINK_V0, nft_base_seq(net));
++			   NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -1212,7 +1215,7 @@ static int nf_tables_dump_tables(struct sk_buff *skb,
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -1983,7 +1986,7 @@ static int nf_tables_fill_chain_info(struct sk_buff *skb, struct net *net,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+-			   NFNETLINK_V0, nft_base_seq(net));
++			   NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -2084,7 +2087,7 @@ static int nf_tables_dump_chains(struct sk_buff *skb,
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -3584,7 +3587,7 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
+ 	u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 
+ 	nlh = nfnl_msg_put(skb, portid, seq, type, flags, family, NFNETLINK_V0,
+-			   nft_base_seq(net));
++			   nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -3752,7 +3755,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -3963,7 +3966,7 @@ static int nf_tables_getrule_reset(struct sk_buff *skb,
+ 	buf = kasprintf(GFP_ATOMIC, "%.*s:%u",
+ 			nla_len(nla[NFTA_RULE_TABLE]),
+ 			(char *)nla_data(nla[NFTA_RULE_TABLE]),
+-			nft_net->base_seq);
++			nft_base_seq(net));
+ 	audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1,
+ 			AUDIT_NFT_OP_RULE_RESET, GFP_ATOMIC);
+ 	kfree(buf);
+@@ -4776,7 +4779,7 @@ static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
+-			   NFNETLINK_V0, nft_base_seq(ctx->net));
++			   NFNETLINK_V0, nft_base_seq_be16(ctx->net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -4917,7 +4920,7 @@ static int nf_tables_dump_sets(struct sk_buff *skb, struct netlink_callback *cb)
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (ctx->family != NFPROTO_UNSPEC &&
+@@ -6094,7 +6097,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (dump_ctx->ctx.family != NFPROTO_UNSPEC &&
+@@ -6123,7 +6126,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
+ 	seq    = cb->nlh->nlmsg_seq;
+ 
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, NLM_F_MULTI,
+-			   table->family, NFNETLINK_V0, nft_base_seq(net));
++			   table->family, NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -6216,7 +6219,7 @@ static int nf_tables_fill_setelem_info(struct sk_buff *skb,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
+-			   NFNETLINK_V0, nft_base_seq(ctx->net));
++			   NFNETLINK_V0, nft_base_seq_be16(ctx->net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -6515,7 +6518,7 @@ static int nf_tables_getsetelem_reset(struct sk_buff *skb,
+ 		}
+ 		nelems++;
+ 	}
+-	audit_log_nft_set_reset(dump_ctx.ctx.table, nft_net->base_seq, nelems);
++	audit_log_nft_set_reset(dump_ctx.ctx.table, nft_base_seq(info->net), nelems);
+ 
+ out_unlock:
+ 	rcu_read_unlock();
+@@ -8266,7 +8269,7 @@ static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+-			   NFNETLINK_V0, nft_base_seq(net));
++			   NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -8330,7 +8333,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -8364,7 +8367,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
+ 			idx++;
+ 		}
+ 		if (ctx->reset && entries)
+-			audit_log_obj_reset(table, nft_net->base_seq, entries);
++			audit_log_obj_reset(table, nft_base_seq(net), entries);
+ 		if (rc < 0)
+ 			break;
+ 	}
+@@ -8533,7 +8536,7 @@ static int nf_tables_getobj_reset(struct sk_buff *skb,
+ 	buf = kasprintf(GFP_ATOMIC, "%.*s:%u",
+ 			nla_len(nla[NFTA_OBJ_TABLE]),
+ 			(char *)nla_data(nla[NFTA_OBJ_TABLE]),
+-			nft_net->base_seq);
++			nft_base_seq(net));
+ 	audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1,
+ 			AUDIT_NFT_OP_OBJ_RESET, GFP_ATOMIC);
+ 	kfree(buf);
+@@ -8640,7 +8643,7 @@ void nft_obj_notify(struct net *net, const struct nft_table *table,
+ {
+ 	struct nftables_pernet *nft_net = nft_pernet(net);
+ 	char *buf = kasprintf(gfp, "%s:%u",
+-			      table->name, nft_net->base_seq);
++			      table->name, nft_base_seq(net));
+ 
+ 	audit_log_nfcfg(buf,
+ 			family,
+@@ -9288,7 +9291,7 @@ static int nf_tables_fill_flowtable_info(struct sk_buff *skb, struct net *net,
+ 
+ 	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+-			   NFNETLINK_V0, nft_base_seq(net));
++			   NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+@@ -9356,7 +9359,7 @@ static int nf_tables_dump_flowtable(struct sk_buff *skb,
+ 
+ 	rcu_read_lock();
+ 	nft_net = nft_pernet(net);
+-	cb->seq = READ_ONCE(nft_net->base_seq);
++	cb->seq = nft_base_seq(net);
+ 
+ 	list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ 		if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -9541,17 +9544,16 @@ static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
+ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
+ 				   u32 portid, u32 seq)
+ {
+-	struct nftables_pernet *nft_net = nft_pernet(net);
+ 	struct nlmsghdr *nlh;
+ 	char buf[TASK_COMM_LEN];
+ 	int event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWGEN);
+ 
+ 	nlh = nfnl_msg_put(skb, portid, seq, event, 0, AF_UNSPEC,
+-			   NFNETLINK_V0, nft_base_seq(net));
++			   NFNETLINK_V0, nft_base_seq_be16(net));
+ 	if (!nlh)
+ 		goto nla_put_failure;
+ 
+-	if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_net->base_seq)) ||
++	if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_base_seq(net))) ||
+ 	    nla_put_be32(skb, NFTA_GEN_PROC_PID, htonl(task_pid_nr(current))) ||
+ 	    nla_put_string(skb, NFTA_GEN_PROC_NAME, get_task_comm(buf, current)))
+ 		goto nla_put_failure;
+@@ -10727,11 +10729,11 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ 	 * Bump generation counter, invalidate any dump in progress.
+ 	 * Cannot fail after this point.
+ 	 */
+-	base_seq = READ_ONCE(nft_net->base_seq);
++	base_seq = nft_base_seq(net);
+ 	while (++base_seq == 0)
+ 		;
+ 
+-	WRITE_ONCE(nft_net->base_seq, base_seq);
++	WRITE_ONCE(net->nft.base_seq, base_seq);
+ 
+ 	gc_seq = nft_gc_seq_begin(nft_net);
+ 
+@@ -10940,7 +10942,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ 
+ 	nft_commit_notify(net, NETLINK_CB(skb).portid);
+ 	nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN);
+-	nf_tables_commit_audit_log(&adl, nft_net->base_seq);
++	nf_tables_commit_audit_log(&adl, nft_base_seq(net));
+ 
+ 	nft_gc_seq_end(nft_net, gc_seq);
+ 	nft_net->validate_state = NFT_VALIDATE_SKIP;
+@@ -11265,7 +11267,7 @@ static bool nf_tables_valid_genid(struct net *net, u32 genid)
+ 	mutex_lock(&nft_net->commit_mutex);
+ 	nft_net->tstamp = get_jiffies_64();
+ 
+-	genid_ok = genid == 0 || nft_net->base_seq == genid;
++	genid_ok = genid == 0 || nft_base_seq(net) == genid;
+ 	if (!genid_ok)
+ 		mutex_unlock(&nft_net->commit_mutex);
+ 
+@@ -11902,7 +11904,7 @@ static int __net_init nf_tables_init_net(struct net *net)
+ 	INIT_LIST_HEAD(&nft_net->module_list);
+ 	INIT_LIST_HEAD(&nft_net->notify_list);
+ 	mutex_init(&nft_net->commit_mutex);
+-	nft_net->base_seq = 1;
++	net->nft.base_seq = 1;
+ 	nft_net->gc_seq = 0;
+ 	nft_net->validate_state = NFT_VALIDATE_SKIP;
+ 	INIT_WORK(&nft_net->destroy_work, nf_tables_trans_destroy_work);
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally
  2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
                   ` (2 preceding siblings ...)
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
  4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

Helper for the actual bugfix.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ...les-make-nft_set_do_lookup-available.patch | 86 +++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch

diff --git a/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch b/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
new file mode 100644
index 000000000000..0194b7e7776f
--- /dev/null
+++ b/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
@@ -0,0 +1,86 @@
+From 35120b5cb4467a234f4ffecc52c7ff6630a31907 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:10:11 +0200
+Subject: [PATCH 4/5] netfilter: nf_tables: make nft_set_do_lookup available
+ unconditionally
+
+This function was added for retpoline mitigation and is replaced by a
+static inline helper if mitigations are not enabled.
+
+Enable this helper function unconditionally so next patch can add a lookup
+restart mechanism to fix possible false negatives while transactions are
+in progress.
+
+Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would
+then need the same copypaste loop.
+
+This patch is separate to ease review of the actual bug fix.
+
+Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ include/net/netfilter/nf_tables_core.h | 10 ++--------
+ net/netfilter/nft_lookup.c             | 11 ++++++++---
+ 2 files changed, 10 insertions(+), 11 deletions(-)
+
+diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
+index 03b6165756fc..04fc4a411a86 100644
+--- a/include/net/netfilter/nf_tables_core.h
++++ b/include/net/netfilter/nf_tables_core.h
+@@ -105,16 +105,10 @@ bool nft_hash_lookup_fast(const struct net *net,
+ 			  const u32 *key, const struct nft_set_ext **ext);
+ bool nft_hash_lookup(const struct net *net, const struct nft_set *set,
+ 		     const u32 *key, const struct nft_set_ext **ext);
++#endif
++
+ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ 		       const u32 *key, const struct nft_set_ext **ext);
+-#else
+-static inline bool
+-nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+-		  const u32 *key, const struct nft_set_ext **ext)
+-{
+-	return set->ops->lookup(net, set, key, ext);
+-}
+-#endif
+ 
+ /* called from nft_pipapo_avx2.c */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
+index 63ef832b8aa7..7d0add1041bb 100644
+--- a/net/netfilter/nft_lookup.c
++++ b/net/netfilter/nft_lookup.c
+@@ -24,10 +24,10 @@ struct nft_lookup {
+ 	struct nft_set_binding		binding;
+ };
+ 
+-#ifdef CONFIG_MITIGATION_RETPOLINE
+-bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
++static bool __nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ 		       const u32 *key, const struct nft_set_ext **ext)
+ {
++#ifdef CONFIG_MITIGATION_RETPOLINE
+ 	if (set->ops == &nft_set_hash_fast_type.ops)
+ 		return nft_hash_lookup_fast(net, set, key, ext);
+ 	if (set->ops == &nft_set_hash_type.ops)
+@@ -50,10 +50,15 @@ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ 		return nft_rbtree_lookup(net, set, key, ext);
+ 
+ 	WARN_ON_ONCE(1);
++#endif
+ 	return set->ops->lookup(net, set, key, ext);
+ }
++bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
++		       const u32 *key, const struct nft_set_ext **ext)
++{
++	return __nft_set_do_lookup(net, set, key, ext);
++}
+ EXPORT_SYMBOL_GPL(nft_set_do_lookup);
+-#endif
+ 
+ void nft_lookup_eval(const struct nft_expr *expr,
+ 		     struct nft_regs *regs,
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change
  2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
                   ` (3 preceding siblings ...)
  2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
  4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
  To: pve-devel

If a set lookup fails, retry again if the base_seq has changed. This
means that if a lookup does not return a value, check if there was a
generation change in the meantime. If there was, redo the lookup. Exact
explanation is inside the patch.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ...les-restart-set-lookup-on-base_seq-c.patch | 148 ++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch

diff --git a/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
new file mode 100644
index 000000000000..26b28ffc0aa1
--- /dev/null
+++ b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
@@ -0,0 +1,148 @@
+From 35dd20f6e27b4a6b0b6c2cbf6ebf8eab33121378 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:18:18 +0200
+Subject: [PATCH 5/5] netfilter: nf_tables: restart set lookup on base_seq
+ change
+
+The hash, hash_fast, rhash and bitwise sets may indicate no result even
+though a matching element exists during a short time window while other
+cpu is finalizing the transaction.
+
+This happens when the hash lookup/bitwise lookup function has picked up
+the old genbit, right before it was toggled by nf_tables_commit(), but
+then the same cpu managed to unlink the matching old element from the
+hash table:
+
+cpu0					cpu1
+  has added new elements to clone
+  has marked elements as being
+  inactive in new generation
+					perform lookup in the set
+  enters commit phase:
+					A) observes old genbit
+   increments base_seq
+I) increments the genbit
+II) removes old element from the set
+					B) finds matching element
+					C) returns no match: found
+					element is not valid in old
+					generation
+
+					Next lookup observes new genbit and
+					finds matching e2.
+
+Consider a packet matching element e1, e2.
+
+cpu0 processes following transaction:
+1. remove e1
+2. adds e2, which has same key as e1.
+
+P matches both e1 and e2.  Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 observed the old genbit.  e2 will not be considered once it is found.
+The element e1 is not found anymore if cpu0 managed to unlink it from the
+hlist before cpu1 found it during list traversal.
+
+The situation only occurs for a brief time period, lookups happening
+after I) observe new genbit and return e2.
+
+This problem exists in all set types except nft_set_pipapo, so fix it once
+in nft_lookup rather than each set ops individually.
+
+Sample the base sequence counter, which gets incremented right before the
+genbit is changed.
+
+Then, if no match is found, retry the lookup if the base sequence was
+altered in between.
+
+If the base sequence hasn't changed:
+ - No update took place: no-match result is expected.
+   This is the common case.  or:
+ - nf_tables_commit() hasn't progressed to genbit update yet.
+   Old elements were still visible and nomatch result is expected, or:
+ - nf_tables_commit updated the genbit:
+   We picked up the new base_seq, so the lookup function also picked
+   up the new genbit, no-match result is expected.
+
+If the old genbit was observed, then nft_lookup also picked up the old
+base_seq: nft_lookup_should_retry() returns true and relookup is performed
+in the new generation.
+
+Thanks to Pablo Neira Ayuso for reviewing an earlier version of this
+patchset, for suggesting re-use of existing base_seq and placement of
+the restart loop in nft_set_do_lookup().
+
+Signed-off-by: Florian Westphal <fw@strlen.de>
+[GG: this patch needed a bigger change because __nft_set_do_lookup
+doesn't ext pointer yet, but still returns a bool, so we just check if
+the bool is false, so if no element has been found and retry.]
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nf_tables_api.c |  3 ++-
+ net/netfilter/nft_lookup.c    | 32 +++++++++++++++++++++++++++++++-
+ 2 files changed, 33 insertions(+), 2 deletions(-)
+
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index f9e7f056ea5b..40a32ef86470 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -10733,7 +10733,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ 	while (++base_seq == 0)
+ 		;
+ 
+-	WRITE_ONCE(net->nft.base_seq, base_seq);
++	/* pairs with smp_load_acquire in nft_lookup_eval */
++	smp_store_release(&net->nft.base_seq, base_seq);
+ 
+ 	gc_seq = nft_gc_seq_begin(nft_net);
+ 
+diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
+index 7d0add1041bb..fc0eb91d94c0 100644
+--- a/net/netfilter/nft_lookup.c
++++ b/net/netfilter/nft_lookup.c
+@@ -53,10 +53,40 @@ static bool __nft_set_do_lookup(const struct net *net, const struct nft_set *set
+ #endif
+ 	return set->ops->lookup(net, set, key, ext);
+ }
++
++static unsigned int nft_base_seq(const struct net *net)
++{
++	/* pairs with smp_store_release() in nf_tables_commit() */
++	return smp_load_acquire(&net->nft.base_seq);
++}
++
++static bool nft_lookup_should_retry(const struct net *net, unsigned int seq)
++{
++	return unlikely(seq != nft_base_seq(net));
++}
++
+ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ 		       const u32 *key, const struct nft_set_ext **ext)
+ {
+-	return __nft_set_do_lookup(net, set, key, ext);
++	bool result;
++	unsigned int base_seq;
++
++	do {
++		base_seq = nft_base_seq(net);
++
++		result = __nft_set_do_lookup(net, set, key, ext);
++		if (result)
++			break;
++		/* No match?  There is a small chance that lookup was
++		 * performed in the old generation, but nf_tables_commit()
++		 * already unlinked a (matching) element.
++		 *
++		 * We need to repeat the lookup to make sure that we didn't
++		 * miss a matching element in the new generation.
++		 */
++	} while (nft_lookup_should_retry(net, base_seq));
++
++	return result;
+ }
+ EXPORT_SYMBOL_GPL(nft_set_do_lookup);
+ 
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-09-11 10:06 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal