public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
From: Gabriel Goller <g.goller@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
Date: Thu, 11 Sep 2025 12:05:42 +0200	[thread overview]
Message-ID: <20250911100555.63174-2-g.goller@proxmox.com> (raw)
In-Reply-To: <20250911100555.63174-1-g.goller@proxmox.com>

The pipapo set is a specific nftables set which holds ip-ranges. When
updating the set, the new rules are not visible (to the live ruleset)
until the next generation, so we don't need to check the genbit on
lookup.

Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
 ...t_pipapo-don-t-check-genbit-from-pac.patch | 160 ++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch

diff --git a/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
new file mode 100644
index 000000000000..be34c0efa7e9
--- /dev/null
+++ b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
@@ -0,0 +1,160 @@
+From da2f3da272763d4382e4c82d2a6d1358aed10123 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:09 +0200
+Subject: [PATCH 1/5] netfilter: nft_set_pipapo: don't check genbit from
+ packetpath lookups
+
+The pipapo set type is special in that it has two copies of its
+datastructure: one live copy containing only valid elements and one
+on-demand clone used during transaction where adds/deletes happen.
+
+This clone is not visible to the datapath.
+
+This is unlike all other set types in nftables, those all link new
+elements into their live hlist/tree.
+
+For those sets, the lookup functions must skip the new elements while the
+transaction is ongoing to ensure consistency.
+
+As the clone is shallow, removal does have an effect on the packet path:
+once the transaction enters the commit phase the 'gencursor' bit that
+determines which elements are active and which elements should be ignored
+(because they are no longer valid) is flipped.
+
+This causes the datapath lookup to ignore these elements if they are found
+during lookup.
+
+This opens up a small race window where pipapo has an inconsistent view of
+the dataset from when the transaction-cpu flipped the genbit until the
+transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
+
+cpu0					cpu1
+  has added new elements to clone
+  has marked elements as being
+  inactive in new generation
+					perform lookup in the set
+  enters commit phase:
+
+I) increments the genbit
+					A) observes new genbit
+  removes elements from the clone so
+  they won't be found anymore
+					B) lookup in datastructure
+					   can't see new elements yet,
+					   but old elements are ignored
+					   -> Only matches elements that
+					   were not changed in the
+					   transaction
+II) calls nft_pipapo_commit(), clone
+    and live pointers are swapped.
+					C New nft_lookup happening now
+				       	  will find matching elements.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P.  Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.
+
+At the same time, r1-r3 is not visible yet, because it can only be found
+in the clone.
+
+The situation persists for all lookups until after cpu0 hits II).
+
+The fix is easy: Don't check the genbit from pipapo lookup functions.
+This is possible because unlike the other set types, the new elements are
+not reachable from the live copy of the dataset.
+
+The clone/live pointer swap is enough to avoid matching on old elements
+while at the same time all new elements are exposed in one go.
+
+After this change, step B above returns a match in r1-r2.
+This is fine: r1-r2 only becomes truly invalid the moment they get freed.
+This happens after a synchronize_rcu() call and rcu read lock is held
+via netfilter hook traversal (nf_hook_slow()).
+
+Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_pipapo.c      | 21 +++++++++++++++++++--
+ net/netfilter/nft_set_pipapo_avx2.c |  4 +---
+ 2 files changed, 20 insertions(+), 5 deletions(-)
+
+diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
+index 0529e4ef7520..a386c3761cac 100644
+--- a/net/netfilter/nft_set_pipapo.c
++++ b/net/netfilter/nft_set_pipapo.c
+@@ -405,6 +405,24 @@ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules,
+  *
+  * For more details, see DOC: Theory of Operation.
+  *
++ * Unlike other set types, this uses NFT_GENMASK_ANY instead of
++ * nft_genmask_cur().
++ *
++ * This is because new (future) elements are not reachable from
++ * priv->match, they get added to priv->clone instead.
++ * When the commit phase flips the generation bitmask, the
++ * 'now old' entries are skipped but without the 'now current'
++ * elements becoming visible. Using nft_genmask_cur() thus creates
++ * inconsistent state: matching old entries get skipped but thew
++ * newly matching entries are unreachable.
++ *
++ * GENMASK will still find the 'now old' entries which ensures consistent
++ * priv->match view.
++ *
++ * nft_pipapo_commit swaps ->clone and ->match shortly after the
++ * genbit flip.  As ->clone doesn't contain the old entries in the first
++ * place, lookup will only find the now-current ones.
++ *
+  * Return: true on match, false otherwise.
+  */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+@@ -413,7 +431,6 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ 	struct nft_pipapo *priv = nft_set_priv(set);
+ 	struct nft_pipapo_scratch *scratch;
+ 	unsigned long *res_map, *fill_map;
+-	u8 genmask = nft_genmask_cur(net);
+ 	const struct nft_pipapo_match *m;
+ 	const struct nft_pipapo_field *f;
+ 	const u8 *rp = (const u8 *)key;
+@@ -471,7 +488,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ 		if (last) {
+ 			*ext = &f->mt[b].e->ext;
+ 			if (unlikely(nft_set_elem_expired(*ext) ||
+-				     !nft_set_elem_active(*ext, genmask)))
++				     !nft_set_elem_active(*ext, NFT_GENMASK_ANY)))
+ 				goto next_match;
+ 
+ 			/* Last field: we're just returning the key without
+diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
+index be7c16c79f71..675c0a9634b6 100644
+--- a/net/netfilter/nft_set_pipapo_avx2.c
++++ b/net/netfilter/nft_set_pipapo_avx2.c
+@@ -1151,7 +1151,6 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ {
+ 	struct nft_pipapo *priv = nft_set_priv(set);
+ 	struct nft_pipapo_scratch *scratch;
+-	u8 genmask = nft_genmask_cur(net);
+ 	const struct nft_pipapo_match *m;
+ 	const struct nft_pipapo_field *f;
+ 	const u8 *rp = (const u8 *)key;
+@@ -1245,8 +1244,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ 
+ 		if (last) {
+ 			*ext = &f->mt[ret].e->ext;
+-			if (unlikely(nft_set_elem_expired(*ext) ||
+-				     !nft_set_elem_active(*ext, genmask))) {
++			if (unlikely(nft_set_elem_expired(*ext))) {
+ 				ret = 0;
+ 				goto next_match;
+ 			}
+-- 
+2.47.3
+
-- 
2.47.3



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


  reply	other threads:[~2025-09-11 10:06 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` Gabriel Goller [this message]
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250911100555.63174-2-g.goller@proxmox.com \
    --to=g.goller@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal