From: Gabriel Goller <g.goller@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
Date: Thu, 11 Sep 2025 12:05:42 +0200 [thread overview]
Message-ID: <20250911100555.63174-2-g.goller@proxmox.com> (raw)
In-Reply-To: <20250911100555.63174-1-g.goller@proxmox.com>
The pipapo set is a specific nftables set which holds ip-ranges. When
updating the set, the new rules are not visible (to the live ruleset)
until the next generation, so we don't need to check the genbit on
lookup.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...t_pipapo-don-t-check-genbit-from-pac.patch | 160 ++++++++++++++++++
1 file changed, 160 insertions(+)
create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
diff --git a/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
new file mode 100644
index 000000000000..be34c0efa7e9
--- /dev/null
+++ b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
@@ -0,0 +1,160 @@
+From da2f3da272763d4382e4c82d2a6d1358aed10123 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:09 +0200
+Subject: [PATCH 1/5] netfilter: nft_set_pipapo: don't check genbit from
+ packetpath lookups
+
+The pipapo set type is special in that it has two copies of its
+datastructure: one live copy containing only valid elements and one
+on-demand clone used during transaction where adds/deletes happen.
+
+This clone is not visible to the datapath.
+
+This is unlike all other set types in nftables, those all link new
+elements into their live hlist/tree.
+
+For those sets, the lookup functions must skip the new elements while the
+transaction is ongoing to ensure consistency.
+
+As the clone is shallow, removal does have an effect on the packet path:
+once the transaction enters the commit phase the 'gencursor' bit that
+determines which elements are active and which elements should be ignored
+(because they are no longer valid) is flipped.
+
+This causes the datapath lookup to ignore these elements if they are found
+during lookup.
+
+This opens up a small race window where pipapo has an inconsistent view of
+the dataset from when the transaction-cpu flipped the genbit until the
+transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
+
+cpu0 cpu1
+ has added new elements to clone
+ has marked elements as being
+ inactive in new generation
+ perform lookup in the set
+ enters commit phase:
+
+I) increments the genbit
+ A) observes new genbit
+ removes elements from the clone so
+ they won't be found anymore
+ B) lookup in datastructure
+ can't see new elements yet,
+ but old elements are ignored
+ -> Only matches elements that
+ were not changed in the
+ transaction
+II) calls nft_pipapo_commit(), clone
+ and live pointers are swapped.
+ C New nft_lookup happening now
+ will find matching elements.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.
+
+At the same time, r1-r3 is not visible yet, because it can only be found
+in the clone.
+
+The situation persists for all lookups until after cpu0 hits II).
+
+The fix is easy: Don't check the genbit from pipapo lookup functions.
+This is possible because unlike the other set types, the new elements are
+not reachable from the live copy of the dataset.
+
+The clone/live pointer swap is enough to avoid matching on old elements
+while at the same time all new elements are exposed in one go.
+
+After this change, step B above returns a match in r1-r2.
+This is fine: r1-r2 only becomes truly invalid the moment they get freed.
+This happens after a synchronize_rcu() call and rcu read lock is held
+via netfilter hook traversal (nf_hook_slow()).
+
+Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_pipapo.c | 21 +++++++++++++++++++--
+ net/netfilter/nft_set_pipapo_avx2.c | 4 +---
+ 2 files changed, 20 insertions(+), 5 deletions(-)
+
+diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
+index 0529e4ef7520..a386c3761cac 100644
+--- a/net/netfilter/nft_set_pipapo.c
++++ b/net/netfilter/nft_set_pipapo.c
+@@ -405,6 +405,24 @@ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules,
+ *
+ * For more details, see DOC: Theory of Operation.
+ *
++ * Unlike other set types, this uses NFT_GENMASK_ANY instead of
++ * nft_genmask_cur().
++ *
++ * This is because new (future) elements are not reachable from
++ * priv->match, they get added to priv->clone instead.
++ * When the commit phase flips the generation bitmask, the
++ * 'now old' entries are skipped but without the 'now current'
++ * elements becoming visible. Using nft_genmask_cur() thus creates
++ * inconsistent state: matching old entries get skipped but thew
++ * newly matching entries are unreachable.
++ *
++ * GENMASK will still find the 'now old' entries which ensures consistent
++ * priv->match view.
++ *
++ * nft_pipapo_commit swaps ->clone and ->match shortly after the
++ * genbit flip. As ->clone doesn't contain the old entries in the first
++ * place, lookup will only find the now-current ones.
++ *
+ * Return: true on match, false otherwise.
+ */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+@@ -413,7 +431,6 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+ unsigned long *res_map, *fill_map;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -471,7 +488,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ if (last) {
+ *ext = &f->mt[b].e->ext;
+ if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask)))
++ !nft_set_elem_active(*ext, NFT_GENMASK_ANY)))
+ goto next_match;
+
+ /* Last field: we're just returning the key without
+diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
+index be7c16c79f71..675c0a9634b6 100644
+--- a/net/netfilter/nft_set_pipapo_avx2.c
++++ b/net/netfilter/nft_set_pipapo_avx2.c
+@@ -1151,7 +1151,6 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ {
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -1245,8 +1244,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+
+ if (last) {
+ *ext = &f->mt[ret].e->ext;
+- if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask))) {
++ if (unlikely(nft_set_elem_expired(*ext))) {
+ ret = 0;
+ goto next_match;
+ }
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
next prev parent reply other threads:[~2025-09-11 10:06 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` Gabriel Goller [this message]
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250911100555.63174-2-g.goller@proxmox.com \
--to=g.goller@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox