From: Gabriel Goller <g.goller@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
Date: Thu, 11 Sep 2025 12:05:42 +0200 [thread overview]
Message-ID: <20250911100555.63174-2-g.goller@proxmox.com> (raw)
In-Reply-To: <20250911100555.63174-1-g.goller@proxmox.com>
The pipapo set is a specific nftables set which holds ip-ranges. When
updating the set, the new rules are not visible (to the live ruleset)
until the next generation, so we don't need to check the genbit on
lookup.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...t_pipapo-don-t-check-genbit-from-pac.patch | 160 ++++++++++++++++++
1 file changed, 160 insertions(+)
create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
diff --git a/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
new file mode 100644
index 000000000000..be34c0efa7e9
--- /dev/null
+++ b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
@@ -0,0 +1,160 @@
+From da2f3da272763d4382e4c82d2a6d1358aed10123 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:09 +0200
+Subject: [PATCH 1/5] netfilter: nft_set_pipapo: don't check genbit from
+ packetpath lookups
+
+The pipapo set type is special in that it has two copies of its
+datastructure: one live copy containing only valid elements and one
+on-demand clone used during transaction where adds/deletes happen.
+
+This clone is not visible to the datapath.
+
+This is unlike all other set types in nftables, those all link new
+elements into their live hlist/tree.
+
+For those sets, the lookup functions must skip the new elements while the
+transaction is ongoing to ensure consistency.
+
+As the clone is shallow, removal does have an effect on the packet path:
+once the transaction enters the commit phase the 'gencursor' bit that
+determines which elements are active and which elements should be ignored
+(because they are no longer valid) is flipped.
+
+This causes the datapath lookup to ignore these elements if they are found
+during lookup.
+
+This opens up a small race window where pipapo has an inconsistent view of
+the dataset from when the transaction-cpu flipped the genbit until the
+transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
+
+cpu0 cpu1
+ has added new elements to clone
+ has marked elements as being
+ inactive in new generation
+ perform lookup in the set
+ enters commit phase:
+
+I) increments the genbit
+ A) observes new genbit
+ removes elements from the clone so
+ they won't be found anymore
+ B) lookup in datastructure
+ can't see new elements yet,
+ but old elements are ignored
+ -> Only matches elements that
+ were not changed in the
+ transaction
+II) calls nft_pipapo_commit(), clone
+ and live pointers are swapped.
+ C New nft_lookup happening now
+ will find matching elements.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.
+
+At the same time, r1-r3 is not visible yet, because it can only be found
+in the clone.
+
+The situation persists for all lookups until after cpu0 hits II).
+
+The fix is easy: Don't check the genbit from pipapo lookup functions.
+This is possible because unlike the other set types, the new elements are
+not reachable from the live copy of the dataset.
+
+The clone/live pointer swap is enough to avoid matching on old elements
+while at the same time all new elements are exposed in one go.
+
+After this change, step B above returns a match in r1-r2.
+This is fine: r1-r2 only becomes truly invalid the moment they get freed.
+This happens after a synchronize_rcu() call and rcu read lock is held
+via netfilter hook traversal (nf_hook_slow()).
+
+Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_pipapo.c | 21 +++++++++++++++++++--
+ net/netfilter/nft_set_pipapo_avx2.c | 4 +---
+ 2 files changed, 20 insertions(+), 5 deletions(-)
+
+diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
+index 0529e4ef7520..a386c3761cac 100644
+--- a/net/netfilter/nft_set_pipapo.c
++++ b/net/netfilter/nft_set_pipapo.c
+@@ -405,6 +405,24 @@ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules,
+ *
+ * For more details, see DOC: Theory of Operation.
+ *
++ * Unlike other set types, this uses NFT_GENMASK_ANY instead of
++ * nft_genmask_cur().
++ *
++ * This is because new (future) elements are not reachable from
++ * priv->match, they get added to priv->clone instead.
++ * When the commit phase flips the generation bitmask, the
++ * 'now old' entries are skipped but without the 'now current'
++ * elements becoming visible. Using nft_genmask_cur() thus creates
++ * inconsistent state: matching old entries get skipped but thew
++ * newly matching entries are unreachable.
++ *
++ * GENMASK will still find the 'now old' entries which ensures consistent
++ * priv->match view.
++ *
++ * nft_pipapo_commit swaps ->clone and ->match shortly after the
++ * genbit flip. As ->clone doesn't contain the old entries in the first
++ * place, lookup will only find the now-current ones.
++ *
+ * Return: true on match, false otherwise.
+ */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+@@ -413,7 +431,6 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+ unsigned long *res_map, *fill_map;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -471,7 +488,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ if (last) {
+ *ext = &f->mt[b].e->ext;
+ if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask)))
++ !nft_set_elem_active(*ext, NFT_GENMASK_ANY)))
+ goto next_match;
+
+ /* Last field: we're just returning the key without
+diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
+index be7c16c79f71..675c0a9634b6 100644
+--- a/net/netfilter/nft_set_pipapo_avx2.c
++++ b/net/netfilter/nft_set_pipapo_avx2.c
+@@ -1151,7 +1151,6 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ {
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -1245,8 +1244,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+
+ if (last) {
+ *ext = &f->mt[ret].e->ext;
+- if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask))) {
++ if (unlikely(nft_set_elem_expired(*ext))) {
+ ret = 0;
+ goto next_match;
+ }
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
next prev parent reply other threads:[~2025-09-11 10:06 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` Gabriel Goller [this message]
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250911100555.63174-2-g.goller@proxmox.com \
--to=g.goller@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.