From: Gabriel Goller <g.goller@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change
Date: Thu, 11 Sep 2025 12:05:46 +0200 [thread overview]
Message-ID: <20250911100555.63174-6-g.goller@proxmox.com> (raw)
In-Reply-To: <20250911100555.63174-1-g.goller@proxmox.com>
If a set lookup fails, retry again if the base_seq has changed. This
means that if a lookup does not return a value, check if there was a
generation change in the meantime. If there was, redo the lookup. Exact
explanation is inside the patch.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...les-restart-set-lookup-on-base_seq-c.patch | 148 ++++++++++++++++++
1 file changed, 148 insertions(+)
create mode 100644 patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
diff --git a/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
new file mode 100644
index 000000000000..26b28ffc0aa1
--- /dev/null
+++ b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
@@ -0,0 +1,148 @@
+From 35dd20f6e27b4a6b0b6c2cbf6ebf8eab33121378 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:18:18 +0200
+Subject: [PATCH 5/5] netfilter: nf_tables: restart set lookup on base_seq
+ change
+
+The hash, hash_fast, rhash and bitwise sets may indicate no result even
+though a matching element exists during a short time window while other
+cpu is finalizing the transaction.
+
+This happens when the hash lookup/bitwise lookup function has picked up
+the old genbit, right before it was toggled by nf_tables_commit(), but
+then the same cpu managed to unlink the matching old element from the
+hash table:
+
+cpu0 cpu1
+ has added new elements to clone
+ has marked elements as being
+ inactive in new generation
+ perform lookup in the set
+ enters commit phase:
+ A) observes old genbit
+ increments base_seq
+I) increments the genbit
+II) removes old element from the set
+ B) finds matching element
+ C) returns no match: found
+ element is not valid in old
+ generation
+
+ Next lookup observes new genbit and
+ finds matching e2.
+
+Consider a packet matching element e1, e2.
+
+cpu0 processes following transaction:
+1. remove e1
+2. adds e2, which has same key as e1.
+
+P matches both e1 and e2. Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 observed the old genbit. e2 will not be considered once it is found.
+The element e1 is not found anymore if cpu0 managed to unlink it from the
+hlist before cpu1 found it during list traversal.
+
+The situation only occurs for a brief time period, lookups happening
+after I) observe new genbit and return e2.
+
+This problem exists in all set types except nft_set_pipapo, so fix it once
+in nft_lookup rather than each set ops individually.
+
+Sample the base sequence counter, which gets incremented right before the
+genbit is changed.
+
+Then, if no match is found, retry the lookup if the base sequence was
+altered in between.
+
+If the base sequence hasn't changed:
+ - No update took place: no-match result is expected.
+ This is the common case. or:
+ - nf_tables_commit() hasn't progressed to genbit update yet.
+ Old elements were still visible and nomatch result is expected, or:
+ - nf_tables_commit updated the genbit:
+ We picked up the new base_seq, so the lookup function also picked
+ up the new genbit, no-match result is expected.
+
+If the old genbit was observed, then nft_lookup also picked up the old
+base_seq: nft_lookup_should_retry() returns true and relookup is performed
+in the new generation.
+
+Thanks to Pablo Neira Ayuso for reviewing an earlier version of this
+patchset, for suggesting re-use of existing base_seq and placement of
+the restart loop in nft_set_do_lookup().
+
+Signed-off-by: Florian Westphal <fw@strlen.de>
+[GG: this patch needed a bigger change because __nft_set_do_lookup
+doesn't ext pointer yet, but still returns a bool, so we just check if
+the bool is false, so if no element has been found and retry.]
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nf_tables_api.c | 3 ++-
+ net/netfilter/nft_lookup.c | 32 +++++++++++++++++++++++++++++++-
+ 2 files changed, 33 insertions(+), 2 deletions(-)
+
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index f9e7f056ea5b..40a32ef86470 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -10733,7 +10733,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ while (++base_seq == 0)
+ ;
+
+- WRITE_ONCE(net->nft.base_seq, base_seq);
++ /* pairs with smp_load_acquire in nft_lookup_eval */
++ smp_store_release(&net->nft.base_seq, base_seq);
+
+ gc_seq = nft_gc_seq_begin(nft_net);
+
+diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
+index 7d0add1041bb..fc0eb91d94c0 100644
+--- a/net/netfilter/nft_lookup.c
++++ b/net/netfilter/nft_lookup.c
+@@ -53,10 +53,40 @@ static bool __nft_set_do_lookup(const struct net *net, const struct nft_set *set
+ #endif
+ return set->ops->lookup(net, set, key, ext);
+ }
++
++static unsigned int nft_base_seq(const struct net *net)
++{
++ /* pairs with smp_store_release() in nf_tables_commit() */
++ return smp_load_acquire(&net->nft.base_seq);
++}
++
++static bool nft_lookup_should_retry(const struct net *net, unsigned int seq)
++{
++ return unlikely(seq != nft_base_seq(net));
++}
++
+ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext)
+ {
+- return __nft_set_do_lookup(net, set, key, ext);
++ bool result;
++ unsigned int base_seq;
++
++ do {
++ base_seq = nft_base_seq(net);
++
++ result = __nft_set_do_lookup(net, set, key, ext);
++ if (result)
++ break;
++ /* No match? There is a small chance that lookup was
++ * performed in the old generation, but nf_tables_commit()
++ * already unlinked a (matching) element.
++ *
++ * We need to repeat the lookup to make sure that we didn't
++ * miss a matching element in the new generation.
++ */
++ } while (nft_lookup_should_retry(net, base_seq));
++
++ return result;
+ }
+ EXPORT_SYMBOL_GPL(nft_set_do_lookup);
+
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
prev parent reply other threads:[~2025-09-11 10:06 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` Gabriel Goller [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250911100555.63174-6-g.goller@proxmox.com \
--to=g.goller@proxmox.com \
--cc=pve-devel@lists.proxmox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox