* [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
To: pve-devel
The pipapo set is a specific nftables set which holds ip-ranges. When
updating the set, the new rules are not visible (to the live ruleset)
until the next generation, so we don't need to check the genbit on
lookup.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...t_pipapo-don-t-check-genbit-from-pac.patch | 160 ++++++++++++++++++
1 file changed, 160 insertions(+)
create mode 100644 patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
diff --git a/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
new file mode 100644
index 000000000000..be34c0efa7e9
--- /dev/null
+++ b/patches/kernel/0014-netfilter-nft_set_pipapo-don-t-check-genbit-from-pac.patch
@@ -0,0 +1,160 @@
+From da2f3da272763d4382e4c82d2a6d1358aed10123 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:08:09 +0200
+Subject: [PATCH 1/5] netfilter: nft_set_pipapo: don't check genbit from
+ packetpath lookups
+
+The pipapo set type is special in that it has two copies of its
+datastructure: one live copy containing only valid elements and one
+on-demand clone used during transaction where adds/deletes happen.
+
+This clone is not visible to the datapath.
+
+This is unlike all other set types in nftables, those all link new
+elements into their live hlist/tree.
+
+For those sets, the lookup functions must skip the new elements while the
+transaction is ongoing to ensure consistency.
+
+As the clone is shallow, removal does have an effect on the packet path:
+once the transaction enters the commit phase the 'gencursor' bit that
+determines which elements are active and which elements should be ignored
+(because they are no longer valid) is flipped.
+
+This causes the datapath lookup to ignore these elements if they are found
+during lookup.
+
+This opens up a small race window where pipapo has an inconsistent view of
+the dataset from when the transaction-cpu flipped the genbit until the
+transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
+
+cpu0 cpu1
+ has added new elements to clone
+ has marked elements as being
+ inactive in new generation
+ perform lookup in the set
+ enters commit phase:
+
+I) increments the genbit
+ A) observes new genbit
+ removes elements from the clone so
+ they won't be found anymore
+ B) lookup in datastructure
+ can't see new elements yet,
+ but old elements are ignored
+ -> Only matches elements that
+ were not changed in the
+ transaction
+II) calls nft_pipapo_commit(), clone
+ and live pointers are swapped.
+ C New nft_lookup happening now
+ will find matching elements.
+
+Consider a packet matching range r1-r2:
+
+cpu0 processes following transaction:
+1. remove r1-r2
+2. add r1-r3
+
+P is contained in both ranges. Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 does find r1-r2, but then ignores it due to the genbit indicating
+the range has been removed.
+
+At the same time, r1-r3 is not visible yet, because it can only be found
+in the clone.
+
+The situation persists for all lookups until after cpu0 hits II).
+
+The fix is easy: Don't check the genbit from pipapo lookup functions.
+This is possible because unlike the other set types, the new elements are
+not reachable from the live copy of the dataset.
+
+The clone/live pointer swap is enough to avoid matching on old elements
+while at the same time all new elements are exposed in one go.
+
+After this change, step B above returns a match in r1-r2.
+This is fine: r1-r2 only becomes truly invalid the moment they get freed.
+This happens after a synchronize_rcu() call and rcu read lock is held
+via netfilter hook traversal (nf_hook_slow()).
+
+Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nft_set_pipapo.c | 21 +++++++++++++++++++--
+ net/netfilter/nft_set_pipapo_avx2.c | 4 +---
+ 2 files changed, 20 insertions(+), 5 deletions(-)
+
+diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
+index 0529e4ef7520..a386c3761cac 100644
+--- a/net/netfilter/nft_set_pipapo.c
++++ b/net/netfilter/nft_set_pipapo.c
+@@ -405,6 +405,24 @@ int pipapo_refill(unsigned long *map, unsigned int len, unsigned int rules,
+ *
+ * For more details, see DOC: Theory of Operation.
+ *
++ * Unlike other set types, this uses NFT_GENMASK_ANY instead of
++ * nft_genmask_cur().
++ *
++ * This is because new (future) elements are not reachable from
++ * priv->match, they get added to priv->clone instead.
++ * When the commit phase flips the generation bitmask, the
++ * 'now old' entries are skipped but without the 'now current'
++ * elements becoming visible. Using nft_genmask_cur() thus creates
++ * inconsistent state: matching old entries get skipped but thew
++ * newly matching entries are unreachable.
++ *
++ * GENMASK will still find the 'now old' entries which ensures consistent
++ * priv->match view.
++ *
++ * nft_pipapo_commit swaps ->clone and ->match shortly after the
++ * genbit flip. As ->clone doesn't contain the old entries in the first
++ * place, lookup will only find the now-current ones.
++ *
+ * Return: true on match, false otherwise.
+ */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+@@ -413,7 +431,6 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+ unsigned long *res_map, *fill_map;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -471,7 +488,7 @@ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+ if (last) {
+ *ext = &f->mt[b].e->ext;
+ if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask)))
++ !nft_set_elem_active(*ext, NFT_GENMASK_ANY)))
+ goto next_match;
+
+ /* Last field: we're just returning the key without
+diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
+index be7c16c79f71..675c0a9634b6 100644
+--- a/net/netfilter/nft_set_pipapo_avx2.c
++++ b/net/netfilter/nft_set_pipapo_avx2.c
+@@ -1151,7 +1151,6 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+ {
+ struct nft_pipapo *priv = nft_set_priv(set);
+ struct nft_pipapo_scratch *scratch;
+- u8 genmask = nft_genmask_cur(net);
+ const struct nft_pipapo_match *m;
+ const struct nft_pipapo_field *f;
+ const u8 *rp = (const u8 *)key;
+@@ -1245,8 +1244,7 @@ bool nft_pipapo_avx2_lookup(const struct net *net, const struct nft_set *set,
+
+ if (last) {
+ *ext = &f->mt[ret].e->ext;
+- if (unlikely(nft_set_elem_expired(*ext) ||
+- !nft_set_elem_active(*ext, genmask))) {
++ if (unlikely(nft_set_elem_expired(*ext))) {
+ ret = 0;
+ goto next_match;
+ }
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 1/5] kernel: backport: netfilter: nft_set_pipapo: don't check genbit from packetpath lookups Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 2/5] kernel: backport: netfilter: nft_set_rbtree: continue traversal if element is inactive Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
To: pve-devel
Move base_seq into the net structure. Both get incremented a lot nearly
at the same time in the commit.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
..._tables-place-base_seq-in-struct-net.patch | 310 ++++++++++++++++++
1 file changed, 310 insertions(+)
create mode 100644 patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
diff --git a/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch b/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
new file mode 100644
index 000000000000..63a18e44e169
--- /dev/null
+++ b/patches/kernel/0016-netfilter-nf_tables-place-base_seq-in-struct-net.patch
@@ -0,0 +1,310 @@
+From 7d566006c0aa2461aa263a94e0edc73637750bab Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:09:43 +0200
+Subject: [PATCH 3/5] netfilter: nf_tables: place base_seq in struct net
+
+This will soon be read from packet path around same time as the gencursor.
+
+Both gencursor and base_seq get incremented almost at the same time, so
+it makes sense to place them in the same structure.
+
+This doesn't increase struct net size on 64bit due to padding.
+
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ include/net/netfilter/nf_tables.h | 1 -
+ include/net/netns/nftables.h | 1 +
+ net/netfilter/nf_tables_api.c | 64 ++++++++++++++++---------------
+ 3 files changed, 34 insertions(+), 32 deletions(-)
+
+diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
+index 803d5f1601f9..56c6698ed5bf 100644
+--- a/include/net/netfilter/nf_tables.h
++++ b/include/net/netfilter/nf_tables.h
+@@ -1913,7 +1913,6 @@ struct nftables_pernet {
+ struct mutex commit_mutex;
+ u64 table_handle;
+ u64 tstamp;
+- unsigned int base_seq;
+ unsigned int gc_seq;
+ u8 validate_state;
+ struct work_struct destroy_work;
+diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
+index cc8060c017d5..99dd166c5d07 100644
+--- a/include/net/netns/nftables.h
++++ b/include/net/netns/nftables.h
+@@ -3,6 +3,7 @@
+ #define _NETNS_NFTABLES_H_
+
+ struct netns_nftables {
++ unsigned int base_seq;
+ u8 gencursor;
+ };
+
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index a133e1c175ce..f9e7f056ea5b 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -1096,11 +1096,14 @@ nf_tables_chain_type_lookup(struct net *net, const struct nlattr *nla,
+ return ERR_PTR(-ENOENT);
+ }
+
+-static __be16 nft_base_seq(const struct net *net)
++static unsigned int nft_base_seq(const struct net *net)
+ {
+- struct nftables_pernet *nft_net = nft_pernet(net);
++ return READ_ONCE(net->nft.base_seq);
++}
+
+- return htons(nft_net->base_seq & 0xffff);
++static __be16 nft_base_seq_be16(const struct net *net)
++{
++ return htons(nft_base_seq(net) & 0xffff);
+ }
+
+ static const struct nla_policy nft_table_policy[NFTA_TABLE_MAX + 1] = {
+@@ -1120,7 +1123,7 @@ static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+- NFNETLINK_V0, nft_base_seq(net));
++ NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -1212,7 +1215,7 @@ static int nf_tables_dump_tables(struct sk_buff *skb,
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -1983,7 +1986,7 @@ static int nf_tables_fill_chain_info(struct sk_buff *skb, struct net *net,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+- NFNETLINK_V0, nft_base_seq(net));
++ NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -2084,7 +2087,7 @@ static int nf_tables_dump_chains(struct sk_buff *skb,
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -3584,7 +3587,7 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
+ u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+
+ nlh = nfnl_msg_put(skb, portid, seq, type, flags, family, NFNETLINK_V0,
+- nft_base_seq(net));
++ nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -3752,7 +3755,7 @@ static int nf_tables_dump_rules(struct sk_buff *skb,
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -3963,7 +3966,7 @@ static int nf_tables_getrule_reset(struct sk_buff *skb,
+ buf = kasprintf(GFP_ATOMIC, "%.*s:%u",
+ nla_len(nla[NFTA_RULE_TABLE]),
+ (char *)nla_data(nla[NFTA_RULE_TABLE]),
+- nft_net->base_seq);
++ nft_base_seq(net));
+ audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1,
+ AUDIT_NFT_OP_RULE_RESET, GFP_ATOMIC);
+ kfree(buf);
+@@ -4776,7 +4779,7 @@ static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
+- NFNETLINK_V0, nft_base_seq(ctx->net));
++ NFNETLINK_V0, nft_base_seq_be16(ctx->net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -4917,7 +4920,7 @@ static int nf_tables_dump_sets(struct sk_buff *skb, struct netlink_callback *cb)
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (ctx->family != NFPROTO_UNSPEC &&
+@@ -6094,7 +6097,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (dump_ctx->ctx.family != NFPROTO_UNSPEC &&
+@@ -6123,7 +6126,7 @@ static int nf_tables_dump_set(struct sk_buff *skb, struct netlink_callback *cb)
+ seq = cb->nlh->nlmsg_seq;
+
+ nlh = nfnl_msg_put(skb, portid, seq, event, NLM_F_MULTI,
+- table->family, NFNETLINK_V0, nft_base_seq(net));
++ table->family, NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -6216,7 +6219,7 @@ static int nf_tables_fill_setelem_info(struct sk_buff *skb,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
+- NFNETLINK_V0, nft_base_seq(ctx->net));
++ NFNETLINK_V0, nft_base_seq_be16(ctx->net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -6515,7 +6518,7 @@ static int nf_tables_getsetelem_reset(struct sk_buff *skb,
+ }
+ nelems++;
+ }
+- audit_log_nft_set_reset(dump_ctx.ctx.table, nft_net->base_seq, nelems);
++ audit_log_nft_set_reset(dump_ctx.ctx.table, nft_base_seq(info->net), nelems);
+
+ out_unlock:
+ rcu_read_unlock();
+@@ -8266,7 +8269,7 @@ static int nf_tables_fill_obj_info(struct sk_buff *skb, struct net *net,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+- NFNETLINK_V0, nft_base_seq(net));
++ NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -8330,7 +8333,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -8364,7 +8367,7 @@ static int nf_tables_dump_obj(struct sk_buff *skb, struct netlink_callback *cb)
+ idx++;
+ }
+ if (ctx->reset && entries)
+- audit_log_obj_reset(table, nft_net->base_seq, entries);
++ audit_log_obj_reset(table, nft_base_seq(net), entries);
+ if (rc < 0)
+ break;
+ }
+@@ -8533,7 +8536,7 @@ static int nf_tables_getobj_reset(struct sk_buff *skb,
+ buf = kasprintf(GFP_ATOMIC, "%.*s:%u",
+ nla_len(nla[NFTA_OBJ_TABLE]),
+ (char *)nla_data(nla[NFTA_OBJ_TABLE]),
+- nft_net->base_seq);
++ nft_base_seq(net));
+ audit_log_nfcfg(buf, info->nfmsg->nfgen_family, 1,
+ AUDIT_NFT_OP_OBJ_RESET, GFP_ATOMIC);
+ kfree(buf);
+@@ -8640,7 +8643,7 @@ void nft_obj_notify(struct net *net, const struct nft_table *table,
+ {
+ struct nftables_pernet *nft_net = nft_pernet(net);
+ char *buf = kasprintf(gfp, "%s:%u",
+- table->name, nft_net->base_seq);
++ table->name, nft_base_seq(net));
+
+ audit_log_nfcfg(buf,
+ family,
+@@ -9288,7 +9291,7 @@ static int nf_tables_fill_flowtable_info(struct sk_buff *skb, struct net *net,
+
+ event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+ nlh = nfnl_msg_put(skb, portid, seq, event, flags, family,
+- NFNETLINK_V0, nft_base_seq(net));
++ NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+@@ -9356,7 +9359,7 @@ static int nf_tables_dump_flowtable(struct sk_buff *skb,
+
+ rcu_read_lock();
+ nft_net = nft_pernet(net);
+- cb->seq = READ_ONCE(nft_net->base_seq);
++ cb->seq = nft_base_seq(net);
+
+ list_for_each_entry_rcu(table, &nft_net->tables, list) {
+ if (family != NFPROTO_UNSPEC && family != table->family)
+@@ -9541,17 +9544,16 @@ static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
+ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
+ u32 portid, u32 seq)
+ {
+- struct nftables_pernet *nft_net = nft_pernet(net);
+ struct nlmsghdr *nlh;
+ char buf[TASK_COMM_LEN];
+ int event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWGEN);
+
+ nlh = nfnl_msg_put(skb, portid, seq, event, 0, AF_UNSPEC,
+- NFNETLINK_V0, nft_base_seq(net));
++ NFNETLINK_V0, nft_base_seq_be16(net));
+ if (!nlh)
+ goto nla_put_failure;
+
+- if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_net->base_seq)) ||
++ if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_base_seq(net))) ||
+ nla_put_be32(skb, NFTA_GEN_PROC_PID, htonl(task_pid_nr(current))) ||
+ nla_put_string(skb, NFTA_GEN_PROC_NAME, get_task_comm(buf, current)))
+ goto nla_put_failure;
+@@ -10727,11 +10729,11 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ * Bump generation counter, invalidate any dump in progress.
+ * Cannot fail after this point.
+ */
+- base_seq = READ_ONCE(nft_net->base_seq);
++ base_seq = nft_base_seq(net);
+ while (++base_seq == 0)
+ ;
+
+- WRITE_ONCE(nft_net->base_seq, base_seq);
++ WRITE_ONCE(net->nft.base_seq, base_seq);
+
+ gc_seq = nft_gc_seq_begin(nft_net);
+
+@@ -10940,7 +10942,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+
+ nft_commit_notify(net, NETLINK_CB(skb).portid);
+ nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN);
+- nf_tables_commit_audit_log(&adl, nft_net->base_seq);
++ nf_tables_commit_audit_log(&adl, nft_base_seq(net));
+
+ nft_gc_seq_end(nft_net, gc_seq);
+ nft_net->validate_state = NFT_VALIDATE_SKIP;
+@@ -11265,7 +11267,7 @@ static bool nf_tables_valid_genid(struct net *net, u32 genid)
+ mutex_lock(&nft_net->commit_mutex);
+ nft_net->tstamp = get_jiffies_64();
+
+- genid_ok = genid == 0 || nft_net->base_seq == genid;
++ genid_ok = genid == 0 || nft_base_seq(net) == genid;
+ if (!genid_ok)
+ mutex_unlock(&nft_net->commit_mutex);
+
+@@ -11902,7 +11904,7 @@ static int __net_init nf_tables_init_net(struct net *net)
+ INIT_LIST_HEAD(&nft_net->module_list);
+ INIT_LIST_HEAD(&nft_net->notify_list);
+ mutex_init(&nft_net->commit_mutex);
+- nft_net->base_seq = 1;
++ net->nft.base_seq = 1;
+ nft_net->gc_seq = 0;
+ nft_net->validate_state = NFT_VALIDATE_SKIP;
+ INIT_WORK(&nft_net->destroy_work, nf_tables_trans_destroy_work);
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
` (2 preceding siblings ...)
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 3/5] kernel: backport: netfilter: nf_tables: place base_seq in struct net Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change Gabriel Goller
4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
To: pve-devel
Helper for the actual bugfix.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...les-make-nft_set_do_lookup-available.patch | 86 +++++++++++++++++++
1 file changed, 86 insertions(+)
create mode 100644 patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
diff --git a/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch b/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
new file mode 100644
index 000000000000..0194b7e7776f
--- /dev/null
+++ b/patches/kernel/0017-netfilter-nf_tables-make-nft_set_do_lookup-available.patch
@@ -0,0 +1,86 @@
+From 35120b5cb4467a234f4ffecc52c7ff6630a31907 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:10:11 +0200
+Subject: [PATCH 4/5] netfilter: nf_tables: make nft_set_do_lookup available
+ unconditionally
+
+This function was added for retpoline mitigation and is replaced by a
+static inline helper if mitigations are not enabled.
+
+Enable this helper function unconditionally so next patch can add a lookup
+restart mechanism to fix possible false negatives while transactions are
+in progress.
+
+Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would
+then need the same copypaste loop.
+
+This patch is separate to ease review of the actual bug fix.
+
+Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
+Signed-off-by: Florian Westphal <fw@strlen.de>
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ include/net/netfilter/nf_tables_core.h | 10 ++--------
+ net/netfilter/nft_lookup.c | 11 ++++++++---
+ 2 files changed, 10 insertions(+), 11 deletions(-)
+
+diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
+index 03b6165756fc..04fc4a411a86 100644
+--- a/include/net/netfilter/nf_tables_core.h
++++ b/include/net/netfilter/nf_tables_core.h
+@@ -105,16 +105,10 @@ bool nft_hash_lookup_fast(const struct net *net,
+ const u32 *key, const struct nft_set_ext **ext);
+ bool nft_hash_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext);
++#endif
++
+ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext);
+-#else
+-static inline bool
+-nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+- const u32 *key, const struct nft_set_ext **ext)
+-{
+- return set->ops->lookup(net, set, key, ext);
+-}
+-#endif
+
+ /* called from nft_pipapo_avx2.c */
+ bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set,
+diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
+index 63ef832b8aa7..7d0add1041bb 100644
+--- a/net/netfilter/nft_lookup.c
++++ b/net/netfilter/nft_lookup.c
+@@ -24,10 +24,10 @@ struct nft_lookup {
+ struct nft_set_binding binding;
+ };
+
+-#ifdef CONFIG_MITIGATION_RETPOLINE
+-bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
++static bool __nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext)
+ {
++#ifdef CONFIG_MITIGATION_RETPOLINE
+ if (set->ops == &nft_set_hash_fast_type.ops)
+ return nft_hash_lookup_fast(net, set, key, ext);
+ if (set->ops == &nft_set_hash_type.ops)
+@@ -50,10 +50,15 @@ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ return nft_rbtree_lookup(net, set, key, ext);
+
+ WARN_ON_ONCE(1);
++#endif
+ return set->ops->lookup(net, set, key, ext);
+ }
++bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
++ const u32 *key, const struct nft_set_ext **ext)
++{
++ return __nft_set_do_lookup(net, set, key, ext);
++}
+ EXPORT_SYMBOL_GPL(nft_set_do_lookup);
+-#endif
+
+ void nft_lookup_eval(const struct nft_expr *expr,
+ struct nft_regs *regs,
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* [pve-devel] [PATCH pve-kernel 5/5] kernel: backport: netfilter: nf_tables: restart set lookup on base_seq change
2025-09-11 10:05 [pve-devel] [PATCH kernel 0/5] backport nftables atomicity fix Gabriel Goller
` (3 preceding siblings ...)
2025-09-11 10:05 ` [pve-devel] [PATCH pve-kernel 4/5] kernel: backport: netfilter: nf_tables: make nft_set_do_lookup available unconditionally Gabriel Goller
@ 2025-09-11 10:05 ` Gabriel Goller
4 siblings, 0 replies; 6+ messages in thread
From: Gabriel Goller @ 2025-09-11 10:05 UTC (permalink / raw)
To: pve-devel
If a set lookup fails, retry again if the base_seq has changed. This
means that if a lookup does not return a value, check if there was a
generation change in the meantime. If there was, redo the lookup. Exact
explanation is inside the patch.
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
...les-restart-set-lookup-on-base_seq-c.patch | 148 ++++++++++++++++++
1 file changed, 148 insertions(+)
create mode 100644 patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
diff --git a/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
new file mode 100644
index 000000000000..26b28ffc0aa1
--- /dev/null
+++ b/patches/kernel/0018-netfilter-nf_tables-restart-set-lookup-on-base_seq-c.patch
@@ -0,0 +1,148 @@
+From 35dd20f6e27b4a6b0b6c2cbf6ebf8eab33121378 Mon Sep 17 00:00:00 2001
+From: Gabriel Goller <g.goller@proxmox.com>
+Date: Wed, 10 Sep 2025 12:18:18 +0200
+Subject: [PATCH 5/5] netfilter: nf_tables: restart set lookup on base_seq
+ change
+
+The hash, hash_fast, rhash and bitwise sets may indicate no result even
+though a matching element exists during a short time window while other
+cpu is finalizing the transaction.
+
+This happens when the hash lookup/bitwise lookup function has picked up
+the old genbit, right before it was toggled by nf_tables_commit(), but
+then the same cpu managed to unlink the matching old element from the
+hash table:
+
+cpu0 cpu1
+ has added new elements to clone
+ has marked elements as being
+ inactive in new generation
+ perform lookup in the set
+ enters commit phase:
+ A) observes old genbit
+ increments base_seq
+I) increments the genbit
+II) removes old element from the set
+ B) finds matching element
+ C) returns no match: found
+ element is not valid in old
+ generation
+
+ Next lookup observes new genbit and
+ finds matching e2.
+
+Consider a packet matching element e1, e2.
+
+cpu0 processes following transaction:
+1. remove e1
+2. adds e2, which has same key as e1.
+
+P matches both e1 and e2. Therefore, cpu1 should always find a match
+for P. Due to above race, this is not the case:
+
+cpu1 observed the old genbit. e2 will not be considered once it is found.
+The element e1 is not found anymore if cpu0 managed to unlink it from the
+hlist before cpu1 found it during list traversal.
+
+The situation only occurs for a brief time period, lookups happening
+after I) observe new genbit and return e2.
+
+This problem exists in all set types except nft_set_pipapo, so fix it once
+in nft_lookup rather than each set ops individually.
+
+Sample the base sequence counter, which gets incremented right before the
+genbit is changed.
+
+Then, if no match is found, retry the lookup if the base sequence was
+altered in between.
+
+If the base sequence hasn't changed:
+ - No update took place: no-match result is expected.
+ This is the common case. or:
+ - nf_tables_commit() hasn't progressed to genbit update yet.
+ Old elements were still visible and nomatch result is expected, or:
+ - nf_tables_commit updated the genbit:
+ We picked up the new base_seq, so the lookup function also picked
+ up the new genbit, no-match result is expected.
+
+If the old genbit was observed, then nft_lookup also picked up the old
+base_seq: nft_lookup_should_retry() returns true and relookup is performed
+in the new generation.
+
+Thanks to Pablo Neira Ayuso for reviewing an earlier version of this
+patchset, for suggesting re-use of existing base_seq and placement of
+the restart loop in nft_set_do_lookup().
+
+Signed-off-by: Florian Westphal <fw@strlen.de>
+[GG: this patch needed a bigger change because __nft_set_do_lookup
+doesn't ext pointer yet, but still returns a bool, so we just check if
+the bool is false, so if no element has been found and retry.]
+Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
+---
+ net/netfilter/nf_tables_api.c | 3 ++-
+ net/netfilter/nft_lookup.c | 32 +++++++++++++++++++++++++++++++-
+ 2 files changed, 33 insertions(+), 2 deletions(-)
+
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index f9e7f056ea5b..40a32ef86470 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -10733,7 +10733,8 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
+ while (++base_seq == 0)
+ ;
+
+- WRITE_ONCE(net->nft.base_seq, base_seq);
++ /* pairs with smp_load_acquire in nft_lookup_eval */
++ smp_store_release(&net->nft.base_seq, base_seq);
+
+ gc_seq = nft_gc_seq_begin(nft_net);
+
+diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
+index 7d0add1041bb..fc0eb91d94c0 100644
+--- a/net/netfilter/nft_lookup.c
++++ b/net/netfilter/nft_lookup.c
+@@ -53,10 +53,40 @@ static bool __nft_set_do_lookup(const struct net *net, const struct nft_set *set
+ #endif
+ return set->ops->lookup(net, set, key, ext);
+ }
++
++static unsigned int nft_base_seq(const struct net *net)
++{
++ /* pairs with smp_store_release() in nf_tables_commit() */
++ return smp_load_acquire(&net->nft.base_seq);
++}
++
++static bool nft_lookup_should_retry(const struct net *net, unsigned int seq)
++{
++ return unlikely(seq != nft_base_seq(net));
++}
++
+ bool nft_set_do_lookup(const struct net *net, const struct nft_set *set,
+ const u32 *key, const struct nft_set_ext **ext)
+ {
+- return __nft_set_do_lookup(net, set, key, ext);
++ bool result;
++ unsigned int base_seq;
++
++ do {
++ base_seq = nft_base_seq(net);
++
++ result = __nft_set_do_lookup(net, set, key, ext);
++ if (result)
++ break;
++ /* No match? There is a small chance that lookup was
++ * performed in the old generation, but nf_tables_commit()
++ * already unlinked a (matching) element.
++ *
++ * We need to repeat the lookup to make sure that we didn't
++ * miss a matching element in the new generation.
++ */
++ } while (nft_lookup_should_retry(net, base_seq));
++
++ return result;
+ }
+ EXPORT_SYMBOL_GPL(nft_set_do_lookup);
+
+--
+2.47.3
+
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 6+ messages in thread