* [pve-devel] [PATCH kernel trixie-6.14] backport fix for memory leak with Intel NICs using the ice driver
@ 2025-12-12 16:18 Friedrich Weber
From: Friedrich Weber @ 2025-12-12 16:18 UTC (permalink / raw)
To: pve-devel

Users reported steadily growing memory usage with certain 6.8 and 6.14
kernels on hosts using Intel NICs with the ice driver, especially if
the NICs are used for Ceph traffic and configured with high MTU [1].

Kernel 6.14.11-3 was reported to alleviate but not entirely fix the
issue, while 6.17 was reported to fix the issue [2]. Hence, backport
the patch below, which fixes a memory leak in the ice driver and is
contained in 6.17 but not in 6.14.11-3.

[1] https://forum.proxmox.com/threads/168961/
[2] https://forum.proxmox.com/threads/168961/post-799361

Suggested-by: Laurențiu Leahu-Vlăducu <l.leahu-vladucu@proxmox.com>
Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
---
Notes:
Intended mostly for backporting to bookworm-6.14, to offer PVE 8 users
a nicer fix for this issue (upgrading to the opt-in 6.14 kernel)
instead of pinning an unaffected old 6.8 kernel.

Did a very basic smoke test by booting a patched kernel on a machine
with ice NICs; dmesg showed no issues.
...-Rx-page-leak-on-multi-buffer-frames.patch | 271 ++++++++++++++++++
1 file changed, 271 insertions(+)
create mode 100644 patches/kernel/0028-ice-fix-Rx-page-leak-on-multi-buffer-frames.patch
diff --git a/patches/kernel/0028-ice-fix-Rx-page-leak-on-multi-buffer-frames.patch b/patches/kernel/0028-ice-fix-Rx-page-leak-on-multi-buffer-frames.patch
new file mode 100644
index 0000000..3af45b6
--- /dev/null
+++ b/patches/kernel/0028-ice-fix-Rx-page-leak-on-multi-buffer-frames.patch
@@ -0,0 +1,271 @@
+From 0064fddbc81c14ae6868a9cc0964cd7b1048557a Mon Sep 17 00:00:00 2001
+From: Jacob Keller <jacob.e.keller@intel.com>
+Date: Mon, 25 Aug 2025 16:00:14 -0700
+Subject: [PATCH] ice: fix Rx page leak on multi-buffer frames
+
+The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
+buffer in the current frame. This function was introduced as part of
+handling multi-buffer XDP support in the ice driver.
+
+It works by iterating over the buffers from first_desc up to 1 plus the
+total number of fragments in the frame, cached from before the XDP program
+was executed.
+
+If the hardware posts a descriptor with a size of 0, the logic used in
+ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
+as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
+fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
+call ice_put_rx_buf().
+
+Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
+page or free it. This leaves a stale page in the ring, as we don't
+increment next_to_alloc.
+
+The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
+properly, and that it always points to a buffer with a NULL page. Since
+this function doesn't check, it will happily recycle a page over the top
+of the next_to_alloc buffer, losing track of the old page.
+
+Note that this leak only occurs for multi-buffer frames. The
+ice_put_rx_mbuf() function always handles at least one buffer, so a
+single-buffer frame will always get handled correctly. It is not clear
+precisely why the hardware hands us descriptors with a size of 0 sometimes,
+but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
+
+To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
+all buffers between first_desc and next_to_clean. Borrow the logic of a
+similar function in i40e used for this same purpose. Use the same logic
+also in ice_get_pgcnts().
+
+Instead of iterating over just the number of fragments, use a loop which
+iterates until the current index reaches to the next_to_clean element just
+past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
+call ice_put_rx_buf() on the last buffer of the frame indicating the end of
+packet.
+
+For non-linear (multi-buffer) frames, we need to take care when adjusting
+the pagecnt_bias. An XDP program might release fragments from the tail of
+the frame, in which case that fragment page is already released. Only
+update the pagecnt_bias for the first descriptor and fragments still
+remaining post-XDP program. Take care to only access the shared info for
+fragmented buffers, as this avoids a significant cache miss.
+
+The xdp_xmit value only needs to be updated if an XDP program is run, and
+only once per packet. Drop the xdp_xmit pointer argument from
+ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
+directly. This avoids needing to pass the argument and avoids an extra
+bit-wise OR for each buffer in the frame.
+
+Move the increment of the ntc local variable to ensure its updated *before*
+all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
+requires the index of the element just after the current frame.
+
+Now that we use an index pointer in the ring to identify the packet, we no
+longer need to track or cache the number of fragments in the rx_ring.
+
+Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
+Cc: Jesper Dangaard Brouer <hawk@kernel.org>
+Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
+Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
+Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
+Tested-by: Michal Kubiak <michal.kubiak@intel.com>
+Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
+Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
+Tested-by: Priya Singh <priyax.singh@intel.com>
+Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
+Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
+(cherry picked from commit 84bf1ac85af84d354c7a2fdbdc0d4efc8aaec34b)
+Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
+---
+ drivers/net/ethernet/intel/ice/ice_txrx.c | 80 ++++++++++-------------
+ drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
+ 2 files changed, 34 insertions(+), 47 deletions(-)
+
+diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
+index 380ba1e8b3b2..dc3be5a76c6c 100644
+--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
++++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
+@@ -865,10 +865,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
+ __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
+ rx_buf->page_offset, size);
+ sinfo->xdp_frags_size += size;
+- /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
+- * can pop off frags but driver has to handle it on its own
+- */
+- rx_ring->nr_frags = sinfo->nr_frags;
+
+ if (page_is_pfmemalloc(rx_buf->page))
+ xdp_buff_set_frag_pfmemalloc(xdp);
+@@ -939,20 +935,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
+ /**
+ * ice_get_pgcnts - grab page_count() for gathered fragments
+ * @rx_ring: Rx descriptor ring to store the page counts on
++ * @ntc: the next to clean element (not included in this frame!)
+ *
+ * This function is intended to be called right before running XDP
+ * program so that the page recycling mechanism will be able to take
+ * a correct decision regarding underlying pages; this is done in such
+ * way as XDP program can change the refcount of page
+ */
+-static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
++static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
+ {
+- u32 nr_frags = rx_ring->nr_frags + 1;
+ u32 idx = rx_ring->first_desc;
+ struct ice_rx_buf *rx_buf;
+ u32 cnt = rx_ring->count;
+
+- for (int i = 0; i < nr_frags; i++) {
++ while (idx != ntc) {
+ rx_buf = &rx_ring->rx_buf[idx];
+ rx_buf->pgcnt = page_count(rx_buf->page);
+
+@@ -1125,62 +1121,51 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
+ }
+
+ /**
+- * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
++ * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
+ * @rx_ring: Rx ring with all the auxiliary data
+ * @xdp: XDP buffer carrying linear + frags part
+- * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
+- * @ntc: a current next_to_clean value to be stored at rx_ring
++ * @ntc: the next to clean element (not included in this frame!)
+ * @verdict: return code from XDP program execution
+ *
+- * Walk through gathered fragments and satisfy internal page
+- * recycle mechanism; we take here an action related to verdict
+- * returned by XDP program;
++ * Called after XDP program is completed, or on error with verdict set to
++ * ICE_XDP_CONSUMED.
++ *
++ * Walk through buffers from first_desc to the end of the frame, releasing
++ * buffers and satisfying internal page recycle mechanism. The action depends
++ * on verdict from XDP program.
+ */
+ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
+- u32 *xdp_xmit, u32 ntc, u32 verdict)
++ u32 ntc, u32 verdict)
+ {
+- u32 nr_frags = rx_ring->nr_frags + 1;
+ u32 idx = rx_ring->first_desc;
+ u32 cnt = rx_ring->count;
+- u32 post_xdp_frags = 1;
+ struct ice_rx_buf *buf;
+- int i;
++ u32 xdp_frags = 0;
++ int i = 0;
+
+ if (unlikely(xdp_buff_has_frags(xdp)))
+- post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
++ xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
+
+- for (i = 0; i < post_xdp_frags; i++) {
++ while (idx != ntc) {
+ buf = &rx_ring->rx_buf[idx];
++ if (++idx == cnt)
++ idx = 0;
+
+- if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
++ /* An XDP program could release fragments from the end of the
++ * buffer. For these, we need to keep the pagecnt_bias as-is.
++ * To do this, only adjust pagecnt_bias for fragments up to
++ * the total remaining after the XDP program has run.
++ */
++ if (verdict != ICE_XDP_CONSUMED)
+ ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
+- *xdp_xmit |= verdict;
+- } else if (verdict & ICE_XDP_CONSUMED) {
++ else if (i++ <= xdp_frags)
+ buf->pagecnt_bias++;
+- } else if (verdict == ICE_XDP_PASS) {
+- ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
+- }
+
+ ice_put_rx_buf(rx_ring, buf);
+-
+- if (++idx == cnt)
+- idx = 0;
+- }
+- /* handle buffers that represented frags released by XDP prog;
+- * for these we keep pagecnt_bias as-is; refcount from struct page
+- * has been decremented within XDP prog and we do not have to increase
+- * the biased refcnt
+- */
+- for (; i < nr_frags; i++) {
+- buf = &rx_ring->rx_buf[idx];
+- ice_put_rx_buf(rx_ring, buf);
+- if (++idx == cnt)
+- idx = 0;
+ }
+
+ xdp->data = NULL;
+ rx_ring->first_desc = ntc;
+- rx_ring->nr_frags = 0;
+ }
+
+ /**
+@@ -1260,6 +1245,10 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
+ /* retrieve a buffer from the ring */
+ rx_buf = ice_get_rx_buf(rx_ring, size, ntc);
+
++ /* Increment ntc before calls to ice_put_rx_mbuf() */
++ if (++ntc == cnt)
++ ntc = 0;
++
+ if (!xdp->data) {
+ void *hard_start;
+
+@@ -1268,24 +1257,23 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
+ xdp_prepare_buff(xdp, hard_start, offset, size, !!offset);
+ xdp_buff_clear_frags_flag(xdp);
+ } else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) {
+- ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, ICE_XDP_CONSUMED);
++ ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED);
+ break;
+ }
+- if (++ntc == cnt)
+- ntc = 0;
+
+ /* skip if it is NOP desc */
+ if (ice_is_non_eop(rx_ring, rx_desc))
+ continue;
+
+- ice_get_pgcnts(rx_ring);
++ ice_get_pgcnts(rx_ring, ntc);
+ xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_desc);
+ if (xdp_verdict == ICE_XDP_PASS)
+ goto construct_skb;
+ total_rx_bytes += xdp_get_buff_len(xdp);
+ total_rx_pkts++;
+
+- ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
++ ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
++ xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR);
+
+ continue;
+ construct_skb:
+@@ -1298,7 +1286,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
+ rx_ring->ring_stats->rx_stats.alloc_page_failed++;
+ xdp_verdict = ICE_XDP_CONSUMED;
+ }
+- ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
++ ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
+
+ if (!skb)
+ break;
+diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
+index 806bce701df3..72954b7655c8 100644
+--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
++++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
+@@ -357,7 +357,6 @@ struct ice_rx_ring {
+ struct ice_tx_ring *xdp_ring;
+ struct ice_rx_ring *next; /* pointer to next ring in q_vector */
+ struct xsk_buff_pool *xsk_pool;
+- u32 nr_frags;
+ u16 max_frame;
+ u16 rx_buf_len;
+ dma_addr_t dma; /* physical address of ring */
+--
+2.47.3
+
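
To make the iteration change easier to follow outside the driver sources,
here is a minimal standalone sketch contrasting the old fragment-count-based
walk with the new walk up to next_to_clean. It is only an illustration under
simplified assumptions: the ring size, the rx_buf type with its "released"
flag, and the two helper functions are made-up stand-ins, not the actual ice
driver code, and the example frame layout (a zero-size buffer in the middle
of a multi-buffer frame) mirrors the scenario described in the commit
message above.

#include <stdbool.h>
#include <stdio.h>

#define RING_SIZE 8

struct rx_buf {
    bool released;              /* page handed back to the recycle logic? */
};

/* Old scheme: walk "1 + cached fragment count" buffers. A zero-size
 * descriptor is skipped by ice_add_xdp_frag() and never counted as a
 * fragment, so the walk stops before the last buffer of the frame and
 * that buffer's page is never put back (the leak).
 */
static void put_rx_mbuf_by_frag_count(struct rx_buf *ring,
                                      unsigned int first_desc,
                                      unsigned int nr_frags)
{
    unsigned int idx = first_desc;

    for (unsigned int i = 0; i < nr_frags + 1; i++) {
        ring[idx].released = true;
        if (++idx == RING_SIZE)
            idx = 0;
    }
}

/* New scheme: walk every buffer from first_desc up to next_to_clean,
 * whether or not it was counted as a fragment.
 */
static void put_rx_mbuf_until_ntc(struct rx_buf *ring,
                                  unsigned int first_desc,
                                  unsigned int ntc)
{
    unsigned int idx = first_desc;

    while (idx != ntc) {
        ring[idx].released = true;
        if (++idx == RING_SIZE)
            idx = 0;
    }
}

int main(void)
{
    /* One frame spans descriptors 2..5, so next_to_clean ends up at 6.
     * Descriptor 4 carried a zero-size buffer and was therefore not
     * counted as a fragment: only descriptors 3 and 5 were (nr_frags == 2).
     */
    struct rx_buf ring[RING_SIZE] = { 0 };
    const unsigned int first_desc = 2, ntc = 6, nr_frags = 2;

    put_rx_mbuf_by_frag_count(ring, first_desc, nr_frags);
    for (unsigned int i = first_desc; i != ntc; i = (i + 1) % RING_SIZE)
        printf("old: desc %u released=%d\n", i, ring[i].released);

    for (unsigned int i = 0; i < RING_SIZE; i++)
        ring[i].released = false;

    put_rx_mbuf_until_ntc(ring, first_desc, ntc);
    for (unsigned int i = first_desc; i != ntc; i = (i + 1) % RING_SIZE)
        printf("new: desc %u released=%d\n", i, ring[i].released);

    return 0;
}

Running the sketch shows descriptor 5 staying unreleased under the old
scheme, which corresponds to the per-frame page leak the patch closes; the
fixed loop mirrors the while (idx != ntc) walks the patch introduces in
ice_get_pgcnts() and ice_put_rx_mbuf().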
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
* [pve-devel] applied: [PATCH kernel trixie-6.14] backport fix for memory leak with Intel NICs using the ice driver
From: Thomas Lamprecht @ 2025-12-15 13:05 UTC (permalink / raw)
To: pve-devel, Friedrich Weber
On Fri, 12 Dec 2025 17:18:29 +0100, Friedrich Weber wrote:
> Users reported steadily growing memory usage with certain 6.8 and 6.14
> kernels on hosts using Intel NICs with the ice driver, especially if
> the NICs are used for Ceph traffic and configured with high MTU.
>
> Kernel 6.14.11-3 was reported to alleviate but not entirely fix the
> issue, while 6.17 was reported to fix the issue [2]. Hence, backport
> the patch below which fixes a memory leak in the ice driver and is
> contained in 6.17, but not in 6.14.11-3.
>
> [...]

Applied and currently also building a backport for the bookworm-based
releases, thanks!

[1/1] backport fix for memory leak with Intel NICs using the ice driver
      commit: 07c838cd52f7a845db29e1eb2e24d8104b46de06