Date: Thu, 26 Mar 2026 15:48:02 +0800
From: "Kefu Chai" <k.chai@proxmox.com>
To: "Dietmar Maurer", "DERUMIER, Alexandre"
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
References: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
In-Reply-To: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
List-Id: Proxmox VE development discussion

Thanks Dietmar and Alexandre for the replies.

> Such things are usually easy to fix by using a custom pool
> allocator, freeing all allocated objects at once. But I don't know
> the Ceph code, so I am not sure if it is possible to apply such
> pattern to the Ceph code ...

Good question — I went and audited librbd's allocation patterns in the
Ceph squid source to check exactly this.

Short answer: librbd does NOT use a pool allocator. All I/O-path
allocations go directly to the system malloc/new.

Ceph does have a `mempool` system (src/include/mempool.h), but it's
tracking-only — it provides memory accounting and observability, not
actual pooling. The allocate/deallocate paths are just:

    // mempool.h:375
    T* r = reinterpret_cast<T*>(new char[total]);
    // mempool.h:392
    delete[] reinterpret_cast<char*>(p);

And there is no "rbd" or "librbd" mempool defined at all — the existing
pools are for bluestore, osd internals, and buffer management.

The per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
custom allocator, no boost::pool, no arena, no slab. Buffer data (the
actual read/write payloads) uses malloc/posix_memalign, again with
mempool tracking counters but no pooling.
So you're right that a pool allocator would be the proper upstream fix.
But it would be a significant Ceph-side change — the I/O dispatch layer
touches many files across src/librbd/io/ and the
allocation/deallocation is spread across different threads (the
submitter thread allocates, the OSD callback thread deallocates), which
makes arena-per-request tricky.

This is why the system allocator choice has such an outsized impact:
with no application-level pooling, every I/O operation generates a
burst of small allocations that hit the system allocator directly.
tcmalloc and jemalloc handle this pattern well (per-thread caches,
size-class freelists); glibc's ptmalloc2 does not.

> The problem with that is that it can halt/delay the whole application
> while freeing memory?
> We need to make sure that there is no noticeable delay.

For gperftools tcmalloc, `ReleaseFreeMemory()` calls
`ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
`PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
round-robins through the page heap's free span lists (`free_[]` normal
lists and `large_normal_`) and calls `ReleaseSpan()` →
`madvise(MADV_DONTNEED)` on each free span. It does NOT walk allocated
memory or compact the heap — it only touches spans that are already
free in tcmalloc's page cache.

It does hold the page heap spinlock during the walk, so concurrent
allocations on other threads would briefly contend on that lock. That
said, we'd only call this once at backup completion (the same place the
current malloc_trim(4 MiB) call runs), not on any hot path.

For comparison, glibc's malloc_trim iterates over all arenas (locking
each in turn) and trims free space at the top of each heap segment.
Under fragmentation with many arenas, this can be more expensive than
tcmalloc's walk of its free span lists.
> If I remember, it was because I had lower performance than jemalloc
> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
> also the tcmalloc 2.2 at this time. (it was resolved later with
> tcmalloc 2.4)

Thanks for the context and the link! That explains the 8-day turnaround.

I checked the gperftools repo — interestingly, the thread cache default
(kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
(8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
between 2.2 and 2.4 is commit 7da5bd0, which flipped
TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
tcmalloc would hold onto freed pages indefinitely unless explicitly
told to decommit, leading to higher RSS — that, combined with the
2.2-era thread cache behavior, likely explains the poor showing vs
jemalloc. PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so
all of this is long resolved.

So to summarize where the two historical concerns stand today:

1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
   2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
   false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
   gperftools 2.16 — this is a non-issue.

2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
   backup memory not returned to the OS. Addressable by calling the
   allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
   the backup completion path, right where malloc_trim(4 MiB) already
   sits today.

Both blockers are resolved. I think the pragmatic path is:

1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
   build (this also compiles out malloc_trim, which is correct)

2. Add allocator-aware memory release in the backup completion path
   (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)

3. Test the PBS backup memory reclamation to confirm VmRSS comes back
   down after backup

I'll prepare a patch series if this approach sounds reasonable.