Date: Fri, 10 Apr 2026 12:33:27 +0800
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
From: "Kefu Chai"
To: "Kefu Chai", "Dietmar Maurer", "DERUMIER, Alexandre"
References: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
List-Id: Proxmox VE development discussion

patchset sent at https://lore.proxmox.com/pve-devel/20260410043027.3621673-1-k.chai@proxmox.com/T/#t

On Thu Mar 26, 2026 at 3:48 PM CST, Kefu Chai wrote:
> Thanks Dietmar and
> Alexandre for the replies.
>
>> Such things are usually easy to fix by using a custom pool
>> allocator, freeing all allocated objects at once. But I don't know
>> the Ceph code, so I am not sure if it is possible to apply such a
>> pattern to the Ceph code ...
>
> Good question -- I went and audited librbd's allocation patterns in the
> Ceph squid source to check exactly this.
>
> Short answer: librbd does NOT use a pool allocator. All I/O-path
> allocations go directly to the system malloc/new.
>
> Ceph does have a `mempool` system (src/include/mempool.h), but it's
> tracking-only -- it provides memory accounting and observability, not
> actual pooling. The allocate/deallocate paths are just:
>
>   // mempool.h:375
>   T* r = reinterpret_cast<T*>(new char[total]);
>   // mempool.h:392
>   delete[] reinterpret_cast<char*>(p);
>
> And there is no "rbd" or "librbd" mempool defined at all -- the existing
> pools are for bluestore, osd internals, and buffer management. The
> per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
> ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
> custom allocator, no boost::pool, no arena, no slab.
>
> Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
> again with mempool tracking counters but no pooling.
>
> So you're right that a pool allocator would be the proper upstream fix.
> But it would be a significant Ceph-side change -- the I/O dispatch layer
> touches many files across src/librbd/io/, and allocation/deallocation
> is spread across different threads (the submitter thread allocates, the
> OSD callback thread deallocates), which makes arena-per-request tricky.
>
> This is why the system allocator choice has such an outsized impact:
> with no application-level pooling, every I/O operation generates a
> burst of small allocations that hit the system allocator directly.
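To make that "tracking-only" pattern concrete, here is a rough, hypothetical sketch (the `TrackingPool` name and structure are mine, not Ceph's actual code): every request is counted for observability, then forwarded straight to the system allocator, so pooling never happens.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Hypothetical sketch of a tracking-only allocator in the style
// described above: a byte counter gives accounting/observability,
// but every allocation still goes straight to global operator
// new/delete, so the system allocator absorbs the full burst of
// small per-I/O allocations.
struct TrackingPool {
    std::atomic<std::size_t> allocated_bytes{0};  // accounting only

    void* allocate(std::size_t n) {
        allocated_bytes.fetch_add(n, std::memory_order_relaxed);
        return ::operator new(n);  // no pooling: straight to the system allocator
    }

    void deallocate(void* p, std::size_t n) {
        allocated_bytes.fetch_sub(n, std::memory_order_relaxed);
        ::operator delete(p);      // likewise freed directly
    }
};
```

The counter makes per-pool usage visible (what Ceph's mempool stats expose), but the allocator underneath still sees every individual request.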
> tcmalloc and jemalloc handle this pattern well (per-thread caches,
> size-class freelists); glibc's ptmalloc2 does not.
>
>> The problem with that is that it can halt/delay the whole application
>> while freeing memory?
>> We need to make sure that there is no noticeable delay.
>
> For gperftools tcmalloc, `ReleaseFreeMemory()` calls
> `ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
> `PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
> round-robins through the page heap's free span lists (the `free_[]`
> normal lists and `large_normal_`) and calls `ReleaseSpan()` ->
> `madvise(MADV_DONTNEED)` on each free span. It does NOT walk allocated
> memory or compact the heap -- it only touches spans that are already
> free in tcmalloc's page cache. It does hold the page heap spinlock
> during the walk, so concurrent allocations on other threads would
> briefly contend on that lock.
>
> That said, we'd only call this once at backup completion (the same
> place the current malloc_trim(4 MiB) runs), not on any hot path. For
> comparison, glibc's malloc_trim iterates over all arenas (locking each
> in turn) and trims free space at the top of each heap segment. Under
> fragmentation with many arenas, this can be more expensive than
> tcmalloc's walk of its free span lists.
>
>> If I remember, it was because I had lower performance than jemalloc
>> because of the too-low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
>> and also tcmalloc 2.2 at the time. (it was resolved later with
>> tcmalloc 2.4)
>
> Thanks for the context and the link! That explains the 8-day turnaround.
>
> I checked the gperftools repo -- interestingly, the thread cache default
> (kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
> (8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
> between 2.2 and 2.4 is commit 7da5bd0, which flipped
> TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default.
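For illustration, the free-span release walk can be sketched roughly like this (hypothetical types and the `release_free_spans` name are mine, not tcmalloc's internals): only spans already on a free list are visited, and each is handed back to the kernel with madvise(MADV_DONTNEED).

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a free span: page-aligned address plus a
// length that is a multiple of the page size.
struct Span {
    void*       addr;
    std::size_t len;
};

// Sketch of the release walk described above: allocated memory is never
// walked or compacted; only already-free spans are advised away.
std::size_t release_free_spans(const std::vector<Span>& free_spans) {
    std::size_t released = 0;
    for (const Span& s : free_spans) {
        // MADV_DONTNEED drops the backing pages; the mapping stays
        // valid and refaults as zero-filled pages on the next touch.
        if (madvise(s.addr, s.len, MADV_DONTNEED) == 0)
            released += s.len;
    }
    return released;
}
```

Because nothing is copied or compacted, the cost is proportional to the number of free spans, which is why holding the page-heap spinlock for the walk is usually tolerable off the hot path.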
> In 2.2, tcmalloc would hold onto freed pages indefinitely unless
> explicitly told to decommit, leading to higher RSS -- that, combined
> with the 2.2-era thread cache behavior, likely explains the poor
> showing vs jemalloc.
>
> PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
> this is long resolved.
>
> So to summarize where the two historical concerns stand today:
>
> 1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
>    2.2-era issues -- notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting
>    to false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
>    gperftools 2.16, so this is a non-issue.
>
> 2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-backup
>    memory was not returned to the OS after a PBS backup. Addressable
>    by calling the allocator-specific release API (ReleaseFreeMemory
>    for tcmalloc) in the backup completion path, right where
>    malloc_trim(4 MiB) already sits today.
>
> Both blockers are resolved. I think the pragmatic path is:
>
> 1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
>    build (this also compiles out malloc_trim, which is correct)
> 2. Add allocator-aware memory release in the backup completion path
>    (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
> 3. Test PBS backup memory reclamation to confirm VmRSS comes back
>    down after the backup
>
> I'll prepare a patch series if this approach sounds reasonable.
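For reference, step 2 of that plan could look roughly like the sketch below. The `release_memory_to_os` hook and the dlsym-based detection are my assumptions, not an existing pve-qemu interface; `MallocExtension_ReleaseFreeMemory` is the C symbol gperftools exports, and resolving it at runtime avoids a hard link-time dependency on tcmalloc.

```cpp
#include <dlfcn.h>
#include <malloc.h>

// Hedged sketch of an allocator-aware release hook for the backup
// completion path (hypothetical wiring, not pve-qemu code): use
// tcmalloc's release API when tcmalloc is linked in, otherwise fall
// back to glibc's malloc_trim.
void release_memory_to_os() {
    using release_fn = void (*)();
    // gperftools exports MallocExtension_ReleaseFreeMemory from its
    // C shim; if tcmalloc is not loaded, dlsym returns nullptr.
    auto release = reinterpret_cast<release_fn>(
        dlsym(RTLD_DEFAULT, "MallocExtension_ReleaseFreeMemory"));
    if (release)
        release();       // tcmalloc: madvise free spans back to the OS
    else
        malloc_trim(0);  // glibc fallback: trim free space in the arenas
}
```

Either branch runs once at backup completion, so the cost discussed above stays off the I/O hot path.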