Date: Fri, 10 Apr 2026 12:33:27 +0800
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
From: "Kefu Chai"
To: "Kefu Chai", "Dietmar Maurer", "DERUMIER, Alexandre"
References: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
List-Id: Proxmox VE development discussion

patchset sent at https://lore.proxmox.com/pve-devel/20260410043027.3621673-1-k.chai@proxmox.com/T/#t

On Thu Mar 26, 2026 at 3:48 PM CST, Kefu Chai wrote:
> Thanks Dietmar and
> Alexandre for the replies.
>
>> Such things are usually easy to fix by using a custom pool
>> allocator, freeing all allocated objects at once. But I don't know
>> the Ceph code, so I am not sure if it is possible to apply such a
>> pattern to the Ceph code ...
>
> Good question -- I went and audited librbd's allocation patterns in the
> Ceph squid source to check exactly this.
>
> Short answer: librbd does NOT use a pool allocator. All I/O-path
> allocations go directly to the system malloc/new.
>
> Ceph does have a `mempool` system (src/include/mempool.h), but it's
> tracking-only -- it provides memory accounting and observability, not
> actual pooling. The allocate/deallocate paths are just:
>
>   // mempool.h:375
>   T* r = reinterpret_cast<T*>(new char[total]);
>   // mempool.h:392
>   delete[] reinterpret_cast<char*>(p);
>
> And there is no "rbd" or "librbd" mempool defined at all -- the existing
> pools are for bluestore, osd internals, and buffer management. The
> per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
> ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
> custom allocator, no boost::pool, no arena, no slab.
>
> Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
> again with mempool tracking counters but no pooling.
>
> So you're right that a pool allocator would be the proper upstream fix.
> But it would be a significant Ceph-side change -- the I/O dispatch layer
> touches many files across src/librbd/io/, and allocation/deallocation
> is spread across different threads (the submitter thread allocates, the
> OSD callback thread deallocates), which makes arena-per-request tricky.
>
> This is why the system allocator choice has such an outsized impact:
> with no application-level pooling, every I/O operation generates a
> burst of small allocations that hit the system allocator directly.
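To make that "tracking-only" pattern concrete, here is a rough, hypothetical sketch (the `TrackingPool` name and structure are mine, not Ceph's actual code): every request is counted for observability, then forwarded straight to the system allocator, so pooling never happens.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Hypothetical sketch of a tracking-only allocator in the style
// described above: a byte counter gives accounting/observability,
// but every allocation still goes straight to global operator
// new/delete, so the system allocator absorbs the full burst of
// small per-I/O allocations.
struct TrackingPool {
    std::atomic<std::size_t> allocated_bytes{0};  // accounting only

    void* allocate(std::size_t n) {
        allocated_bytes.fetch_add(n, std::memory_order_relaxed);
        return ::operator new(n);  // no pooling: straight to the system allocator
    }

    void deallocate(void* p, std::size_t n) {
        allocated_bytes.fetch_sub(n, std::memory_order_relaxed);
        ::operator delete(p);      // likewise freed directly
    }
};
```

The counter makes per-pool usage visible (what Ceph's mempool stats expose), but the allocator underneath still sees every individual request.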
> tcmalloc and jemalloc handle this pattern well (per-thread caches,
> size-class freelists); glibc's ptmalloc2 does not.
>
>> The problem with that is that it can halt/delay the whole application
>> while freeing memory?
>> We need to make sure that there is no noticeable delay.
>
> For gperftools tcmalloc, `ReleaseFreeMemory()` calls
> `ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
> `PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
> round-robins through the page heap's free span lists (the `free_[]`
> normal lists and `large_normal_`) and calls `ReleaseSpan()` ->
> `madvise(MADV_DONTNEED)` on each free span. It does NOT walk allocated
> memory or compact the heap -- it only touches spans that are already
> free in tcmalloc's page cache. It does hold the page heap spinlock
> during the walk, so concurrent allocations on other threads would
> briefly contend on that lock.
>
> That said, we'd only call this once at backup completion (the same
> place the current malloc_trim(4 MiB) runs), not on any hot path. For
> comparison, glibc's malloc_trim iterates over all arenas (locking each
> in turn) and trims free space at the top of each heap segment. Under
> fragmentation with many arenas, this can be more expensive than
> tcmalloc's walk of its free span lists.
>
>> If I remember, it was because I had lower performance than jemalloc
>> because of the too-low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
>> and also tcmalloc 2.2 at the time. (it was resolved later with
>> tcmalloc 2.4)
>
> Thanks for the context and the link! That explains the 8-day turnaround.
>
> I checked the gperftools repo -- interestingly, the thread cache default
> (kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
> (8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
> between 2.2 and 2.4 is commit 7da5bd0, which flipped
> TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default.
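For illustration, the free-span release walk can be sketched roughly like this (hypothetical types and the `release_free_spans` name are mine, not tcmalloc's internals): only spans already on a free list are visited, and each is handed back to the kernel with madvise(MADV_DONTNEED).

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a free span: page-aligned address plus a
// length that is a multiple of the page size.
struct Span {
    void*       addr;
    std::size_t len;
};

// Sketch of the release walk described above: allocated memory is never
// walked or compacted; only already-free spans are advised away.
std::size_t release_free_spans(const std::vector<Span>& free_spans) {
    std::size_t released = 0;
    for (const Span& s : free_spans) {
        // MADV_DONTNEED drops the backing pages; the mapping stays
        // valid and refaults as zero-filled pages on the next touch.
        if (madvise(s.addr, s.len, MADV_DONTNEED) == 0)
            released += s.len;
    }
    return released;
}
```

Because nothing is copied or compacted, the cost is proportional to the number of free spans, which is why holding the page-heap spinlock for the walk is usually tolerable off the hot path.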
> In 2.2, tcmalloc would hold onto freed pages indefinitely unless
> explicitly told to decommit, leading to higher RSS -- that, combined
> with the 2.2-era thread cache behavior, likely explains the poor
> showing vs jemalloc.
>
> PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
> this is long resolved.
>
> So to summarize where the two historical concerns stand today:
>
> 1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
>    2.2-era issues -- notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting
>    to false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
>    gperftools 2.16, so this is a non-issue.
>
> 2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-backup
>    memory was not returned to the OS after a PBS backup. Addressable
>    by calling the allocator-specific release API (ReleaseFreeMemory
>    for tcmalloc) in the backup completion path, right where
>    malloc_trim(4 MiB) already sits today.
>
> Both blockers are resolved. I think the pragmatic path is:
>
> 1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
>    build (this also compiles out malloc_trim, which is correct)
> 2. Add allocator-aware memory release in the backup completion path
>    (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
> 3. Test PBS backup memory reclamation to confirm VmRSS comes back
>    down after the backup
>
> I'll prepare a patch series if this approach sounds reasonable.
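For reference, step 2 of that plan could look roughly like the sketch below. The `release_memory_to_os` hook and the dlsym-based detection are my assumptions, not an existing pve-qemu interface; `MallocExtension_ReleaseFreeMemory` is the C symbol gperftools exports, and resolving it at runtime avoids a hard link-time dependency on tcmalloc.

```cpp
#include <dlfcn.h>
#include <malloc.h>

// Hedged sketch of an allocator-aware release hook for the backup
// completion path (hypothetical wiring, not pve-qemu code): use
// tcmalloc's release API when tcmalloc is linked in, otherwise fall
// back to glibc's malloc_trim.
void release_memory_to_os() {
    using release_fn = void (*)();
    // gperftools exports MallocExtension_ReleaseFreeMemory from its
    // C shim; if tcmalloc is not loaded, dlsym returns nullptr.
    auto release = reinterpret_cast<release_fn>(
        dlsym(RTLD_DEFAULT, "MallocExtension_ReleaseFreeMemory"));
    if (release)
        release();       // tcmalloc: madvise free spans back to the OS
    else
        malloc_trim(0);  // glibc fallback: trim free space in the arenas
}
```

Either branch runs once at backup completion, so the cost discussed above stays off the I/O hot path.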