public inbox for pve-devel@lists.proxmox.com
* [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
@ 2026-03-26  2:54 Kefu Chai
  2026-03-26  6:09 ` DERUMIER, Alexandre
  2026-03-26  6:16 ` Dietmar Maurer
  0 siblings, 2 replies; 4+ messages in thread
From: Kefu Chai @ 2026-03-26  2:54 UTC (permalink / raw)
  To: pve-devel; +Cc: Kefu Chai

Hi all,

I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.

## Background: benchmark numbers

The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).

The mechanism is well-understood: librbd creates many small temporary
objects per I/O operation, and glibc's allocator is slow under that
fragmentation pattern.

## Background: malloc_trim interaction

On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.

QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Bonzini's commit aa087962 even makes --enable-malloc-trim +
--enable-malloc=tcmalloc a fatal build error — they are explicitly
mutually exclusive.

## History in pve-qemu

pve-qemu has tried both allocators before:

1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
   later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.

2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
   5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
   3d785ea with the explanation:

     "jemalloc does not play nice with our Rust library
     (proxmox-backup-qemu), specifically it never releases memory
     allocated from Rust to the OS. This leads to a problem with
     larger caches (e.g. for the PBS block driver)."

   Stefan referenced jemalloc#1398 [4]. The upstream fix was
   background_thread:true in MALLOC_CONF, which Stefan described as
   "weirdly hacky."

3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
   to the backup completion path in commit 5f9cb29, because without
   it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
   backup and never came back. The forum thread that motivated this
   was [5].

## The core problem and a possible solution

The reason both allocators were removed boils down to this: after a
PBS backup, the Rust-based proxmox-backup-qemu library frees a large
amount of memory, but neither jemalloc nor tcmalloc returns it to the
OS promptly:

- jemalloc's time-based decay requires ongoing allocation activity to
  make progress; when the app goes idle after backup, nothing triggers
  the decay.

- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
  driven by page-level deallocations, not by time. With the default
  TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
  deallocated. If the app goes idle, no scavenging happens.

But both allocators provide explicit "release everything now" APIs:

- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
  (calls ReleaseToSystem(SIZE_T_MAX) internally)
- jemalloc: mallctl("arena.<i>.purge", ...)

The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* e.g. mallctl("arena.<i>.purge", NULL, NULL, NULL, 0)
         * with <i> = MALLCTL_ARENAS_ALL (jemalloc >= 5.0) */
    #endif

QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but the value of the malloc meson option is available
in meson.build, so adding those defines would be a small change. The
RCU-thread malloc_trim is already handled: --enable-malloc=tcmalloc
compiles it out, and the alternative allocator then manages
steady-state memory on its own.

## Questions for the team

1. Does anyone remember why tcmalloc was removed after only 8 days
   in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
   no explanation.

2. Fiona: given the backup memory reclamation work in 5f9cb29, would
   you see a problem with replacing malloc_trim with the
   allocator-specific release API in the backup completion path?

3. Has the proxmox-backup-qemu library's allocation pattern changed
   since 2020 in ways that might affect this? (e.g., different buffer
   management, Rust allocator changes)

4. Would anyone be willing to test a pve-qemu build with
   --enable-malloc=tcmalloc + the allocator-aware release patch
   against the PBS backup workload? The key metric is VmRSS in
   /proc/<qemu-pid>/status before and after a backup cycle.

I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.

[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/

Regards,
Kefu





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
  2026-03-26  2:54 [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu? Kefu Chai
@ 2026-03-26  6:09 ` DERUMIER, Alexandre
  2026-03-26  6:16 ` Dietmar Maurer
  1 sibling, 0 replies; 4+ messages in thread
From: DERUMIER, Alexandre @ 2026-03-26  6:09 UTC (permalink / raw)
  To: pve-devel, k.chai; +Cc: tchaikov

Hi Kefu !
>>1. Does anyone remember why tcmalloc was removed after only 8 days
>>   in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
>>   no explanation.

If I remember, it was because I had lower performance than jemalloc
because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
also the tcmalloc 2.2 at this time.  (it was resolved later with
tcmalloc 2.4)

See:
https://lists.proxmox.com/pipermail/pve-devel/2015-June/015551.html



* Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
  2026-03-26  2:54 [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu? Kefu Chai
  2026-03-26  6:09 ` DERUMIER, Alexandre
@ 2026-03-26  6:16 ` Dietmar Maurer
  2026-03-26  7:48   ` Kefu Chai
  1 sibling, 1 reply; 4+ messages in thread
From: Dietmar Maurer @ 2026-03-26  6:16 UTC (permalink / raw)
  To: Kefu Chai, pve-devel


> The mechanism is well-understood: librbd creates many small temporary
> objects per I/O operation, and glibc's allocator is slow under that
> fragmentation pattern.

Such things are usually easy to fix by using a custom pool allocator,
freeing all allocated objects at once. But I don't know the Ceph code,
so I am not sure if it is possible to apply such a pattern there ...

> But both allocators provide explicit "release everything now" APIs:
>
> - gperftools: MallocExtension::instance()->ReleaseFreeMemory()
>    (calls ReleaseToSystem(LONG_MAX) internally)
> - jemalloc: mallctl("arena.<i>.purge", ...)

The problem with that is that it can halt/delay the whole application 
while freeing memory?
We need to make sure that there is no noticeable delay.

> ## Questions for the team
>
> 1. Does anyone remember why tcmalloc was removed after only 8 days
>     in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
>     no explanation.
Not really, sorry.
> 2. Fiona: given the backup memory reclamation work in 5f9cb29, would
>     you see a problem with replacing malloc_trim with the
>     allocator-specific release API in the backup completion path?
>
> 3. Has the proxmox-backup-qemu library's allocation pattern changed
>     since 2020 in ways that might affect this? (e.g., different buffer
>     management, Rust allocator changes)

I am not aware of such changes.







* Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
  2026-03-26  6:16 ` Dietmar Maurer
@ 2026-03-26  7:48   ` Kefu Chai
  0 siblings, 0 replies; 4+ messages in thread
From: Kefu Chai @ 2026-03-26  7:48 UTC (permalink / raw)
  To: Dietmar Maurer, DERUMIER, Alexandre, pve-devel

Thanks Dietmar and Alexandre for the replies.

> Such things are usually easy to fix by using a custom pool
> allocator, freeing all allocated objects at once. But I don't know
> the Ceph code, so I am not sure if it is possible to apply such a
> pattern there ...

Good question — I went and audited librbd's allocation patterns in the
Ceph squid source to check exactly this.

Short answer: librbd does NOT use a pool allocator. All I/O-path
allocations go directly to the system malloc/new.

Ceph does have a `mempool` system (src/include/mempool.h), but it's
tracking-only — it provides memory accounting and observability, not
actual pooling. The allocate/deallocate paths are just:

    // mempool.h:375
    T* r = reinterpret_cast<T*>(new char[total]);
    // mempool.h:392
    delete[] reinterpret_cast<char*>(p);

And there is no "rbd" or "librbd" mempool defined at all — the existing
pools are for bluestore, osd internals, and buffer management. The
per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
custom allocator, no boost::pool, no arena, no slab.

Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
again with mempool tracking counters but no pooling.

So you're right that a pool allocator would be the proper upstream fix.
But it would be a significant Ceph-side change — the I/O dispatch layer
touches many files across src/librbd/io/ and the allocation/deallocation
is spread across different threads (submitter thread allocates, OSD
callback thread deallocates), which makes arena-per-request tricky.

This is why the system allocator choice has such an outsized impact:
with no application-level pooling, every I/O operation generates a
burst of small allocations that hit the system allocator directly.
tcmalloc and jemalloc handle this pattern well (per-thread caches,
size-class freelists); glibc's ptmalloc2 does not.

> The problem with that is that it can halt/delay the whole application
> while freeing memory?
> We need to make sure that there is no noticeable delay.

For gperftools tcmalloc, `ReleaseFreeMemory()` calls
`ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
`PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
round-robins through the page heap's free span lists (`free_[]` normal
lists and `large_normal_`) and calls `ReleaseSpan()` → `madvise(MADV_DONTNEED)`
on each free span. It does NOT walk allocated memory or compact the heap
— it only touches spans that are already free in tcmalloc's page cache.
It does hold the page heap spinlock during the walk, so concurrent
allocations on other threads would briefly contend on that lock.

That said, we'd only call this once at backup completion (same place the
current malloc_trim(4 MiB) runs), not on any hot path. For comparison,
glibc's malloc_trim iterates all arenas (locking each in turn) and
trims free space at the top of each heap segment. Under fragmentation
with many arenas, this can be more expensive than tcmalloc's walk of
its free span lists.

> If I remember, it was because I had lower performance than jemalloc
> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
> also the tcmalloc 2.2 at this time. (it was resolved later with
> tcmalloc 2.4)

Thanks for the context and the link! That explains the 8-day turnaround.

I checked the gperftools repo — interestingly, the thread cache default
(kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
(8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
between 2.2 and 2.4 is commit 7da5bd0 which flipped
TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
tcmalloc would hold onto freed pages indefinitely unless explicitly
told to decommit, leading to higher RSS — that combined with the 2.2-era
thread cache behavior likely explains the poor showing vs jemalloc.

PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
this is long resolved.

So to summarize where the two historical concerns stand today:

1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
   2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
   false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
   gperftools 2.16 — this is a non-issue.

2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
   backup memory not returned to OS. Addressable by calling the
   allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
   the backup completion path, right where malloc_trim(4 MiB) already
   sits today.

Both blockers are resolved. I think the pragmatic path is:

1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
   build (this also compiles out malloc_trim, which is correct)
2. Add allocator-aware memory release in the backup completion path
   (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
3. Test the PBS backup memory reclamation to confirm VmRSS comes back
   down after backup

I'll prepare a patch series if this approach sounds reasonable.






end of thread, other threads:[~2026-03-26  7:47 UTC | newest]
