public inbox for pve-devel@lists.proxmox.com
* [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
@ 2026-03-26  2:54 Kefu Chai
  2026-03-26  6:09 ` DERUMIER, Alexandre
  2026-03-26  6:16 ` Dietmar Maurer
  0 siblings, 2 replies; 4+ messages in thread
From: Kefu Chai @ 2026-03-26  2:54 UTC (permalink / raw)
  To: pve-devel; +Cc: Kefu Chai

Hi all,

I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.

## Background: benchmark numbers

The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).

The mechanism is well understood: librbd creates many small temporary
objects per I/O operation, and glibc's allocator performs poorly under
that allocation pattern.

## Background: malloc_trim interaction

On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.
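
To make the malloc_trim behavior described above concrete, here is a
minimal standalone sketch. It is not QEMU code. It assumes glibc on
Linux, and the names trim_demo and rss_kib are made up for this
example. It allocates many small objects, frees all but the topmost
one (so the freed space cannot simply be cut off the end of the heap),
and reads RSS around a malloc_trim call with the same 4 MiB pad.

```c
/* Sketch only: assumes glibc on Linux (malloc_trim is a glibc extension). */
#include <assert.h>
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define N_CHUNKS   262144            /* 262144 * 256 B = 64 MiB of payload */
#define CHUNK_SIZE 256
#define TRIM_PAD   (4 * 1024 * 1024) /* same pad value as quoted in the text */

static void *chunks[N_CHUNKS];

/* Resident set size of this process in KiB, or -1 on error. */
static long rss_kib(void)
{
    long total_pages, resident_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) {
        return -1;
    }
    int n = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (n != 2) {
        return -1;
    }
    return resident_pages * (sysconf(_SC_PAGESIZE) / 1024);
}

/*
 * Hypothetical demo helper: returns 0 on success and stores the RSS
 * measured just before and just after malloc_trim().
 */
static int trim_demo(long *before_kib, long *after_kib)
{
    for (size_t i = 0; i < N_CHUNKS; i++) {
        chunks[i] = malloc(CHUNK_SIZE);
        if (!chunks[i]) {
            return -1; /* sketch: earlier chunks are leaked on failure */
        }
        memset(chunks[i], 0xab, CHUNK_SIZE); /* touch so pages are resident */
    }
    /* Keep the last chunk live so the freed space is not at the heap top. */
    for (size_t i = 0; i + 1 < N_CHUNKS; i++) {
        free(chunks[i]);
    }
    *before_kib = rss_kib();
    malloc_trim(TRIM_PAD);
    *after_kib = rss_kib();
    free(chunks[N_CHUNKS - 1]);
    return (*before_kib < 0 || *after_kib < 0) ? -1 : 0;
}

#ifdef DEMO_MAIN
int main(void)
{
    long before, after;
    int rc = trim_demo(&before, &after);
    assert(rc == 0);
    printf("RSS before trim: %ld KiB, after: %ld KiB\n", before, after);
    return 0;
}
#endif
```

Build with -DDEMO_MAIN to get a standalone program. The point is only
that each trim has to walk the free lists and issue madvise calls,
which is the work the RCU thread repeats whenever its queue drains.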

QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Bonzini's commit aa087962 even makes --enable-malloc-trim +
--enable-malloc=tcmalloc a fatal build error; the two options are
explicitly mutually exclusive.

## History in pve-qemu

pve-qemu has tried both allocators before:

1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
   later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.

2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
   5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
   3d785ea with the explanation:

     "jemalloc does not play nice with our Rust library
     (proxmox-backup-qemu), specifically it never releases memory
     allocated from Rust to the OS. This leads to a problem with
     larger caches (e.g. for the PBS block driver)."

   Stefan referenced jemalloc#1398 [4]. The upstream fix was
   background_thread:true in MALLOC_CONF, which Stefan described as
   "weirdly hacky."

3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
   to the backup completion path in commit 5f9cb29, because without
   it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
   backup and never came back. The forum thread that motivated this
   was [5].

## The core problem and a possible solution

The documented reason for removing jemalloc, which would apply to
tcmalloc as well, boils down to this: after a PBS backup, the
Rust-based proxmox-backup-qemu library frees a large amount of memory,
but neither jemalloc nor tcmalloc returns it to the OS promptly:

- jemalloc's time-based decay requires ongoing allocation activity to
  make progress; when the app goes idle after backup, nothing triggers
  the decay.

- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
  driven by page-level deallocations, not by time. With the default
  TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
  deallocated. If the app goes idle, no scavenging happens.

But both allocators provide explicit "release everything now" APIs:

- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
  (calls ReleaseToSystem(static_cast<size_t>(-1)), i.e. SIZE_MAX,
  internally)
- jemalloc: mallctl("arena.<i>.purge", ...)

The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* call mallctl to purge arenas */
    #endif

QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but both could be derived from the value of the
existing malloc meson option.
The RCU thread malloc_trim is already handled — --enable-malloc=tcmalloc
compiles it out, and the alternative allocator handles steady-state
memory management on its own.
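
As a rough sketch of what an allocator-aware release helper could look
like as a standalone function: CONFIG_TCMALLOC and CONFIG_JEMALLOC are
the hypothetical macros discussed above (nothing defines them today),
and release_free_memory and enum release_path are names invented for
this example. The jemalloc branch assumes an unprefixed jemalloc 5.x
build that provides MALLCTL_ARENAS_ALL. With neither macro defined,
the code falls back to malloc_trim on glibc.

```c
/* Sketch only: CONFIG_TCMALLOC / CONFIG_JEMALLOC are hypothetical macros. */
#include <assert.h>
#include <stdio.h>

#if defined(CONFIG_TCMALLOC)
#include <gperftools/malloc_extension_c.h>
#elif defined(CONFIG_JEMALLOC)
#include <jemalloc/jemalloc.h>
#elif defined(__GLIBC__)
#include <malloc.h>
#endif

/* Which release path was compiled in (names are illustrative). */
enum release_path {
    RELEASE_NONE,
    RELEASE_GLIBC,
    RELEASE_TCMALLOC,
    RELEASE_JEMALLOC,
};

/* Ask the active allocator to hand free memory back to the OS. */
static enum release_path release_free_memory(void)
{
#if defined(CONFIG_TCMALLOC)
    /* C wrapper around MallocExtension::ReleaseFreeMemory() */
    MallocExtension_ReleaseFreeMemory();
    return RELEASE_TCMALLOC;
#elif defined(CONFIG_JEMALLOC)
    /* "arena.<i>.purge" with <i> = MALLCTL_ARENAS_ALL purges every arena */
    char name[64];
    snprintf(name, sizeof(name), "arena.%u.purge",
             (unsigned)MALLCTL_ARENAS_ALL);
    if (mallctl(name, NULL, NULL, NULL, 0) != 0) {
        return RELEASE_NONE;
    }
    return RELEASE_JEMALLOC;
#elif defined(__GLIBC__)
    malloc_trim(4 * 1024 * 1024);
    return RELEASE_GLIBC;
#else
    return RELEASE_NONE;
#endif
}

#ifdef DEMO_MAIN
int main(void)
{
    enum release_path path = release_free_memory();
    assert(path >= RELEASE_NONE && path <= RELEASE_JEMALLOC);
    printf("release path: %d\n", (int)path);
    return 0;
}
#endif
```

Keeping the dispatch in one helper would leave a single call in the
backup completion path, independent of which allocator the build
selected.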

## Questions for the team

1. Does anyone remember why tcmalloc was removed after only 8 days
   in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
   no explanation.

2. Fiona: given the backup memory reclamation work in 5f9cb29, would
   you see a problem with replacing malloc_trim with the
   allocator-specific release API in the backup completion path?

3. Has the proxmox-backup-qemu library's allocation pattern changed
   since 2020 in ways that might affect this? (e.g., different buffer
   management, Rust allocator changes)

4. Would anyone be willing to test a pve-qemu build with
   --enable-malloc=tcmalloc + the allocator-aware release patch
   against the PBS backup workload? The key metric is VmRSS in
   /proc/<qemu-pid>/status before and after a backup cycle.
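
For testers who want to script the VmRSS measurement, a small sketch of
reading it from /proc/<pid>/status follows. The function name
read_vmrss_kb is made up for this example. It assumes Linux procfs and
returns -1 if the file or the VmRSS line is missing.

```c
/* Sketch only: reads the VmRSS field from /proc/<pid>/status on Linux. */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns VmRSS in kB for the given pid, or -1 if it cannot be read. */
static long read_vmrss_kb(pid_t pid)
{
    char path[64];
    char line[256];
    long kb = -1;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* Only the "VmRSS:" line matches; kb is untouched otherwise. */
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) {
            break;
        }
    }
    fclose(f);
    return kb;
}

#ifdef DEMO_MAIN
int main(void)
{
    long kb = read_vmrss_kb(getpid());
    assert(kb > 0);
    printf("VmRSS: %ld kB\n", kb);
    return 0;
}
#endif
```

Calling this with the QEMU pid before the backup starts and again some
time after it completes gives the two numbers to compare.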

I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.

[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/

Regards,
Kefu






Thread overview: 4+ messages
2026-03-26  2:54 [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu? Kefu Chai
2026-03-26  6:09 ` DERUMIER, Alexandre
2026-03-26  6:16 ` Dietmar Maurer
2026-03-26  7:48   ` Kefu Chai
