* [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
@ 2026-03-26  2:54 Kefu Chai
  2026-03-26  6:09 ` DERUMIER, Alexandre
  2026-03-26  6:16 ` Dietmar Maurer
  0 siblings, 2 replies; 4+ messages in thread
From: Kefu Chai @ 2026-03-26  2:54 UTC (permalink / raw)
  To: pve-devel; +Cc: Kefu Chai

Hi all,

I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.

## Background: benchmark numbers

The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).

The mechanism is well understood: librbd creates many small, short-lived
objects per I/O operation, and glibc's allocator is slow under that
fragmentation pattern.

## Background: malloc_trim interaction

On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.
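
For readers who have not used the call, here is a minimal glibc-only sketch
of the allocate/free/trim pattern. It is not QEMU's code: churn_then_trim
is a hypothetical helper, and the block count and size are arbitrary. Only
malloc_trim() and its 4 MiB pad argument come from the description above.

```c
#include <assert.h>
#include <malloc.h>   /* malloc_trim() is a glibc extension */
#include <stdlib.h>

/*
 * Allocate and free `count` blocks of `block_size` bytes, then ask glibc to
 * return free heap pages to the kernel.
 *
 * Returns malloc_trim()'s result: 1 if memory was released to the system,
 * 0 if nothing could be released.
 */
static int churn_then_trim(size_t count, size_t block_size)
{
    void **blocks = calloc(count, sizeof(*blocks));
    assert(blocks != NULL);

    for (size_t i = 0; i < count; i++) {
        blocks[i] = malloc(block_size);
    }
    for (size_t i = 0; i < count; i++) {
        free(blocks[i]);   /* free(NULL) is a no-op, so failed mallocs are fine */
    }
    free(blocks);

    /* Keep up to 4 MiB of slack at the top of the heap, release the rest. */
    return malloc_trim(4 * 1024 * 1024);
}
```

Each such call has to inspect the heap before it can release anything, which
is the per-call CPU cost referred to above.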

QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Bonzini's commit aa087962 even makes combining --enable-malloc-trim
with --enable-malloc=tcmalloc a fatal build error, so the two are
explicitly mutually exclusive.

## History in pve-qemu

pve-qemu has tried both allocators before:

1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
   later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.

2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
   5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
   3d785ea with the explanation:

     "jemalloc does not play nice with our Rust library
     (proxmox-backup-qemu), specifically it never releases memory
     allocated from Rust to the OS. This leads to a problem with
     larger caches (e.g. for the PBS block driver)."

   Stefan referenced jemalloc#1398 [4]. The upstream fix was
   background_thread:true in MALLOC_CONF, which Stefan described as
   "weirdly hacky."

3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
   to the backup completion path in commit 5f9cb29, because without
   it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
   backup and never came back. The forum thread that motivated this
   was [5].

## The core problem and a possible solution

The documented problem with jemalloc, which gperftools tcmalloc shares,
boils down to this: after a PBS backup, the Rust-based
proxmox-backup-qemu library frees a large amount of memory, but neither
jemalloc nor tcmalloc returns it to the OS promptly:

- jemalloc's time-based decay requires ongoing allocation activity to
  make progress; when the app goes idle after backup, nothing triggers
  the decay.

- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
  driven by page-level deallocations, not by time. With the default
  TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
  deallocated. If the app goes idle, no scavenging happens.

But both allocators provide explicit "release everything now" APIs:

- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
  (calls ReleaseToSystem() with the maximum size_t value internally)
- jemalloc: mallctl("arena.<i>.purge", ...)

The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* call mallctl to purge arenas */
    #endif

QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but the value of the malloc meson option is available,
so adding them should be straightforward. The RCU thread's malloc_trim
call is already handled: --enable-malloc=tcmalloc compiles it out, and
the alternative allocator handles steady-state memory management on
its own.
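
To make the idea concrete, here is a rough standalone sketch of what an
allocator-aware release helper might look like. Several things in it are
assumptions, not existing QEMU code: release_free_memory is a hypothetical
name, CONFIG_TCMALLOC and CONFIG_JEMALLOC are the not-yet-existing defines
discussed above, and __GLIBC__ stands in for CONFIG_MALLOC_TRIM so that the
file compiles on its own. The jemalloc branch assumes jemalloc 5.x, which
provides MALLCTL_ARENAS_ALL.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(CONFIG_TCMALLOC)
#include <gperftools/malloc_extension_c.h>
#elif defined(CONFIG_JEMALLOC)
#include <jemalloc/jemalloc.h>
#include <stdio.h>
#elif defined(__GLIBC__)
#include <malloc.h>
#endif

/*
 * Ask the active allocator to return free memory to the OS.
 * Returns true if a release request was issued (and, for jemalloc, accepted),
 * false if no supported allocator was detected at build time.
 */
static bool release_free_memory(void)
{
#if defined(CONFIG_TCMALLOC)
    /* C wrapper around MallocExtension::instance()->ReleaseFreeMemory() */
    MallocExtension_ReleaseFreeMemory();
    return true;
#elif defined(CONFIG_JEMALLOC)
    /* Purge unused dirty pages in all arenas. */
    char cmd[64];
    int n = snprintf(cmd, sizeof(cmd), "arena.%u.purge",
                     (unsigned)MALLCTL_ARENAS_ALL);
    assert(n > 0 && (size_t)n < sizeof(cmd));
    return mallctl(cmd, NULL, NULL, NULL, 0) == 0;
#elif defined(__GLIBC__)
    /* Same call and pad value as the existing backup completion path. */
    malloc_trim(4 * 1024 * 1024);
    return true;
#else
    return false;
#endif
}
```

In pve-backup.c the existing #if block would then collapse into a single
call to such a helper at backup completion.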

## Questions for the team

1. Does anyone remember why tcmalloc was removed after only 8 days
   in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
   no explanation.

2. Fiona: given the backup memory reclamation work in 5f9cb29, would
   you see a problem with replacing malloc_trim with the
   allocator-specific release API in the backup completion path?

3. Has the proxmox-backup-qemu library's allocation pattern changed
   since 2020 in ways that might affect this? (e.g., different buffer
   management, Rust allocator changes)

4. Would anyone be willing to test a pve-qemu build with
   --enable-malloc=tcmalloc + the allocator-aware release patch
   against the PBS backup workload? The key metric is VmRSS in
   /proc/<qemu-pid>/status before and after a backup cycle.
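
For the measurement in question 4, something along these lines could be
used to sample the value. This is only a sketch: read_vmrss_kib is a
hypothetical helper, and it assumes the usual Linux "VmRSS: <n> kB" line
format in the status file.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the VmRSS value (in KiB) from a /proc/<pid>/status style file.
 * Returns -1 if the file cannot be opened or has no VmRSS line.
 */
static long read_vmrss_kib(const char *status_path)
{
    assert(status_path != NULL);

    FILE *f = fopen(status_path, "r");
    if (f == NULL) {
        return -1;
    }

    char line[256];
    long rss_kib = -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        /* Whitespace in the format matches the tab/space padding. */
        if (sscanf(line, "VmRSS: %ld kB", &rss_kib) == 1) {
            break;
        }
    }
    fclose(f);
    return rss_kib;
}
```

Calling it with the QEMU process's status path once before the backup
starts and again some time after it finishes gives the two numbers to
compare.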

I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.

[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/

Regards,
Kefu




