From: "Kefu Chai" <k.chai@proxmox.com>
To: <pve-devel@lists.proxmox.com>
Cc: Kefu Chai <tchaikov@gmail.com>
Subject: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Thu, 26 Mar 2026 10:54:41 +0800
Message-ID: <DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com>

Hi all,

I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.

## Background: benchmark numbers

The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).

The mechanism is well-understood: librbd creates many small temporary
objects per I/O operation, and glibc's allocator is slow under that
fragmentation pattern.

## Background: malloc_trim interaction

On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.

QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Bonzini's commit aa087962 even makes --enable-malloc-trim +
--enable-malloc=tcmalloc a fatal build error — they are explicitly
mutually exclusive.

## History in pve-qemu

pve-qemu has tried both allocators before:

1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
   later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.

2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
   5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
   3d785ea with the explanation:

     "jemalloc does not play nice with our Rust library
     (proxmox-backup-qemu), specifically it never releases memory
     allocated from Rust to the OS. This leads to a problem with
     larger caches (e.g. for the PBS block driver)."

   Stefan referenced jemalloc#1398 [4]. The upstream fix was
   background_thread:true in MALLOC_CONF, which Stefan described as
   "weirdly hacky."

3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
   to the backup completion path in commit 5f9cb29, because without
   it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
   backup and never came back. The forum thread that motivated this
   was [5].

## The core problem and a possible solution

The reason both allocators were removed boils down to this: after a PBS
backup, the Rust-based proxmox-backup-qemu library frees a large
amount of memory, but neither jemalloc nor tcmalloc returns it to the
OS promptly:

- jemalloc's time-based decay requires ongoing allocation activity to
  make progress; when the app goes idle after backup, nothing triggers
  the decay.

- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
  driven by page-level deallocations, not by time. With the default
  TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
  deallocated. If the app goes idle, no scavenging happens.

But both allocators provide explicit "release everything now" APIs:

- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
  (calls ReleaseToSystem(LONG_MAX) internally)
- jemalloc: mallctl("arena.<i>.purge", ...)

The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* jemalloc >= 5.0; stringify() is QEMU's macro (qemu/compiler.h) */
        mallctl("arena." stringify(MALLCTL_ARENAS_ALL) ".purge",
                NULL, NULL, NULL, 0);
    #endif

QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but the value of the existing 'malloc' meson option is
enough to derive them.
The RCU thread malloc_trim is already handled — --enable-malloc=tcmalloc
compiles it out, and the alternative allocator handles steady-state
memory management on its own.
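
The build-system side could look roughly like this (a sketch only: the
'malloc' option exists upstream, while the CONFIG_TCMALLOC and
CONFIG_JEMALLOC keys would be new):

```meson
# meson.build sketch, near the existing 'malloc' option handling
if get_option('malloc') == 'tcmalloc'
  config_host_data.set('CONFIG_TCMALLOC', 1)
elif get_option('malloc') == 'jemalloc'
  config_host_data.set('CONFIG_JEMALLOC', 1)
endif
```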

## Questions for the team

1. Does anyone remember why tcmalloc was removed after only 8 days
   in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
   no explanation.

2. Fiona: given the backup memory reclamation work in 5f9cb29, would
   you see a problem with replacing malloc_trim with the
   allocator-specific release API in the backup completion path?

3. Has the proxmox-backup-qemu library's allocation pattern changed
   since 2020 in ways that might affect this? (e.g., different buffer
   management, Rust allocator changes)

4. Would anyone be willing to test a pve-qemu build with
   --enable-malloc=tcmalloc + the allocator-aware release patch
   against the PBS backup workload? The key metric is VmRSS in
   /proc/<qemu-pid>/status before and after a backup cycle.

I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.

[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/

Regards,
Kefu
