From: "Kefu Chai" <k.chai@proxmox.com>
To: <pve-devel@lists.proxmox.com>
Cc: Kefu Chai <tchaikov@gmail.com>
Subject: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Thu, 26 Mar 2026 10:54:41 +0800
Message-ID: <DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com>
Hi all,
I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.
## Background: benchmark numbers
The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).
The mechanism is well understood: librbd creates many small temporary
objects per I/O operation, and glibc's allocator handles that
churn-heavy allocation pattern poorly.
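For anyone who wants to reproduce those numbers before any packaging
change lands, the effect can be A/B-tested by preloading the allocator
into a manually launched QEMU. This is an untested invocation sketch:
the library path is Debian's multiarch location and the rbd drive
string is a placeholder, both assumptions to adjust locally.

```shell
# Baseline: run the same command once without LD_PRELOAD, once with it,
# and compare fio randread results from inside the guest.
# Library path and drive spec are assumptions, not tested values.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
    qemu-system-x86_64 -m 4096 \
        -drive file=rbd:pool/image,if=virtio,cache=none ...
```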
## Background: malloc_trim interaction
On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.
QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Bonzini's commit aa087962 even makes --enable-malloc-trim +
--enable-malloc=tcmalloc a fatal build error — they are explicitly
mutually exclusive.
## History in pve-qemu
pve-qemu has tried both allocators before:
1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.
2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
3d785ea with the explanation:
"jemalloc does not play nice with our Rust library
(proxmox-backup-qemu), specifically it never releases memory
allocated from Rust to the OS. This leads to a problem with
larger caches (e.g. for the PBS block driver)."
Stefan referenced jemalloc#1398 [4]. The upstream fix was
background_thread:true in MALLOC_CONF, which Stefan described as
"weirdly hacky."
3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
to the backup completion path in commit 5f9cb29, because without
it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
backup and never came back. The forum thread that motivated this
was [5].
## The core problem and a possible solution
The reason both allocators were removed boils down to this: after a PBS
backup, the Rust-based proxmox-backup-qemu library frees a large
amount of memory, but neither jemalloc nor tcmalloc returns it to the
OS promptly:
- jemalloc's time-based decay requires ongoing allocation activity to
make progress; when the app goes idle after backup, nothing triggers
the decay.
- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
driven by page-level deallocations, not by time. With the default
TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
deallocated. If the app goes idle, no scavenging happens.
But both allocators provide explicit "release everything now" APIs:
- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
(calls ReleaseToSystem(LONG_MAX) internally)
- jemalloc: mallctl("arena.<i>.purge", ...)
The current pve-backup.c code already has the right structure:
#if defined(CONFIG_MALLOC_TRIM)
malloc_trim(4 * 1024 * 1024);
#endif
This could be extended to:
#if defined(CONFIG_MALLOC_TRIM)
malloc_trim(4 * 1024 * 1024);
#elif defined(CONFIG_TCMALLOC)
MallocExtension_ReleaseFreeMemory();
#elif defined(CONFIG_JEMALLOC)
/* call mallctl to purge arenas */
#endif
QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but the value of the malloc meson option is already
available in meson.build, so adding those defines would be a small change.
The RCU thread malloc_trim is already handled — --enable-malloc=tcmalloc
compiles it out, and the alternative allocator handles steady-state
memory management on its own.
## Questions for the team
1. Does anyone remember why tcmalloc was removed after only 8 days
in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
no explanation.
2. Fiona: given the backup memory reclamation work in 5f9cb29, would
you see a problem with replacing malloc_trim with the
allocator-specific release API in the backup completion path?
3. Has the proxmox-backup-qemu library's allocation pattern changed
since 2020 in ways that might affect this? (e.g., different buffer
management, Rust allocator changes)
4. Would anyone be willing to test a pve-qemu build with
--enable-malloc=tcmalloc + the allocator-aware release patch
against the PBS backup workload? The key metric is VmRSS in
/proc/<qemu-pid>/status before and after a backup cycle.
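For question 4, the measurement itself is cheap. A sketch of the
before/after check, where the pidfile path follows the usual PVE layout
and VMID 100 is a placeholder (both assumptions):

```shell
# Print a process's resident set size in kB. For a running VM, replace
# "self" with $(cat /var/run/qemu-server/100.pid) -- pidfile path
# assumed, VMID 100 a placeholder. Run once before the backup, once a
# few minutes after it completes, and compare.
rss_kb=$(awk '/^VmRSS:/ {print $2}' /proc/self/status)
echo "VmRSS: ${rss_kb} kB"
```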
I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.
[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/
Regards,
Kefu