From: "Kefu Chai" <k.chai@proxmox.com>
To: <pve-devel@lists.proxmox.com>
Cc: Kefu Chai <tchaikov@gmail.com>
Subject: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Thu, 26 Mar 2026 10:54:41 +0800
Message-ID: <DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com>
Hi all,
I'd like to revisit the question of using an alternative memory allocator
in pve-qemu. The topic came up in a recent forum thread [1] about Ceph
read performance on multi-AZ clusters, where both read-replica
localization and tcmalloc were discussed as complementary optimizations.
After digging into the history and source code, I think there may be a
path to re-enabling tcmalloc (or jemalloc) that addresses the issues
that led to their removal.
## Background: benchmark numbers
The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on
16KB random reads (53.5k -> 80k IOPS) when using
LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's
pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k
IOPS, also +50%).
The mechanism is well understood: librbd allocates and frees many small,
short-lived objects per I/O operation, and glibc's allocator is
comparatively slow under that pattern of churn and fragmentation.
## Background: malloc_trim interaction
On top of the allocation overhead, QEMU's RCU call thread
(util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the
RCU callback queue drains. Under I/O load this can happen several
times per second. malloc_trim walks the glibc heap looking for pages
to return to the OS via madvise(MADV_DONTNEED), which adds CPU
overhead.
QEMU upstream handles this correctly: when built with
--enable-malloc=tcmalloc (or jemalloc), meson.build sets
has_malloc_trim = false, which compiles out all malloc_trim calls.
Paolo Bonzini's commit aa087962 even makes combining
--enable-malloc-trim with --enable-malloc=tcmalloc a fatal
configure-time error: the two are explicitly mutually exclusive.
## History in pve-qemu
pve-qemu has tried both allocators before:
1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days
later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.
2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until
5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit
3d785ea with the explanation:
"jemalloc does not play nice with our Rust library
(proxmox-backup-qemu), specifically it never releases memory
allocated from Rust to the OS. This leads to a problem with
larger caches (e.g. for the PBS block driver)."
Stefan referenced jemalloc#1398 [4]. The upstream fix was
background_thread:true in MALLOC_CONF, which Stefan described as
"weirdly hacky."
3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call
to the backup completion path in commit 5f9cb29, because without
it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS
backup and never dropped back. The forum thread that motivated this
is [5].
## The core problem and a possible solution
The reason both allocators were removed boils down to: after a PBS
backup, the Rust-based proxmox-backup-qemu library frees a large
amount of memory, but neither jemalloc nor tcmalloc returns it to the
OS promptly:
- jemalloc's time-based decay requires ongoing allocation activity to
make progress; when the app goes idle after backup, nothing triggers
the decay.
- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is also
driven by page-level deallocations, not by time. With the default
TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages
deallocated. If the app goes idle, no scavenging happens.
But both allocators provide explicit "release everything now" APIs:
- gperftools: MallocExtension::instance()->ReleaseFreeMemory()
(calls ReleaseToSystem(LONG_MAX) internally)
- jemalloc: mallctl("arena.<i>.purge", ...)
The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to (the jemalloc branch is my sketch of the
upstream mallctl API, and I'm assuming QEMU's stringify() macro):

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        /* C shim declared in <gperftools/malloc_extension_c.h> */
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* purge all arenas; MALLCTL_ARENAS_ALL is in <jemalloc/jemalloc.h> */
        mallctl("arena." stringify(MALLCTL_ARENAS_ALL) ".purge",
                NULL, NULL, NULL, 0);
    #endif
QEMU's meson build doesn't currently define CONFIG_TCMALLOC or
CONFIG_JEMALLOC, but the value of the malloc build option is already
available in meson.build, so adding these defines is straightforward.
The RCU thread's malloc_trim is already taken care of:
--enable-malloc=tcmalloc compiles it out, and the alternative allocator
handles steady-state memory management on its own.
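For illustration, the wiring could look roughly like this in meson.build
(untested sketch; CONFIG_TCMALLOC and CONFIG_JEMALLOC are names I'm
proposing, while get_option('malloc') and config_host_data follow
QEMU's existing conventions):

```meson
# sketch: define per-allocator config symbols from the existing
# 'malloc' build option ('system', 'tcmalloc' or 'jemalloc')
malloc_opt = get_option('malloc')
if malloc_opt == 'tcmalloc'
  config_host_data.set('CONFIG_TCMALLOC', true)
elif malloc_opt == 'jemalloc'
  config_host_data.set('CONFIG_JEMALLOC', true)
endif
```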
## Questions for the team
1. Does anyone remember why tcmalloc was removed after only 8 days
in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with
no explanation.
2. Fiona: given the backup memory reclamation work in 5f9cb29, would
you see a problem with replacing malloc_trim with the
allocator-specific release API in the backup completion path?
3. Has the proxmox-backup-qemu library's allocation pattern changed
since 2020 in ways that might affect this? (e.g., different buffer
management, Rust allocator changes)
4. Would anyone be willing to test a pve-qemu build with
--enable-malloc=tcmalloc + the allocator-aware release patch
against the PBS backup workload? The key metric is VmRSS in
/proc/<qemu-pid>/status before and after a backup cycle.
I'm happy to prepare a patch series if there's interest, but wanted
to get input first given the history.
[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/
Regards,
Kefu
Thread overview: 4+ messages
2026-03-26 2:54 Kefu Chai [this message]
2026-03-26 6:09 ` DERUMIER, Alexandre
2026-03-26 6:16 ` Dietmar Maurer
2026-03-26 7:48 ` Kefu Chai