* [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator
2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
@ 2026-04-10 4:30 ` Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2 siblings, 0 replies; 5+ messages in thread
From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw)
To: pve-devel
Add allocator-aware memory release in the backup completion path:
since tcmalloc does not provide glibc's malloc_trim(), use the
tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead.
This function walks tcmalloc's page heap free span lists and calls
madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact
the heap, so latency impact is negligible.
Also add a CONFIG_TCMALLOC meson define so the conditional compilation
in pve-backup.c can detect the allocator choice.
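For reference, the release path this patch introduces can be sketched as a
standalone translation unit. This is a minimal sketch, not the patched
pve-backup.c itself; the demo_backup_cycle() driver is hypothetical and only
simulates a backup job's burst of short-lived allocations:

```c
#include <stdlib.h>
#include <string.h>

#if defined(CONFIG_TCMALLOC)
#include <gperftools/malloc_extension_c.h>  /* C shim for the MallocExtension API */
#else
#include <malloc.h>                         /* glibc: malloc_trim() */
#endif

/* Release allocator caches back to the OS, picking the allocator-aware call. */
static void release_free_memory(void)
{
#if defined(CONFIG_TCMALLOC)
    /* Walks tcmalloc's page heap free span lists and madvise()s them away;
     * it never touches live allocations, so the call is cheap. */
    MallocExtension_ReleaseFreeMemory();
#else
    /* glibc fallback: trim the heap, keeping 4 MiB of padding
     * (the same value pve-backup.c uses today). */
    malloc_trim(4 * 1024 * 1024);
#endif
}

/* Hypothetical driver: simulate a burst of short-lived allocations, free
 * them all, then release the cached pages. Returns 0 on success. */
int demo_backup_cycle(void)
{
    enum { N = 512, SZ = 64 * 1024 };
    void *bufs[N];
    for (int i = 0; i < N; i++) {
        bufs[i] = malloc(SZ);
        if (bufs[i] == NULL)
            return -1;
        memset(bufs[i], 0, SZ);  /* fault the pages in */
    }
    for (int i = 0; i < N; i++)
        free(bufs[i]);
    release_free_memory();
    return 0;
}
```

Building with -DCONFIG_TCMALLOC and -ltcmalloc exercises the tcmalloc branch;
without the define, the glibc branch needs nothing beyond libc.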
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
debian/patches/series | 1 +
2 files changed, 78 insertions(+)
create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
new file mode 100644
index 0000000..719d522
--- /dev/null
+++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
@@ -0,0 +1,77 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Kefu Chai <k.chai@proxmox.com>
+Date: Thu, 9 Apr 2026 17:29:10 +0800
+Subject: [PATCH] PVE: use tcmalloc as the memory allocator
+
+Use tcmalloc (from gperftools) as the memory allocator for improved
+performance with workloads that create many small, short-lived
+allocations -- particularly Ceph/librbd I/O paths.
+
+tcmalloc uses per-thread caches and size-class freelists that handle
+this allocation pattern more efficiently than glibc's allocator. Ceph
+benchmarks show ~50% IOPS improvement on 16KB random reads.
+
+Since tcmalloc does not provide glibc's malloc_trim(), use the
+tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release
+cached memory back to the OS after backup completion. This function
+walks tcmalloc's page heap free span lists and calls
+madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact
+the heap, so latency impact is negligible.
+
+Historical context:
+- tcmalloc was originally enabled in 2015 but removed due to
+ performance issues with gperftools 2.2's default settings (low
+ TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit
+ disabled). These issues were resolved in gperftools 2.4+.
+- jemalloc replaced tcmalloc but was removed in 2020 because it didn't
+ release memory allocated from Rust (proxmox-backup-qemu) back to the
+ OS. The allocator-specific release API addresses this.
+- PVE 9 ships gperftools 2.16, so the old tuning issues are moot.
+
+Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+---
+ meson.build | 1 +
+ pve-backup.c | 8 ++++++++
+ 2 files changed, 9 insertions(+)
+
+diff --git a/meson.build b/meson.build
+index 0b28d2ec39..c6de2464d6 100644
+--- a/meson.build
++++ b/meson.build
+@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found())
+ config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found())
+ config_host_data.set('CONFIG_HOGWEED', hogweed.found())
+ config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc')
+ config_host_data.set('CONFIG_ZSTD', zstd.found())
+ config_host_data.set('CONFIG_QPL', qpl.found())
+ config_host_data.set('CONFIG_UADK', uadk.found())
+diff --git a/pve-backup.c b/pve-backup.c
+index ad0f8668fd..d5556f152b 100644
+--- a/pve-backup.c
++++ b/pve-backup.c
+@@ -19,6 +19,8 @@
+
+ #if defined(CONFIG_MALLOC_TRIM)
+ #include <malloc.h>
++#elif defined(CONFIG_TCMALLOC)
++#include <gperftools/malloc_extension_c.h>
+ #endif
+
+ #include <proxmox-backup-qemu.h>
+@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void)
+ * Won't happen by default if there is fragmentation.
+ */
+ malloc_trim(4 * 1024 * 1024);
++#elif defined(CONFIG_TCMALLOC)
++ /*
++ * Release free memory from tcmalloc's page cache back to the OS. This is
++ * allocator-aware and efficiently returns cached spans via madvise().
++ */
++ MallocExtension_ReleaseFreeMemory();
+ #endif
+ }
+
+--
+2.47.3
+
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..468df6c 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch
pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch
pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch
pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch
+pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
--
2.47.3
^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator
2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
@ 2026-04-10 4:30 ` Kefu Chai
2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2 siblings, 0 replies; 5+ messages in thread
From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw)
To: pve-devel
Use tcmalloc (from gperftools) instead of glibc's allocator for
improved performance with workloads that create many small, short-lived
allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks
show ~50% IOPS improvement on 16KB random reads.
tcmalloc was originally enabled in 2015 but removed due to tuning issues
with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues
have long been resolved.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
debian/control | 1 +
debian/rules | 1 +
2 files changed, 2 insertions(+)
diff --git a/debian/control b/debian/control
index 81cc026..a3121e1 100644
--- a/debian/control
+++ b/debian/control
@@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13),
libfuse3-dev,
libgbm-dev,
libgnutls28-dev,
+ libgoogle-perftools-dev,
libiscsi-dev (>= 1.12.0),
libjpeg-dev,
libjson-perl,
diff --git a/debian/rules b/debian/rules
index c90db29..a63e3a5 100755
--- a/debian/rules
+++ b/debian/rules
@@ -70,6 +70,7 @@ endif
--enable-libusb \
--enable-linux-aio \
--enable-linux-io-uring \
+ --enable-malloc=tcmalloc \
--enable-numa \
--enable-opengl \
--enable-rbd \
--
2.47.3
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
@ 2026-04-10 8:12 ` Fiona Ebner
2026-04-10 10:45 ` DERUMIER, Alexandre
2 siblings, 1 reply; 5+ messages in thread
From: Fiona Ebner @ 2026-04-10 8:12 UTC (permalink / raw)
To: Kefu Chai, pve-devel
On 10.04.26 at 6:30 AM, Kefu Chai wrote:
> Following up on the RFC thread [0], here's the formal submission to
> re-enable tcmalloc for pve-qemu.
>
> Quick recap: librbd's I/O path allocates a lot of small, short-lived
> objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
> and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
> arena contention and cache-line bouncing show up clearly in perf
> profiles. tcmalloc's per-thread fast path avoids both.
>
> A bit of history for context: tcmalloc was tried in 2015 but dropped
> after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
> jemalloc replaced it but was dropped in 2020 because it didn't
> release Rust-allocated memory (from proxmox-backup-qemu) back to the
> OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
> reclamation gap explicitly.
>
> On Dietmar's two concerns from the RFC:
>
> "Could ReleaseFreeMemory() halt the application?" -- No, and I
> verified this directly. It walks tcmalloc's page heap free span
> lists and calls madvise(MADV_DONTNEED) on each span. It does not
> walk allocated memory or compact the heap. A standalone test
> reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero
> wall time. The call runs once at backup completion, same spot where
> malloc_trim runs today.
>
> "Wouldn't a pool allocator in librbd be the proper fix?" -- In
> principle yes, but I audited librbd in Ceph squid and it does NOT
> use a pool allocator -- all I/O path objects go through plain new.
> Ceph's mempool is tracking-only, not actual pooling. Adding real
> pooling would be a significant Ceph-side change (submission and
> completion happen on different threads), and it's orthogonal to the
> allocator choice here.
>
> Also thanks to Alexandre for confirming the 2015 gperftools issues
> are long resolved.
>
> Test results
> ------------
>
> Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
> This is the worst case for showing allocator impact, since there's
> no network latency for CPU savings to amortize against:
>
> rbd bench --io-type read --io-size 4096 --io-threads 16 \
> --io-pattern rand
>
> Metric | glibc ptmalloc2 | tcmalloc | Delta
> ---------------+-----------------+-----------+--------
> IOPS | 131,201 | 136,389 | +4.0%
> CPU time | 1,556 ms | 1,439 ms | -7.5%
> Cycles | 6.74B | 6.06B | -10.1%
> Cache misses | 137.1M | 123.9M | -9.6%
>
> perf report on the glibc run shows ~8% of CPU in allocator internals
> (_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
> symbols are barely visible with tcmalloc because the fast path is
> just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
> production clusters where network RTT dominates per-I/O latency --
> the 10% CPU savings compound there since the host can push more I/O
> into the pipeline during the same wall time.
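The standalone reclamation check described in the quoted recap is easy to
approximate. Below is a glibc-based sketch (Linux-only): the tcmalloc build
would call MallocExtension_ReleaseFreeMemory() where malloc_trim() appears,
rss_kib() is an illustrative helper, and the /proc parsing assumes Linux's
VmRSS line format. It is not Kefu's actual test program, and exact numbers
depend on the allocator and kernel:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>   /* glibc malloc_trim(); a tcmalloc build would call
                         MallocExtension_ReleaseFreeMemory() instead */

/* Read VmRSS (in KiB) from /proc/self/status; returns -1 on failure.
 * Linux-specific, for illustration only. */
long rss_kib(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    char line[256];
    long kib = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "VmRSS: %ld kB", &kib) == 1)
            break;
    }
    fclose(f);
    return kib;
}

/* Allocate and fault in ~100 MiB of sub-mmap-threshold chunks, free them,
 * run the release call, and return the KiB of RSS reclaimed. */
long reclaim_demo(void)
{
    enum { N = 1600, SZ = 64 * 1024 };
    void **bufs = malloc(N * sizeof *bufs);
    if (!bufs)
        return -1;
    for (int i = 0; i < N; i++) {
        bufs[i] = malloc(SZ);        /* error handling elided for brevity */
        memset(bufs[i], 1, SZ);      /* fault pages in so they count in RSS */
    }
    long before = rss_kib();
    for (int i = 0; i < N; i++)
        free(bufs[i]);
    free(bufs);
    malloc_trim(0);                  /* allocator-specific release call */
    long after = rss_kib();
    return before - after;
}
```

Kefu's reported figures (386 of 410 MB reclaimed) came from the tcmalloc
variant of this kind of check; the glibc sketch here will show different
absolute numbers.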
How does the performance change when doing IO within a QEMU guest?
How does this affect the performance for other storage types, like ZFS,
qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin,
etc. and other workloads like saving VM state during snapshot, transfer
during migration, maybe memory hotplug/ballooning, network performance
for vNICs?
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
@ 2026-04-10 10:45 ` DERUMIER, Alexandre
0 siblings, 0 replies; 5+ messages in thread
From: DERUMIER, Alexandre @ 2026-04-10 10:45 UTC (permalink / raw)
To: pve-devel, f.ebner, k.chai
>>How does the performance change when doing IO within a QEMU guest?
>>
>>How does this affect the performance for other storage types, like
>>ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM,
>>LVM-thin, etc. and other workloads like saving VM state during
>>snapshot, transfer during migration, maybe memory hotplug/ballooning,
>>network performance for vNICs?
Hi Fiona,
I'm still running it in production (I kept tcmalloc after its removal
from the PVE build some years ago), and I haven't noticed any problems
(but I still don't use PBS).
That said, I haven't benchmarked with/without it in the last 5-6 years.
VM memory should probably be checked too; I'm thinking of RSS memory
with balloon free_page_reporting, to see whether it correctly frees
memory.