all lists on lists.proxmox.com
* [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
@ 2026-04-10  4:30 Kefu Chai
  2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Kefu Chai @ 2026-04-10  4:30 UTC (permalink / raw)
  To: pve-devel

Following up on the RFC thread [0], here's the formal submission to
re-enable tcmalloc for pve-qemu.

Quick recap: librbd's I/O path allocates a lot of small, short-lived
objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
arena contention and cache-line bouncing show up clearly in perf
profiles. tcmalloc's per-thread fast path avoids both.

A bit of history for context: tcmalloc was tried in 2015 but dropped
after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
jemalloc replaced it but was dropped in 2020 because it didn't
release Rust-allocated memory (from proxmox-backup-qemu) back to the
OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
reclamation gap explicitly.

On Dietmar's two concerns from the RFC:

"Could ReleaseFreeMemory() halt the application?" -- No, and I
verified this directly. It walks tcmalloc's page heap free span
lists and calls madvise(MADV_DONTNEED) on each span. It does not
walk allocated memory or compact the heap. A standalone test
reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero
wall time. The call runs once at backup completion, same spot where
malloc_trim runs today.

"Wouldn't a pool allocator in librbd be the proper fix?" -- In
principle yes, but I audited librbd in Ceph squid and it does NOT
use a pool allocator -- all I/O path objects go through plain new.
Ceph's mempool is tracking-only, not actual pooling. Adding real
pooling would be a significant Ceph-side change (submission and
completion happen on different threads), and it's orthogonal to the
allocator choice here.

Also thanks to Alexandre for confirming the 2015 gperftools issues
are long resolved.

Test results
------------

Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
This is the worst case for showing allocator impact, since there's
no network latency for CPU savings to amortize against:

  rbd bench --io-type read --io-size 4096 --io-threads 16 \
            --io-pattern rand

  Metric         | glibc ptmalloc2 | tcmalloc  | Delta
  ---------------+-----------------+-----------+--------
  IOPS           |         131,201 |   136,389 |  +4.0%
  CPU time       |        1,556 ms |  1,439 ms |  -7.5%
  Cycles         |           6.74B |     6.06B | -10.1%
  Cache misses   |          137.1M |    123.9M |  -9.6%

perf report on the glibc run shows ~8% of CPU in allocator internals
(_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
symbols are barely visible with tcmalloc because the fast path is
just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
production clusters where network RTT dominates per-I/O latency --
the 10% CPU savings compound there since the host can push more I/O
into the pipeline during the same wall time.

The series is small:

  1/2  adds the QEMU source patch (0048) with the CONFIG_TCMALLOC
       meson define and the ReleaseFreeMemory() call in
       pve-backup.c's cleanup path.
  2/2  adds libgoogle-perftools-dev to Build-Depends and
       --enable-malloc=tcmalloc to configure.

Runtime dep libgoogle-perftools4t64 (>= 2.16) is picked up
automatically by dh_shlibdeps.

[0]: https://lore.proxmox.com/pve-devel/DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com/
[1]: https://ceph.io/en/news/blog/2023/reef-freelist-bench/

Kefu Chai (2):
  PVE: use tcmalloc as the memory allocator
  d/rules: enable tcmalloc as the memory allocator

 debian/control                                |  1 +
 ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
 debian/patches/series                         |  1 +
 debian/rules                                  |  1 +
 4 files changed, 80 insertions(+)
 create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch

--
2.47.3





^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator
  2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
@ 2026-04-10  4:30 ` Kefu Chai
  2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
  2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
  2 siblings, 0 replies; 5+ messages in thread
From: Kefu Chai @ 2026-04-10  4:30 UTC (permalink / raw)
  To: pve-devel

Add allocator-aware memory release in the backup completion path:
since tcmalloc does not provide glibc's malloc_trim(), use the
tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead.
This function walks tcmalloc's page heap free span lists and calls
madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact
the heap, so latency impact is negligible.

Also adds a CONFIG_TCMALLOC meson define so the conditional compilation
in pve-backup.c can detect the allocator choice.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
 debian/patches/series                         |  1 +
 2 files changed, 78 insertions(+)
 create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch

diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
new file mode 100644
index 0000000..719d522
--- /dev/null
+++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
@@ -0,0 +1,77 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Kefu Chai <k.chai@proxmox.com>
+Date: Thu, 9 Apr 2026 17:29:10 +0800
+Subject: [PATCH] PVE: use tcmalloc as the memory allocator
+
+Use tcmalloc (from gperftools) as the memory allocator for improved
+performance with workloads that create many small, short-lived
+allocations -- particularly Ceph/librbd I/O paths.
+
+tcmalloc uses per-thread caches and size-class freelists that handle
+this allocation pattern more efficiently than glibc's allocator. Ceph
+benchmarks show ~50% IOPS improvement on 16KB random reads.
+
+Since tcmalloc does not provide glibc's malloc_trim(), use the
+tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release
+cached memory back to the OS after backup completion. This function
+walks tcmalloc's page heap free span lists and calls
+madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact
+the heap, so latency impact is negligible.
+
+Historical context:
+- tcmalloc was originally enabled in 2015 but removed due to
+  performance issues with gperftools 2.2's default settings (low
+  TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit
+  disabled). These issues were resolved in gperftools 2.4+.
+- jemalloc replaced tcmalloc but was removed in 2020 because it didn't
+  release memory allocated from Rust (proxmox-backup-qemu) back to the
+  OS. The allocator-specific release API addresses this.
+- PVE 9 ships gperftools 2.16, so the old tuning issues are moot.
+
+Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+---
+ meson.build  | 1 +
+ pve-backup.c | 8 ++++++++
+ 2 files changed, 9 insertions(+)
+
+diff --git a/meson.build b/meson.build
+index 0b28d2ec39..c6de2464d6 100644
+--- a/meson.build
++++ b/meson.build
+@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found())
+ config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found())
+ config_host_data.set('CONFIG_HOGWEED', hogweed.found())
+ config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc')
+ config_host_data.set('CONFIG_ZSTD', zstd.found())
+ config_host_data.set('CONFIG_QPL', qpl.found())
+ config_host_data.set('CONFIG_UADK', uadk.found())
+diff --git a/pve-backup.c b/pve-backup.c
+index ad0f8668fd..d5556f152b 100644
+--- a/pve-backup.c
++++ b/pve-backup.c
+@@ -19,6 +19,8 @@
+ 
+ #if defined(CONFIG_MALLOC_TRIM)
+ #include <malloc.h>
++#elif defined(CONFIG_TCMALLOC)
++#include <gperftools/malloc_extension_c.h>
+ #endif
+ 
+ #include <proxmox-backup-qemu.h>
+@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void)
+      * Won't happen by default if there is fragmentation.
+      */
+     malloc_trim(4 * 1024 * 1024);
++#elif defined(CONFIG_TCMALLOC)
++    /*
++     * Release free memory from tcmalloc's page cache back to the OS. This is
++     * allocator-aware and efficiently returns cached spans via madvise().
++     */
++    MallocExtension_ReleaseFreeMemory();
+ #endif
+ }
+ 
+-- 
+2.47.3
+
diff --git a/debian/patches/series b/debian/patches/series
index 8ed0c52..468df6c 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch
 pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch
 pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch
 pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch
+pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
-- 
2.47.3






* [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator
  2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
  2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
@ 2026-04-10  4:30 ` Kefu Chai
  2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
  2 siblings, 0 replies; 5+ messages in thread
From: Kefu Chai @ 2026-04-10  4:30 UTC (permalink / raw)
  To: pve-devel

Use tcmalloc (from gperftools) instead of glibc's allocator for
improved performance with workloads that create many small, short-lived
allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks
show ~50% IOPS improvement on 16KB random reads.

tcmalloc was originally used in 2015 but removed due to tuning issues
with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues
are long resolved.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 debian/control | 1 +
 debian/rules   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/debian/control b/debian/control
index 81cc026..a3121e1 100644
--- a/debian/control
+++ b/debian/control
@@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13),
                libfuse3-dev,
                libgbm-dev,
                libgnutls28-dev,
+               libgoogle-perftools-dev,
                libiscsi-dev (>= 1.12.0),
                libjpeg-dev,
                libjson-perl,
diff --git a/debian/rules b/debian/rules
index c90db29..a63e3a5 100755
--- a/debian/rules
+++ b/debian/rules
@@ -70,6 +70,7 @@ endif
 	    --enable-libusb \
 	    --enable-linux-aio \
 	    --enable-linux-io-uring \
+	    --enable-malloc=tcmalloc \
 	    --enable-numa \
 	    --enable-opengl \
 	    --enable-rbd \
-- 
2.47.3






* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
  2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
  2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
@ 2026-04-10  8:12 ` Fiona Ebner
  2026-04-10 10:45   ` DERUMIER, Alexandre
  2 siblings, 1 reply; 5+ messages in thread
From: Fiona Ebner @ 2026-04-10  8:12 UTC (permalink / raw)
  To: Kefu Chai, pve-devel

On 10.04.26 at 6:30 AM, Kefu Chai wrote:
> Following up on the RFC thread [0], here's the formal submission to
> re-enable tcmalloc for pve-qemu.
> 
> Quick recap: librbd's I/O path allocates a lot of small, short-lived
> objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
> and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
> arena contention and cache-line bouncing show up clearly in perf
> profiles. tcmalloc's per-thread fast path avoids both.
>
> A bit of history for context: tcmalloc was tried in 2015 but dropped
> after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
> jemalloc replaced it but was dropped in 2020 because it didn't
> release Rust-allocated memory (from proxmox-backup-qemu) back to the
> OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
> reclamation gap explicitly.
> 
> On Dietmar's two concerns from the RFC:
> 
> "Could ReleaseFreeMemory() halt the application?" -- No, and I
> verified this directly. It walks tcmalloc's page heap free span
> lists and calls madvise(MADV_DONTNEED) on each span. It does not
> walk allocated memory or compact the heap. A standalone test
> reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero
> wall time. The call runs once at backup completion, same spot where
> malloc_trim runs today.
> 
> "Wouldn't a pool allocator in librbd be the proper fix?" -- In
> principle yes, but I audited librbd in Ceph squid and it does NOT
> use a pool allocator -- all I/O path objects go through plain new.
> Ceph's mempool is tracking-only, not actual pooling. Adding real
> pooling would be a significant Ceph-side change (submission and
> completion happen on different threads), and it's orthogonal to the
> allocator choice here.
> 
> Also thanks to Alexandre for confirming the 2015 gperftools issues
> are long resolved.
> 
> Test results
> ------------
> 
> Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
> This is the worst case for showing allocator impact, since there's
> no network latency for CPU savings to amortize against:
> 
>   rbd bench --io-type read --io-size 4096 --io-threads 16 \
>             --io-pattern rand
> 
>   Metric         | glibc ptmalloc2 | tcmalloc  | Delta
>   ---------------+-----------------+-----------+--------
>   IOPS           |         131,201 |   136,389 |  +4.0%
>   CPU time       |        1,556 ms |  1,439 ms |  -7.5%
>   Cycles         |           6.74B |     6.06B | -10.1%
>   Cache misses   |          137.1M |    123.9M |  -9.6%
> 
> perf report on the glibc run shows ~8% of CPU in allocator internals
> (_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
> symbols are barely visible with tcmalloc because the fast path is
> just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
> production clusters where network RTT dominates per-I/O latency --
> the 10% CPU savings compound there since the host can push more I/O
> into the pipeline during the same wall time.

How does the performance change when doing IO within a QEMU guest?

How does this affect the performance for other storage types, like ZFS,
qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin,
etc. and other workloads like saving VM state during snapshot, transfer
during migration, maybe memory hotplug/ballooning, network performance
for vNICs?





* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
@ 2026-04-10 10:45   ` DERUMIER, Alexandre
  0 siblings, 0 replies; 5+ messages in thread
From: DERUMIER, Alexandre @ 2026-04-10 10:45 UTC (permalink / raw)
  To: pve-devel, f.ebner, k.chai


>> How does the performance change when doing IO within a QEMU guest?
>>
>> How does this affect the performance for other storage types, like ZFS,
>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin,
>> etc. and other workloads like saving VM state during snapshot, transfer
>> during migration, maybe memory hotplug/ballooning, network performance
>> for vNICs?

Hi Fiona,

I'm still running tcmalloc in production (I kept it after its removal
from the pve build some years ago), and I haven't noticed any problems
(but I still don't use PBS).

That said, I haven't benchmarked with/without it in 5-6 years.

Maybe VM memory should be checked too; I'm thinking of RSS with
balloon free_page_reporting, to verify that memory is correctly
freed.







end of thread, other threads:[~2026-04-10 10:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2026-04-10 10:45   ` DERUMIER, Alexandre
