* [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
@ 2026-04-10 4:30 Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw)
To: pve-devel
Following up on the RFC thread [0], here's the formal submission to
re-enable tcmalloc for pve-qemu.
Quick recap: librbd's I/O path allocates a lot of small, short-lived
objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
arena contention and cache-line bouncing show up clearly in perf
profiles. tcmalloc's per-thread fast path avoids both.
A bit of history for context: tcmalloc was tried in 2015 but dropped
after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
jemalloc replaced it but was dropped in 2020 because it didn't
release Rust-allocated memory (from proxmox-backup-qemu) back to the
OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
reclamation gap explicitly.
On Dietmar's two concerns from the RFC:
"Could ReleaseFreeMemory() halt the application?" -- No, and I
verified this directly. It walks tcmalloc's page heap free span
lists and calls madvise(MADV_DONTNEED) on each span. It does not
walk allocated memory or compact the heap. A standalone test
reclaimed 386 MB of 410 MB of cached memory (94%) in effectively zero
wall time. The call runs once at backup completion, in the same spot
where malloc_trim runs today.
"Wouldn't a pool allocator in librbd be the proper fix?" -- In
principle yes, but I audited librbd in Ceph squid and it does NOT
use a pool allocator -- all I/O path objects go through plain new.
Ceph's mempool is tracking-only, not actual pooling. Adding real
pooling would be a significant Ceph-side change (submission and
completion happen on different threads), and it's orthogonal to the
allocator choice here.
Also thanks to Alexandre for confirming the 2015 gperftools issues
are long resolved.
Test results
------------
Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
This is the worst case for showing allocator impact, since there's
no network latency for CPU savings to amortize against:
rbd bench --io-type read --io-size 4096 --io-threads 16 \
--io-pattern rand
Metric | glibc ptmalloc2 | tcmalloc | Delta
---------------+-----------------+-----------+--------
IOPS | 131,201 | 136,389 | +4.0%
CPU time | 1,556 ms | 1,439 ms | -7.5%
Cycles | 6.74B | 6.06B | -10.1%
Cache misses | 137.1M | 123.9M | -9.6%
perf report on the glibc run shows ~8% of CPU in allocator internals
(_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
symbols are barely visible with tcmalloc because the fast path is
just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
production clusters where network RTT dominates per-I/O latency --
the 10% CPU savings compound there since the host can push more I/O
into the pipeline during the same wall time.
The series is small:
1/2 adds the QEMU source patch (0048) with the CONFIG_TCMALLOC
meson define and the ReleaseFreeMemory() call in
pve-backup.c's cleanup path.
2/2 adds libgoogle-perftools-dev to Build-Depends and
--enable-malloc=tcmalloc to configure.
Runtime dep libgoogle-perftools4t64 (>= 2.16) is picked up
automatically by dh_shlibdeps.
[0]: https://lore.proxmox.com/pve-devel/DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com/
[1]: https://ceph.io/en/news/blog/2023/reef-freelist-bench/
Kefu Chai (2):
PVE: use tcmalloc as the memory allocator
d/rules: enable tcmalloc as the memory allocator
debian/control | 1 +
...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
debian/patches/series | 1 +
debian/rules | 1 +
4 files changed, 80 insertions(+)
create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
--
2.47.3
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai @ 2026-04-10 4:30 ` Kefu Chai 2026-04-13 13:12 ` Fiona Ebner 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai ` (2 subsequent siblings) 3 siblings, 1 reply; 16+ messages in thread From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw) To: pve-devel Add allocator-aware memory release in the backup completion path: since tcmalloc does not provide glibc's malloc_trim(), use the tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead. This function walks tcmalloc's page heap free span lists and calls madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact the heap, so latency impact is negligible. Also adds a CONFIG_TCMALLOC meson define so the conditional compilation in pve-backup.c can detect the allocator choice. Signed-off-by: Kefu Chai <k.chai@proxmox.com> --- ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++ debian/patches/series | 1 + 2 files changed, 78 insertions(+) create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch new file mode 100644 index 0000000..719d522 --- /dev/null +++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch @@ -0,0 +1,77 @@ +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 +From: Kefu Chai <k.chai@proxmox.com> +Date: Thu, 9 Apr 2026 17:29:10 +0800 +Subject: [PATCH] PVE: use tcmalloc as the memory allocator + +Use tcmalloc (from gperftools) as the memory allocator for improved +performance with workloads that create many small, short-lived +allocations -- particularly Ceph/librbd I/O paths. 
+ +tcmalloc uses per-thread caches and size-class freelists that handle +this allocation pattern more efficiently than glibc's allocator. Ceph +benchmarks show ~50% IOPS improvement on 16KB random reads. + +Since tcmalloc does not provide glibc's malloc_trim(), use the +tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release +cached memory back to the OS after backup completion. This function +walks tcmalloc's page heap free span lists and calls +madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact +the heap, so latency impact is negligible. + +Historical context: +- tcmalloc was originally enabled in 2015 but removed due to + performance issues with gperftools 2.2's default settings (low + TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit + disabled). These issues were resolved in gperftools 2.4+. +- jemalloc replaced tcmalloc but was removed in 2020 because it didn't + release memory allocated from Rust (proxmox-backup-qemu) back to the + OS. The allocator-specific release API addresses this. +- PVE 9 ships gperftools 2.16, so the old tuning issues are moot. 
+ +Signed-off-by: Kefu Chai <k.chai@proxmox.com> +--- + meson.build | 1 + + pve-backup.c | 8 ++++++++ + 2 files changed, 9 insertions(+) + +diff --git a/meson.build b/meson.build +index 0b28d2ec39..c6de2464d6 100644 +--- a/meson.build ++++ b/meson.build +@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found()) + config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found()) + config_host_data.set('CONFIG_HOGWEED', hogweed.found()) + config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim) ++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc') + config_host_data.set('CONFIG_ZSTD', zstd.found()) + config_host_data.set('CONFIG_QPL', qpl.found()) + config_host_data.set('CONFIG_UADK', uadk.found()) +diff --git a/pve-backup.c b/pve-backup.c +index ad0f8668fd..d5556f152b 100644 +--- a/pve-backup.c ++++ b/pve-backup.c +@@ -19,6 +19,8 @@ + + #if defined(CONFIG_MALLOC_TRIM) + #include <malloc.h> ++#elif defined(CONFIG_TCMALLOC) ++#include <gperftools/malloc_extension_c.h> + #endif + + #include <proxmox-backup-qemu.h> +@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void) + * Won't happen by default if there is fragmentation. + */ + malloc_trim(4 * 1024 * 1024); ++#elif defined(CONFIG_TCMALLOC) ++ /* ++ * Release free memory from tcmalloc's page cache back to the OS. This is ++ * allocator-aware and efficiently returns cached spans via madvise(). 
++ */ ++ MallocExtension_ReleaseFreeMemory(); + #endif + } + +-- +2.47.3 + diff --git a/debian/patches/series b/debian/patches/series index 8ed0c52..468df6c 100644 --- a/debian/patches/series +++ b/debian/patches/series @@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch +pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch -- 2.47.3 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai @ 2026-04-13 13:12 ` Fiona Ebner 0 siblings, 0 replies; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:12 UTC (permalink / raw) To: Kefu Chai, pve-devel It's preparation for using tcmalloc, not actually using it. I'd prefer a title like "add patch to support using tcmalloc as the memory allocator". Am 10.04.26 um 6:30 AM schrieb Kefu Chai: > Add allocator-aware memory release in the backup completion path: > since tcmalloc does not provide glibc's malloc_trim(), use the > tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead. > This function walks tcmalloc's page heap free span lists and calls > madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact > the heap, so latency impact is negligible. > > Also adds a CONFIG_TCMALLOC meson define so the conditional compilation > in pve-backup.c can detect the allocator choice. > > Signed-off-by: Kefu Chai <k.chai@proxmox.com> > --- > ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++ > debian/patches/series | 1 + > 2 files changed, 78 insertions(+) > create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > > diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > new file mode 100644 > index 0000000..719d522 > --- /dev/null > +++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > @@ -0,0 +1,77 @@ > +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 > +From: Kefu Chai <k.chai@proxmox.com> > +Date: Thu, 9 Apr 2026 17:29:10 +0800 > +Subject: [PATCH] PVE: use tcmalloc as the memory allocator Similar here. Also, it is specific to backup, I'd go for something like "PVE-Backup: support using tcmalloc as the memory allocator". 
> + > +Use tcmalloc (from gperftools) as the memory allocator for improved > +performance with workloads that create many small, short-lived > +allocations -- particularly Ceph/librbd I/O paths. > + > +tcmalloc uses per-thread caches and size-class freelists that handle > +this allocation pattern more efficiently than glibc's allocator. Ceph > +benchmarks show ~50% IOPS improvement on 16KB random reads. > + > +Since tcmalloc does not provide glibc's malloc_trim(), use the > +tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release > +cached memory back to the OS after backup completion. This function > +walks tcmalloc's page heap free span lists and calls > +madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact > +the heap, so latency impact is negligible. > + > +Historical context: > +- tcmalloc was originally enabled in 2015 but removed due to > + performance issues with gperftools 2.2's default settings (low > + TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit > + disabled). These issues were resolved in gperftools 2.4+. > +- jemalloc replaced tcmalloc but was removed in 2020 because it didn't > + release memory allocated from Rust (proxmox-backup-qemu) back to the > + OS. The allocator-specific release API addresses this. > +- PVE 9 ships gperftools 2.16, so the old tuning issues are moot. 
> + > +Signed-off-by: Kefu Chai <k.chai@proxmox.com> > +--- > + meson.build | 1 + > + pve-backup.c | 8 ++++++++ > + 2 files changed, 9 insertions(+) > + > +diff --git a/meson.build b/meson.build > +index 0b28d2ec39..c6de2464d6 100644 > +--- a/meson.build > ++++ b/meson.build > +@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found()) > + config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found()) > + config_host_data.set('CONFIG_HOGWEED', hogweed.found()) > + config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim) > ++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc') > + config_host_data.set('CONFIG_ZSTD', zstd.found()) > + config_host_data.set('CONFIG_QPL', qpl.found()) > + config_host_data.set('CONFIG_UADK', uadk.found()) > +diff --git a/pve-backup.c b/pve-backup.c > +index ad0f8668fd..d5556f152b 100644 > +--- a/pve-backup.c > ++++ b/pve-backup.c > +@@ -19,6 +19,8 @@ > + > + #if defined(CONFIG_MALLOC_TRIM) > + #include <malloc.h> > ++#elif defined(CONFIG_TCMALLOC) > ++#include <gperftools/malloc_extension_c.h> > + #endif > + > + #include <proxmox-backup-qemu.h> > +@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void) > + * Won't happen by default if there is fragmentation. > + */ > + malloc_trim(4 * 1024 * 1024); > ++#elif defined(CONFIG_TCMALLOC) > ++ /* > ++ * Release free memory from tcmalloc's page cache back to the OS. This is > ++ * allocator-aware and efficiently returns cached spans via madvise(). 
> ++ */ > ++ MallocExtension_ReleaseFreeMemory(); > + #endif > + } > + > +-- > +2.47.3 > + > diff --git a/debian/patches/series b/debian/patches/series > index 8ed0c52..468df6c 100644 > --- a/debian/patches/series > +++ b/debian/patches/series > @@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch > pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch > pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch > pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch > +pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai @ 2026-04-10 4:30 ` Kefu Chai 2026-04-13 13:13 ` Fiona Ebner 2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner 2026-04-14 5:48 ` superseded: " Kefu Chai 3 siblings, 1 reply; 16+ messages in thread From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw) To: pve-devel Use tcmalloc (from gperftools) instead of glibc's allocator for improved performance with workloads that create many small, short-lived allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks show ~50% IOPS improvement on 16KB random reads. tcmalloc was originally used in 2015 but removed due to tuning issues with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues are long resolved. Signed-off-by: Kefu Chai <k.chai@proxmox.com> --- debian/control | 1 + debian/rules | 1 + 2 files changed, 2 insertions(+) diff --git a/debian/control b/debian/control index 81cc026..a3121e1 100644 --- a/debian/control +++ b/debian/control @@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13), libfuse3-dev, libgbm-dev, libgnutls28-dev, + libgoogle-perftools-dev, libiscsi-dev (>= 1.12.0), libjpeg-dev, libjson-perl, diff --git a/debian/rules b/debian/rules index c90db29..a63e3a5 100755 --- a/debian/rules +++ b/debian/rules @@ -70,6 +70,7 @@ endif --enable-libusb \ --enable-linux-aio \ --enable-linux-io-uring \ + --enable-malloc=tcmalloc \ --enable-numa \ --enable-opengl \ --enable-rbd \ -- 2.47.3 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai @ 2026-04-13 13:13 ` Fiona Ebner 2026-04-13 13:23 ` DERUMIER, Alexandre 0 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:13 UTC (permalink / raw) To: Kefu Chai, pve-devel Am 10.04.26 um 6:29 AM schrieb Kefu Chai: > Use tcmalloc (from gperftools) instead of glibc's allocator for > improved performance with workloads that create many small, short-lived > allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks > show ~50% IOPS improvement on 16KB random reads. > > tcmalloc was originally used in 2015 but removed due to tuning issues > with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues > are long resolved. > > Signed-off-by: Kefu Chai <k.chai@proxmox.com> > --- > debian/control | 1 + > debian/rules | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/debian/control b/debian/control > index 81cc026..a3121e1 100644 > --- a/debian/control > +++ b/debian/control > @@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13), > libfuse3-dev, > libgbm-dev, > libgnutls28-dev, > + libgoogle-perftools-dev, > libiscsi-dev (>= 1.12.0), > libjpeg-dev, > libjson-perl, Please also add the runtime dependency to Depends: > diff --git a/debian/rules b/debian/rules > index c90db29..a63e3a5 100755 > --- a/debian/rules > +++ b/debian/rules > @@ -70,6 +70,7 @@ endif > --enable-libusb \ > --enable-linux-aio \ > --enable-linux-io-uring \ > + --enable-malloc=tcmalloc \ > --enable-numa \ > --enable-opengl \ > --enable-rbd \ ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:13 ` Fiona Ebner @ 2026-04-13 13:23 ` DERUMIER, Alexandre 2026-04-13 13:40 ` Fabian Grünbichler 0 siblings, 1 reply; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-13 13:23 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai >>Please also add the runtime dependency to Depends: I was going to say the same, it should depend on "libgoogle-perftools4t64" ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:23 ` DERUMIER, Alexandre @ 2026-04-13 13:40 ` Fabian Grünbichler 2026-04-14 8:10 ` Fiona Ebner 0 siblings, 1 reply; 16+ messages in thread From: Fabian Grünbichler @ 2026-04-13 13:40 UTC (permalink / raw) To: DERUMIER, Alexandre; +Cc: pve-devel Quoting DERUMIER, Alexandre (2026-04-13 15:23:28) > >>Please also add the runtime dependency to Depends: > > I was going to say the same, > > it should depend on "libgoogle-perftools4t64" that is not needed. in general, for properly packaged libraries it is enough to build-depend on the -dev package, and have ${shlibs:Depends} in the binary package Depends (with this series applied and built): new Debian package, version 2.0. size 32266824 bytes: control archive=13756 bytes. 38 bytes, 2 lines conffiles 1732 bytes, 18 lines control 40920 bytes, 494 lines md5sums Package: pve-qemu-kvm Version: 10.2.1-1 Architecture: amd64 Maintainer: Proxmox Support Team <support@proxmox.com> Installed-Size: 422145 Depends: ceph-common (>= 0.48), fuse3, iproute2, libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16), libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), libusbredirparser1t64 (>= 0.8.0), libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 
1:1.2.0) Recommends: numactl Suggests: libgl1 Conflicts: kvm, pve-kvm, pve-qemu-kvm-2.6.18, qemu, qemu-kvm, qemu-system-arm, qemu-system-common, qemu-system-data, qemu-system-x86, qemu-utils Breaks: qemu-server (<= 8.0.6) Replaces: pve-kvm, pve-qemu-kvm-2.6.18, qemu-system-arm, qemu-system-x86, qemu-utils Provides: qemu-system-arm, qemu-system-x86, qemu-utils Section: admin Priority: optional Description: Full virtualization on x86 hardware Using KVM, one can run multiple virtual PCs, each running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc. if the library is packaged using symbol versioning, it even adds the proper lower version bound ;) in fact, with the following: diff --git a/debian/control b/debian/control index a3121e1..0a3ffbd 100644 --- a/debian/control +++ b/debian/control @@ -49,12 +49,6 @@ Architecture: any Depends: ceph-common (>= 0.48), fuse3, iproute2, - libiscsi4 (>= 1.12.0) | libiscsi7, - libjpeg62-turbo, - libspice-server1 (>= 0.14.0~), - libusb-1.0-0 (>= 1.0.17-1), - libusbredirparser1 (>= 0.6-2), - libuuid1, ${misc:Depends}, ${shlibs:Depends}, Recommends: numactl, we get even more accurate Depends: and nicer sorting, and less work in case one of those libraries goes through a transition in Forky: $ debdiff ... 
File lists identical (after any substitutions) Control files: lines which differ (wdiff format) ------------------------------------------------ Depends: ceph-common (>= 0.48), fuse3, iproute2, [-libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16),-] libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), {+libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1),+} libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), {+libspice-server1 (>= 0.14.2),+} libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), {+libusb-1.0-0 (>= 2:1.0.23~),+} libusbredirparser1t64 (>= 0.8.0), {+libuuid1 (>= 2.16),+} libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:40 ` Fabian Grünbichler @ 2026-04-14 8:10 ` Fiona Ebner 0 siblings, 0 replies; 16+ messages in thread From: Fiona Ebner @ 2026-04-14 8:10 UTC (permalink / raw) To: Fabian Grünbichler, DERUMIER, Alexandre; +Cc: pve-devel Am 13.04.26 um 3:39 PM schrieb Fabian Grünbichler: > Quoting DERUMIER, Alexandre (2026-04-13 15:23:28) >>>> Please also add the runtime dependency to Depends: >> >> I was going to say the same, >> >> it should depend on "libgoogle-perftools4t64" > > that is not needed. in general, for properly packaged libraries it is enough to > build-depend on the -dev package, and have ${shlibs:Depends} in the binary > package Depends (with this series applied and built): > > new Debian package, version 2.0. > size 32266824 bytes: control archive=13756 bytes. > 38 bytes, 2 lines conffiles > 1732 bytes, 18 lines control > 40920 bytes, 494 lines md5sums > Package: pve-qemu-kvm > Version: 10.2.1-1 > Architecture: amd64 > Maintainer: Proxmox Support Team <support@proxmox.com> > Installed-Size: 422145 > Depends: ceph-common (>= 0.48), fuse3, iproute2, libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16), libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), libusbredirparser1t64 (>= 0.8.0), libvirglrenderer1 (>= 1.0.0), 
libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) > Recommends: numactl > Suggests: libgl1 > Conflicts: kvm, pve-kvm, pve-qemu-kvm-2.6.18, qemu, qemu-kvm, qemu-system-arm, qemu-system-common, qemu-system-data, qemu-system-x86, qemu-utils > Breaks: qemu-server (<= 8.0.6) > Replaces: pve-kvm, pve-qemu-kvm-2.6.18, qemu-system-arm, qemu-system-x86, qemu-utils > Provides: qemu-system-arm, qemu-system-x86, qemu-utils > Section: admin > Priority: optional > Description: Full virtualization on x86 hardware > Using KVM, one can run multiple virtual PCs, each running unmodified Linux or > Windows images. Each virtual machine has private virtualized hardware: a > network card, disk, graphics adapter, etc. > > if the library is packaged using symbol versioning, it even adds the proper lower version bound ;) > > in fact, with the following: > > diff --git a/debian/control b/debian/control > index a3121e1..0a3ffbd 100644 > --- a/debian/control > +++ b/debian/control > @@ -49,12 +49,6 @@ Architecture: any > Depends: ceph-common (>= 0.48), > fuse3, > iproute2, > - libiscsi4 (>= 1.12.0) | libiscsi7, > - libjpeg62-turbo, > - libspice-server1 (>= 0.14.0~), > - libusb-1.0-0 (>= 1.0.17-1), > - libusbredirparser1 (>= 0.6-2), > - libuuid1, > ${misc:Depends}, > ${shlibs:Depends}, > Recommends: numactl, > > we get even more accurate Depends: and nicer sorting, and less work in case one > of those libraries goes through a transition in Forky: > > $ debdiff ... 
> File lists identical (after any substitutions) > > Control files: lines which differ (wdiff format) > ------------------------------------------------ > Depends: ceph-common (>= 0.48), fuse3, iproute2, [-libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16),-] libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), {+libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1),+} libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), {+libspice-server1 (>= 0.14.2),+} libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), {+libusb-1.0-0 (>= 2:1.0.23~),+} libusbredirparser1t64 (>= 0.8.0), {+libuuid1 (>= 2.16),+} libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) Ah, good to know! Let's do that then :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai @ 2026-04-10 8:12 ` Fiona Ebner 2026-04-10 10:45 ` DERUMIER, Alexandre 2026-04-14 5:48 ` superseded: " Kefu Chai 3 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-10 8:12 UTC (permalink / raw) To: Kefu Chai, pve-devel Am 10.04.26 um 6:30 AM schrieb Kefu Chai: > Following up on the RFC thread [0], here's the formal submission to > re-enable tcmalloc for pve-qemu. > > Quick recap: librbd's I/O path allocates a lot of small, short-lived > objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.), > and glibc's ptmalloc2 handles this pattern poorly -- cross-thread > arena contention and cache-line bouncing show up clearly in perf > profiles. tcmalloc's per-thread fast path avoids both. > > A bit of history for context: tcmalloc was tried in 2015 but dropped > after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+). > jemalloc replaced it but was dropped in 2020 because it didn't > release Rust-allocated memory (from proxmox-backup-qemu) back to the > OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the > reclamation gap explicitly. > > On Dietmar's two concerns from the RFC: > > "Could ReleaseFreeMemory() halt the application?" -- No, and I > verified this directly. It walks tcmalloc's page heap free span > lists and calls madvise(MADV_DONTNEED) on each span. It does not > walk allocated memory or compact the heap. A standalone test > reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero > wall time. The call runs once at backup completion, same spot where > malloc_trim runs today. > > "Wouldn't a pool allocator in librbd be the proper fix?" 
-- In > principle yes, but I audited librbd in Ceph squid and it does NOT > use a pool allocator -- all I/O path objects go through plain new. > Ceph's mempool is tracking-only, not actual pooling. Adding real > pooling would be a significant Ceph-side change (submission and > completion happen on different threads), and it's orthogonal to the > allocator choice here. > > Also thanks to Alexandre for confirming the 2015 gperftools issues > are long resolved. > > Test results > ------------ > > Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe). > This is the worst case for showing allocator impact, since there's > no network latency for CPU savings to amortize against: > > rbd bench --io-type read --io-size 4096 --io-threads 16 \ > --io-pattern rand > > Metric | glibc ptmalloc2 | tcmalloc | Delta > ---------------+-----------------+-----------+-------- > IOPS | 131,201 | 136,389 | +4.0% > CPU time | 1,556 ms | 1,439 ms | -7.5% > Cycles | 6.74B | 6.06B | -10.1% > Cache misses | 137.1M | 123.9M | -9.6% > > perf report on the glibc run shows ~8% of CPU in allocator internals > (_int_malloc, cfree, malloc_consolidate, _int_free_*); the same > symbols are barely visible with tcmalloc because the fast path is > just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on > production clusters where network RTT dominates per-I/O latency -- > the 10% CPU savings compound there since the host can push more I/O > into the pipeline during the same wall time. How does the performance change when doing IO within a QEMU guest? How does this affect the performance for other storage types, like ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin, etc. and other workloads like saving VM state during snapshot, transfer during migration, maybe memory hotplug/ballooning, network performance for vNICs? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner @ 2026-04-10 10:45 ` DERUMIER, Alexandre 2026-04-13 8:14 ` Fiona Ebner 0 siblings, 1 reply; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-10 10:45 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai >>How does the performance change when doing IO within a QEMU guest? >> >>How does this affect the performance for other storage types, like >>ZFS, >>qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>thin, >>etc. and other workloads like saving VM state during snapshot, >>transfer >>during migration, maybe memory hotplug/ballooning, network >>performance >>for vNICs? Hi Fiona, I'm stil running in production (I have keeped tcmalloc after the removal some year ago from the pve build), and I didn't notice problem. (but I still don't use pbs). But I never have done bench with/without it since 5/6 year. Maybe vm memory should be checked too, I'm thinking about RSS memory with balloon free_page_reporting, to see if it's correcting freeing memory. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 10:45 ` DERUMIER, Alexandre @ 2026-04-13 8:14 ` Fiona Ebner 2026-04-13 11:13 ` Kefu Chai 0 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 8:14 UTC (permalink / raw) To: DERUMIER, Alexandre, pve-devel, k.chai Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: > >>> How does the performance change when doing IO within a QEMU guest? >>> >>> How does this affect the performance for other storage types, like >>> ZFS, >>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>> thin, >>> etc. and other workloads like saving VM state during snapshot, >>> transfer >>> during migration, maybe memory hotplug/ballooning, network >>> performance >>> for vNICs? > > Hi Fiona, > > I'm stil running in production (I have keeped tcmalloc after the > removal some year ago from the pve build), and I didn't notice problem. > (but I still don't use pbs). > > But I never have done bench with/without it since 5/6 year. > > Maybe vm memory should be checked too, I'm thinking about RSS memory > with balloon free_page_reporting, to see if it's correcting freeing > memory. Thanks! If it was running fine for you for this long, that is a very good data point :) Still, testing different scenarios/configurations would be nice to rule out that there is a (performance) regression somewhere else. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-13  8:14 ` Fiona Ebner
@ 2026-04-13 11:13   ` Kefu Chai
  2026-04-13 13:14     ` DERUMIER, Alexandre
  2026-04-13 13:18     ` Fiona Ebner
  0 siblings, 2 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-13 11:13 UTC (permalink / raw)
To: Fiona Ebner, DERUMIER, Alexandre, pve-devel

On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote:
> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre:
>>
>>>> How does the performance change when doing IO within a QEMU guest?
>>>>
>>>> How does this affect the performance for other storage types, like
>>>> ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM,
>>>> LVM-thin, etc. and other workloads like saving VM state during
>>>> snapshot, transfer during migration, maybe memory hotplug/ballooning,
>>>> network performance for vNICs?

Hi Fiona,

Thanks for the questions.

I traced QEMU's source code. It turns out that guest RAM is allocated
via direct mmap() calls, which completely bypass QEMU's C library
allocator. The path looks like:

-m 4G on the command line:

  memory_region_init_ram_flags_nomigrate()
    qemu_ram_alloc()
      qemu_ram_alloc_internal()
        g_malloc0(sizeof(*new_block))  <-- only the RAMBlock metadata
                                           struct, about 512 bytes
        ram_block_add()
          qemu_anon_ram_alloc()
            qemu_ram_mmap(-1, size, ...)
              mmap_reserve(total)
                mmap(0, size, PROT_NONE, ...)  <-- reserve address space
              mmap_activate(ptr, size, ...)
                mmap(ptr, size, PROT_RW, ...)  <-- actual guest RAM

So the gigabytes of guest memory go straight to the kernel via mmap()
and never touch malloc or tcmalloc. Only the small RAMBlock metadata
structure (~512 bytes per region) goes through g_malloc0().

In other words, tcmalloc's scope is limited to QEMU's own working
memory: block layer buffers, coroutine stacks, internal data
structures, and so on. It does not touch guest RAM at all.

The inflate/deflate path of the balloon code does involve glib malloc.
It's virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0()
to track partially-ballooned pages. But these bitmaps track only
metadata, their memory footprint is small, and they are only allocated
during infrequent operations.

Hopefully this addresses the balloon concern. If I've missed something
or misread the code, please point it out.

As for different storage types and workloads, I ran benchmarks covering
some of the scenarios you listed. All comparisons use the same binary
(pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4,
so the allocator is the only variable.

Storage backends (block layer, via qemu-img bench)

  4K reads, depth=32, io_uring, cache=none, 5M ops per run, best of 3.
  Host NVMe (ext4) or LVM-thin backed by NVMe.

  backend              glibc ops/s   tcmalloc ops/s   delta
  qcow2 on ext4          1,188,495        1,189,343   +0.1%
  qcow2 on ext4 write    1,036,914        1,036,699    0.0%
  raw on ext4            1,263,583        1,277,465   +1.1%
  raw on LVM-thin          433,727          433,576    0.0%

  raw/LVM-thin is slower because, I *think*, it actually hits the
  dm-thin layer rather than the page cache. The allocator delta is
  noise in all cases.

RBD, tested via a local vstart cluster (3 OSDs, bluestore on NVMe):

  path                        glibc   tcmalloc   delta
  qemu-img bench + librbd    19,111     19,156   +0.2%
  rbd bench (librbd direct)  35,329     36,622   +3.7%

I don't have ZFS configured on this (host) machine, but QEMU's file I/O
path to ZFS goes through the same code as ext4, so the difference is in
the kernel. I'd expect the same null result, though I could be wrong
and am happy to set up a test pool if you'd like to see the numbers.

Guest I/O (Debian 13 guest, virtio-blk)

  Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, qcow2
  on ext4, cache=none, aio=native) against a second virtio disk
  (/dev/vdb, 8 GB qcow2).
  workload                                glibc      tcmalloc    delta
  dd if=/dev/vdb bs=4M count=1024        15.3 GB/s   15.5 GB/s    +1.3%
    iflag=direct (sequential)
  8x parallel dd if=/dev/vdb bs=4k        787 MB/s    870 MB/s   +10.5%
    count=100k iflag=direct
    (each starting at a different offset)

Migration and savevm

  Tested with a 4 GB guest where about 2 GB of RAM was dirtied (filled
  /dev/shm with urandom, 8x256 MB) before triggering each operation.

  scenario               glibc     tcmalloc   delta
  migrate (exec: URI)    0.622 s    0.622 s    0.0%
  savevm (qcow2 snap)    0.503 s    0.504 s   +0.2%

What I didn't measure

I didn't test ZFS/vNIC throughput, as the hot path lives on the kernel
side while QEMU only handles the control plane. Hence I'd expect very
little allocator impact there, though I could be wrong.

I also didn't test memory hotplug or ballooning cases because, as
explained above, these are rare one-shot operations, and as the source
code trace shows, malloc is not involved. But again, happy to look into
it if it's still a concern.

>>
>> Hi Fiona,
>>
>> I'm stil running in production (I have keeped tcmalloc after the
>> removal some year ago from the pve build), and I didn't notice problem.
>> (but I still don't use pbs).
>>
>> But I never have done bench with/without it since 5/6 year.

And thanks Alexandre for sharing your production experience, that's
very valuable context.

>>
>> Maybe vm memory should be checked too, I'm thinking about RSS memory
>> with balloon free_page_reporting, to see if it's correcting freeing
>> memory.

For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path is:

  virtio_balloon_handle_report()
    ram_block_discard_range()
      madvise(host_startaddr, length, QEMU_MADV_DONTNEED)

This operates directly on the mmap'ed guest RAM region. Still no malloc
involvement anywhere in this path.

In short, the tests above show a consistent pattern: tcmalloc helps
where allocation pressure is high and is neutral everywhere else.
Please let me know if you'd like more data on any specific workload, or
if there is anything I overlooked.
I appreciate the careful review and insights.

> Thanks! If it was running fine for you for this long, that is a very
> good data point :) Still, testing different scenarios/configurations
> would be nice to rule out that there is a (performance) regression
> somewhere else.

cheers,
Kefu

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 11:13 ` Kefu Chai @ 2026-04-13 13:14 ` DERUMIER, Alexandre 2026-04-13 13:18 ` Fiona Ebner 1 sibling, 0 replies; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-13 13:14 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai Hi, I have done a small test on a free cluster (this was with old xeon v3 with 18 ssd). For 4k randwrite, I don't see difference (but I think I'm write limited by the ssd), I'm around 40k~50k iops for 4k randread, I see 30% improvement (~70k iops -> ~100kiops) test: virtio-scsi + iothread + cache=none without tcmalloc: # fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=64 --ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][20%][r=273MiB/s][r=69.9k IOPS][eta 05m:12s] # fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --iodepth=64 --ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][13.6%][r=273MiB/s][r=69.9k IOPS][eta 05m:12s] fio: terminating on signal 2 test: (groupid=0, jobs=1): err= 0: pid=929282: Mon Apr 13 15:01:33 2026 read: IOPS=72.7k, BW=284MiB/s (298MB/s)(13.7GiB/49316msec) slat (usec): min=3, max=917, avg= 6.00, stdev= 3.66 clat (usec): min=169, max=117369, avg=873.83, stdev=521.46 lat (usec): min=180, max=117377, avg=879.83, stdev=521.44 clat percentiles (usec): | 1.00th=[ 408], 5.00th=[ 510], 10.00th=[ 578], 20.00th=[ 668], | 30.00th=[ 734], 40.00th=[ 799], 50.00th=[ 848], 60.00th=[ 906], | 70.00th=[ 971], 80.00th=[ 1057], 90.00th=[ 1172], 95.00th=[ 1303], | 99.00th=[ 1598], 99.50th=[ 1762], 99.90th=[ 2311], 99.95th=[ 2638], | 99.99th=[ 4490] bw ( KiB/s): min=228520, max=315464, per=100.00%, avg=291059.43, stdev=14202.45, samples=98 iops : min=57130, 
max=78866, avg=72764.90, stdev=3550.61, samples=98 lat (usec) : 250=0.01%, 500=4.32%, 750=27.98%, 1000=41.61% lat (msec) : 2=25.85%, 4=0.21%, 10=0.01%, 100=0.01%, 250=0.01% cpu : usr=15.15%, sys=50.34%, ctx=226303, majf=0, minf=4301 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=3584067,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=284MiB/s (298MB/s), 284MiB/s-284MiB/s (298MB/s-298MB/s), io=13.7GiB (14.7GB), run=49316-49316msec Disk stats (read/write): sdc: ios=3571207/0, merge=0/0, ticks=2868237/0, in_queue=2868237, util=99.90% with tcmalloc ------------- fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --iodepth=64 - -ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][19.0%][r=383MiB/s][r=98.0k IOPS][eta 03m:42s] test: (groupid=0, jobs=1): err= 0: pid=1293: Mon Apr 13 15:05:04 2026 read: IOPS=95.7k, BW=374MiB/s (392MB/s)(19.1GiB/52256msec) slat (usec): min=2, max=1605, avg= 5.94, stdev= 3.93 clat (usec): min=150, max=7426, avg=661.83, stdev=204.22 lat (usec): min=157, max=7433, avg=667.77, stdev=204.42 clat percentiles (usec): | 1.00th=[ 347], 5.00th=[ 412], 10.00th=[ 453], 20.00th=[ 506], | 30.00th=[ 553], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 668], | 70.00th=[ 717], 80.00th=[ 791], 90.00th=[ 906], 95.00th=[ 1020], | 99.00th=[ 1336], 99.50th=[ 1500], 99.90th=[ 1909], 99.95th=[ 2114], | 99.99th=[ 3359] bw ( KiB/s): min=324264, max=442120, per=100.00%, avg=383405.95, stdev=20263.50, samples=104 iops : min=81066, max=110530, avg=95851.60, stdev=5065.87, samples=104 lat (usec) : 250=0.02%, 500=18.47%, 750=56.25%, 1000=19.54% lat (msec) : 2=5.64%, 
4=0.06%, 10=0.01% cpu : usr=19.61%, sys=65.54%, ctx=262129, majf=0, minf=4296 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=5002932,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=374MiB/s (392MB/s), 374MiB/s-374MiB/s (392MB/s-392MB/s), io=19.1GiB (20.5GB), run=52256-52256msec Disk stats (read/write): sda: ios=4990801/0, merge=0/0, ticks=2862837/0, in_queue=2862837, util=99.91% -------- Message initial -------- De: Kefu Chai <k.chai@proxmox.com> À: Fiona Ebner <f.ebner@proxmox.com>, "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>, pve-devel@lists.proxmox.com <pve-devel@lists.proxmox.com> Objet: Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Date: 13/04/2026 13:13:12 On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: > Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: > > > > > > How does the performance change when doing IO within a QEMU > > > > guest? > > > > > > > > How does this affect the performance for other storage types, > > > > like > > > > ZFS, > > > > qcow2 on top of directory-based storages, qcow2 on top of LVM, > > > > LVM- > > > > thin, > > > > etc. and other workloads like saving VM state during snapshot, > > > > transfer > > > > during migration, maybe memory hotplug/ballooning, network > > > > performance > > > > for vNICs? Hi Fiona, Thanks for the questions. I traced QEMU's source code. It turns out that guest's RAM is allocated via direct mmap() calls, which completely bypassing QEMU's C library allocator. 
The path looks like: -m 4G on command line: memory_region_init_ram_flags_nomigrate() qemu_ram_alloc() qemu_ram_alloc_internal() g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata struct, about 512 bytes ram_block_add() qemu_anon_ram_alloc() qemu_ram_mmap(-1, size, ...) mmap_reserve(total) mmap(0, size, PROT_NONE, ...) <-- reserve address space mmap_activate(ptr, size, ...) mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM So the gigabytes of guest memory go straight to the kernel via mmap() and never touch malloc or tcmalloc. Only the small RAMBlock metadata structure (~512 bytes per region) goes through g_malloc0(). In other words, tcmalloc's scope is limited to QEMU's own working memory, among other things, block layer buffers, coroutine stacks, internal data structs. It does not touch gues RAM at all. The inflat/deflate path of balloon code does involve glib malloc. It's virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to track partially-ballooned pages. But these bitmaps tracks only metadata, whose memory footprints are relatively small. And this only happens in infrequent operations. Hopefully, this addresses the balloon concern. If I've missed something or misread the code, please help point it out. When it comes to different storage types and workloads, I ran benchmarks covering some scenarios you listed. All comparison use the same binary (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so the allocator is the only variable. Storage backends (block layer, via qemu-img bench) 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. backend glibc ops/s tcmalloc ops/s delta qcow2 on ext4 1,188,495 1,189,343 +0.1% qcow2 on ext4 write 1,036,914 1,036,699 0.0% raw on ext4 1,263,583 1,277,465 +1.1% raw on LVM-thin 433,727 433,576 0.0% The reason why raw/LVM-thin is slower, is that, I *think* it actually hits the dm-thin layer rather than page cache. 
The allocator delta is noise in all cases. RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): path glibc tcmalloc delta qemu-img bench + librbd 19,111 19,156 +0.2% rbd bench (librbd direct) 35,329 36,622 +3.7% I don't have ZFS configured on this (host) machine, but QEMU's file I/O path to ZFS goes through the same code as ext4, so the difference is in the kernel. I'd expect the same null result, though I could be wrong and am happy to set up a test pool if you'd like to see the numbers. Guest I/O (Debian 13 guest, virtio-blk) Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, qcow2 on ext4, cache=none, aio=native) against a second virtio disk (/dev/vdb, 8 GB qcow2). workload glibc tcmalloc delta dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% iflag=direct (sequential) 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% count=100k iflag=direct (each starting at a different offset) Migration and savevm Tested with a 4 GB guest where about 2 GB of RAM was dirtied (filled /dev/shm with urandom, 8x256 MB) before triggering each operation. scenario glibc tcmalloc delta migrate (exec: URI) 0.622 s 0.622 s 0.0% savevm (qcow2 snap) 0.503 s 0.504 s +0.2% What I didn't measure I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel side, while QEMU only handles the control plane. Hence I'd expect very little allocator impact there, though I could be wrong. I also didn't test memory hotplug on ballooning cases, because, as explained above, these are rare one-shot operations. and as the source code trace shows malloc is not involved. But again, happy to look into it, if it's still a concern. > > > > Hi Fiona, > > > > I'm stil running in production (I have keeped tcmalloc after the > > removal some year ago from the pve build), and I didn't notice > > problem. > > (but I still don't use pbs). > > > > But I never have done bench with/without it since 5/6 year. 
And thanks Alexandre for sharing your production experience, that's very valuable context. > > > > Maybe vm memory should be checked too, I'm thinking about RSS > > memory > > with balloon free_page_reporting, to see if it's correcting freeing > > memory. For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path: virtual_balloon_handle_report() ram_block_discard_range() madvise(host_startaddr, kebngth, QEMU_MADV_DONTNEED) This operates directly on the mmap'ed guest RAM region. Still no malloc involvement anywhere in this path. In short, the tests above reveal a consistent pattern: tcmalloc helps where allocation pressure is high, and is neutral everywhere else. Please let me know if you'd like more data on any specific workload, or if there is anything I overlooked. I appreciate the careful review and insights. > > Thanks! If it was running fine for you for this long, that is a very > good data point :) Still, testing different scenarios/configurations > would be nice to rule out that there is a (performance) regression > somewhere else. cheers, Kefu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 11:13 ` Kefu Chai 2026-04-13 13:14 ` DERUMIER, Alexandre @ 2026-04-13 13:18 ` Fiona Ebner 2026-04-14 5:41 ` Kefu Chai 1 sibling, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:18 UTC (permalink / raw) To: Kefu Chai, DERUMIER, Alexandre, pve-devel Am 13.04.26 um 1:12 PM schrieb Kefu Chai: > On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: >> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: >>> >>>>> How does the performance change when doing IO within a QEMU guest? >>>>> >>>>> How does this affect the performance for other storage types, like >>>>> ZFS, >>>>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>>>> thin, >>>>> etc. and other workloads like saving VM state during snapshot, >>>>> transfer >>>>> during migration, maybe memory hotplug/ballooning, network >>>>> performance >>>>> for vNICs? > > Hi Fiona, > > Thanks for the questions. > > I traced QEMU's source code. It turns out that guest's RAM is allocated > via direct mmap() calls, which completely bypassing QEMU's C library > allocator. The path looks like: > > -m 4G on command line: > > memory_region_init_ram_flags_nomigrate() > qemu_ram_alloc() > qemu_ram_alloc_internal() > g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata > struct, about 512 bytes > ram_block_add() > qemu_anon_ram_alloc() > qemu_ram_mmap(-1, size, ...) > mmap_reserve(total) > mmap(0, size, PROT_NONE, ...) <-- reserve address space > mmap_activate(ptr, size, ...) > mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM > > So the gigabytes of guest memory go straight to the kernel via mmap() > and never touch malloc or tcmalloc. Only the small RAMBlock metadata > structure (~512 bytes per region) goes through g_malloc0(). > > In other words, tcmalloc's scope is limited to QEMU's own working > memory, among other things, block layer buffers, coroutine stacks, > internal data structs. 
It does not touch gues RAM at all. > > The inflat/deflate path of balloon code does involve glib malloc. It's > virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to > track partially-ballooned pages. But these bitmaps tracks only metadata, > whose memory footprints are relatively small. And this only happens in > infrequent operations. > > Hopefully, this addresses the balloon concern. If I've missed something > or misread the code, please help point it out. > > When it comes to different storage types and workloads, I ran benchmarks > covering some scenarios you listed. All comparison use the same binary > (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so > the allocator is the only variable. > > > Storage backends (block layer, via qemu-img bench) > > 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best > of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. > > backend glibc ops/s tcmalloc ops/s delta > qcow2 on ext4 1,188,495 1,189,343 +0.1% > qcow2 on ext4 write 1,036,914 1,036,699 0.0% > raw on ext4 1,263,583 1,277,465 +1.1% > raw on LVM-thin 433,727 433,576 0.0% > > The reason why raw/LVM-thin is slower, is that, I *think* it actually > hits the dm-thin layer rather than page cache. The allocator delta is > noise in all cases. > > RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): > > path glibc tcmalloc delta > qemu-img bench + librbd 19,111 19,156 +0.2% > rbd bench (librbd direct) 35,329 36,622 +3.7% > > I don't have ZFS configured on this (host) machine, but QEMU's file I/O > path to ZFS goes through the same code as ext4, so the difference > is in the kernel. I'd expect the same null result, > though I could be wrong and am happy to set up a test pool if > you'd like to see the numbers. > > Guest I/O (Debian 13 guest, virtio-blk) > > Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, > qcow2 on ext4, cache=none, aio=native) against a second virtio > disk (/dev/vdb, 8 GB qcow2). 
> > workload glibc tcmalloc delta > dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% > iflag=direct (sequential) > 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% > count=100k iflag=direct > (each starting at a different offset) > > Migration and savevm > > Tested with a 4 GB guest where about 2 GB of RAM was dirtied > (filled /dev/shm with urandom, 8x256 MB) before triggering each > operation. > > scenario glibc tcmalloc delta > migrate (exec: URI) 0.622 s 0.622 s 0.0% > savevm (qcow2 snap) 0.503 s 0.504 s +0.2% > > What I didn't measure > > I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel > side, while QEMU only handles the control plane. Hence I'd expect very > little allocator impact there, though I could be wrong. > > I also didn't test memory hotplug on ballooning cases, because, as > explained above, these are rare one-shot operations. and as the source > code trace shows malloc is not involved. But again, happy to look into > it, if it's still a concern. > Thank you for testing and the analysis! I'd still appreciate if somebody could test these, just to be sure. It is a rather core change after all. I'll also try to find some time to test around, but more eyes cannot hurt here! That said, it looks okay in general, so I'm giving a preliminary: Acked-by: Fiona Ebner <f.ebner@proxmox.com> >>> >>> Hi Fiona, >>> >>> I'm stil running in production (I have keeped tcmalloc after the >>> removal some year ago from the pve build), and I didn't notice problem. >>> (but I still don't use pbs). >>> >>> But I never have done bench with/without it since 5/6 year. > > And thanks Alexandre for sharing your production experience, that's very > valuable context. > >>> >>> Maybe vm memory should be checked too, I'm thinking about RSS memory >>> with balloon free_page_reporting, to see if it's correcting freeing >>> memory. 
> > For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path: > > virtual_balloon_handle_report() > ram_block_discard_range() > madvise(host_startaddr, kebngth, QEMU_MADV_DONTNEED) > > This operates directly on the mmap'ed guest RAM region. Still no malloc > involvement anywhere in this path. > > In short, the tests above reveal a consistent pattern: tcmalloc helps > where allocation pressure is high, and is neutral everywhere else. > Please let me know if you'd like more data on any specific workload, or > if there is anything I overlooked. I appreciate the careful review and > insights. > > >> >> Thanks! If it was running fine for you for this long, that is a very >> good data point :) Still, testing different scenarios/configurations >> would be nice to rule out that there is a (performance) regression >> somewhere else. > > cheers, > Kefu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 13:18 ` Fiona Ebner @ 2026-04-14 5:41 ` Kefu Chai 0 siblings, 0 replies; 16+ messages in thread From: Kefu Chai @ 2026-04-14 5:41 UTC (permalink / raw) To: Fiona Ebner, DERUMIER, Alexandre, pve-devel On Mon Apr 13, 2026 at 9:18 PM CST, Fiona Ebner wrote: > Am 13.04.26 um 1:12 PM schrieb Kefu Chai: >> On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: >>> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: >>>> >>>>>> How does the performance change when doing IO within a QEMU guest? >>>>>> >>>>>> How does this affect the performance for other storage types, like >>>>>> ZFS, >>>>>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>>>>> thin, >>>>>> etc. and other workloads like saving VM state during snapshot, >>>>>> transfer >>>>>> during migration, maybe memory hotplug/ballooning, network >>>>>> performance >>>>>> for vNICs? >> >> Hi Fiona, >> >> Thanks for the questions. >> >> I traced QEMU's source code. It turns out that guest's RAM is allocated >> via direct mmap() calls, which completely bypassing QEMU's C library >> allocator. The path looks like: >> >> -m 4G on command line: >> >> memory_region_init_ram_flags_nomigrate() >> qemu_ram_alloc() >> qemu_ram_alloc_internal() >> g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata >> struct, about 512 bytes >> ram_block_add() >> qemu_anon_ram_alloc() >> qemu_ram_mmap(-1, size, ...) >> mmap_reserve(total) >> mmap(0, size, PROT_NONE, ...) <-- reserve address space >> mmap_activate(ptr, size, ...) >> mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM >> >> So the gigabytes of guest memory go straight to the kernel via mmap() >> and never touch malloc or tcmalloc. Only the small RAMBlock metadata >> structure (~512 bytes per region) goes through g_malloc0(). 
>> >> In other words, tcmalloc's scope is limited to QEMU's own working >> memory, among other things, block layer buffers, coroutine stacks, >> internal data structs. It does not touch gues RAM at all. >> >> The inflat/deflate path of balloon code does involve glib malloc. It's >> virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to >> track partially-ballooned pages. But these bitmaps tracks only metadata, >> whose memory footprints are relatively small. And this only happens in >> infrequent operations. >> >> Hopefully, this addresses the balloon concern. If I've missed something >> or misread the code, please help point it out. >> >> When it comes to different storage types and workloads, I ran benchmarks >> covering some scenarios you listed. All comparison use the same binary >> (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so >> the allocator is the only variable. >> >> >> Storage backends (block layer, via qemu-img bench) >> >> 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best >> of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. >> >> backend glibc ops/s tcmalloc ops/s delta >> qcow2 on ext4 1,188,495 1,189,343 +0.1% >> qcow2 on ext4 write 1,036,914 1,036,699 0.0% >> raw on ext4 1,263,583 1,277,465 +1.1% >> raw on LVM-thin 433,727 433,576 0.0% >> >> The reason why raw/LVM-thin is slower, is that, I *think* it actually >> hits the dm-thin layer rather than page cache. The allocator delta is >> noise in all cases. >> >> RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): >> >> path glibc tcmalloc delta >> qemu-img bench + librbd 19,111 19,156 +0.2% >> rbd bench (librbd direct) 35,329 36,622 +3.7% >> >> I don't have ZFS configured on this (host) machine, but QEMU's file I/O >> path to ZFS goes through the same code as ext4, so the difference >> is in the kernel. I'd expect the same null result, >> though I could be wrong and am happy to set up a test pool if >> you'd like to see the numbers. 
>> >> Guest I/O (Debian 13 guest, virtio-blk) >> >> Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, >> qcow2 on ext4, cache=none, aio=native) against a second virtio >> disk (/dev/vdb, 8 GB qcow2). >> >> workload glibc tcmalloc delta >> dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% >> iflag=direct (sequential) >> 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% >> count=100k iflag=direct >> (each starting at a different offset) >> >> Migration and savevm >> >> Tested with a 4 GB guest where about 2 GB of RAM was dirtied >> (filled /dev/shm with urandom, 8x256 MB) before triggering each >> operation. >> >> scenario glibc tcmalloc delta >> migrate (exec: URI) 0.622 s 0.622 s 0.0% >> savevm (qcow2 snap) 0.503 s 0.504 s +0.2% >> >> What I didn't measure >> >> I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel >> side, while QEMU only handles the control plane. Hence I'd expect very >> little allocator impact there, though I could be wrong. >> >> I also didn't test memory hotplug on ballooning cases, because, as >> explained above, these are rare one-shot operations. and as the source >> code trace shows malloc is not involved. But again, happy to look into >> it, if it's still a concern. >> > > Thank you for testing and the analysis! > > I'd still appreciate if somebody could test these, just to be sure. It > is a rather core change after all. I'll also try to find some time to > test around, but more eyes cannot hurt here! > > That said, it looks okay in general, so I'm giving a preliminary: > > Acked-by: Fiona Ebner <f.ebner@proxmox.com> > Thanks for the ack, Fiona! I agree more testing can't hurt for the critical software in our stack, so I went ahead and ran through the remaining scenarios, ZFS, balloon, vNIC and snapshot, using PVE's qm tooling this time instead of bare QEMU. 
To avoid the QEMU version difference muddying the numbers, I built two
packages from the same 10.2.1 source tree: one with
--enable-malloc=tcmalloc (plus the proposed patch) and one without.
Same compiler, same patches 0001-0047; as yesterday, the allocator is
the only variable. The test VM is a Debian 13 cloud image guest (4
vCPU, 4 GB RAM, virtio-scsi-single + iothread on LVM-thin, Samsung
9100 PRO NVMe).

Guest I/O on LVM-thin (dd inside guest, best of 3)

  workload                   glibc       tcmalloc    delta
  seq 4M read (1024 x dd)    16.6 GB/s   16.6 GB/s    0.0%
  par 4K direct (8x dd)       578 MB/s    576 MB/s   -0.3%

Snapshot (qm snapshot, 2 GB of dirty guest RAM)

  glibc:    3.502 s
  tcmalloc: 3.544 s (+1.2%, noise)

Balloon (QMP balloon 4G -> 2G)

  Both correctly deflated:

  glibc:    RSS 2,462 MB -> 2,328 MB (freed 135 MB)
  tcmalloc: RSS 2,496 MB ->   707 MB (freed 1,789 MB)

  The difference in freed amount comes from different initial page
  residency, not the allocator.

Storage backends (tcmalloc, absolute numbers)

  I moved the VM disk between backends to cover ZFS and
  qcow2-on-directory. Everything worked without issues:

  backend                     seq read    par 4K read
  LVM-thin (raw, NVMe)        17.5 GB/s   670 MB/s
  ZFS (zvol, file-backed)     11.3 GB/s   696 MB/s
  qcow2 on ext4 (directory)   19.6 GB/s   763 MB/s

  ZFS is slower because the pool is file-backed (double indirection),
  not because of the allocator. The block-layer A/B from my earlier
  mail already showed the allocator is neutral across backends, so I
  didn't repeat the full A/B for each one.

vNIC (iperf3 over vmbr0, 20 rounds each)

  This one needed more rounds to get a stable picture -- early attempts
  with 5 rounds looked misleadingly noisy.

                        glibc        tcmalloc     delta
  host->guest mean      132.2 Gbps   132.9 Gbps   +0.6%
  host->guest stddev    17.4         20.6
  guest->host mean       69.6 Gbps    72.0 Gbps   +3.5%
  guest->host stddev    17.1         15.0

  Welch's t-test: t=0.12 (h2g), t=0.48 (g2h), neither significant at
  p=0.05. The stddev (~17 Gbps) dwarfs the mean difference, so there's
  no detectable allocator effect here.
The hot path is in vhost-net on the kernel side, so this is pretty
much what I'd expect.

Overall:

workload           delta       method
guest seq I/O      0.0%        same-version A/B
guest par 4K I/O   -0.3%       same-version A/B (noise)
qm snapshot        +1.2%       same-version A/B (noise)
balloon            OK          both deflate correctly
vNIC               +0.6/+3.5%  20 rounds, not significant
block layer (all)  0 to +1%    LD_PRELOAD, same binary
RBD (librbd)       +3.7%       LD_PRELOAD, same binary

No regressions on any tested path. Together with Alexandre's
production fio numbers (+31.6% on 4K randread on a real cluster), I
think we have a more solid picture now.

Please let me know if there's anything else I should look into, or if
you'd like me to re-run something with different parameters!

cheers,
Kefu

>>>>
>>>> Hi Fiona,
>>>>
>>>> I'm still running it in production (I have kept tcmalloc after its
>>>> removal from the pve build some years ago), and I didn't notice any
>>>> problem (but I still don't use pbs).
>>>>
>>>> But I haven't done a bench with/without it in the last 5-6 years.
>>
>> And thanks Alexandre for sharing your production experience, that's
>> very valuable context.
>>
>>>>
>>>> Maybe vm memory should be checked too, I'm thinking about RSS memory
>>>> with balloon free_page_reporting, to see if it's correctly freeing
>>>> memory.
>>
>> For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path:
>>
>>   virtio_balloon_handle_report()
>>     ram_block_discard_range()
>>       madvise(host_startaddr, length, QEMU_MADV_DONTNEED)
>>
>> This operates directly on the mmap'ed guest RAM region. Still no
>> malloc involvement anywhere in this path.
>>
>> In short, the tests above reveal a consistent pattern: tcmalloc
>> helps where allocation pressure is high, and is neutral everywhere
>> else. Please let me know if you'd like more data on any specific
>> workload, or if there is anything I overlooked. I appreciate the
>> careful review and insights.
>>
>>
>>>
>>> Thanks!
>>> If it was running fine for you for this long, that is a very
>>> good data point :) Still, testing different scenarios/configurations
>>> would be nice to rule out that there is a (performance) regression
>>> somewhere else.
>>
>> cheers,
>> Kefu

^ permalink raw reply	[flat|nested] 16+ messages in thread
* superseded: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
                   ` (2 preceding siblings ...)
  2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
@ 2026-04-14  5:48 ` Kefu Chai
  3 siblings, 0 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-14 5:48 UTC (permalink / raw)
To: pve-devel

superseded by
https://lore.proxmox.com/pve-devel/20260414054645.405151-1-k.chai@proxmox.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2026-04-14 8:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
2026-04-13 13:12 ` Fiona Ebner
2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
2026-04-13 13:13 ` Fiona Ebner
2026-04-13 13:23 ` DERUMIER, Alexandre
2026-04-13 13:40 ` Fabian Grünbichler
2026-04-14  8:10 ` Fiona Ebner
2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2026-04-10 10:45 ` DERUMIER, Alexandre
2026-04-13  8:14 ` Fiona Ebner
2026-04-13 11:13 ` Kefu Chai
2026-04-13 13:14 ` DERUMIER, Alexandre
2026-04-13 13:18 ` Fiona Ebner
2026-04-14  5:41 ` Kefu Chai
2026-04-14  5:48 ` superseded: " Kefu Chai