* [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
@ 2026-04-10 4:30 Kefu Chai
2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw)
To: pve-devel
Following up on the RFC thread [0], here's the formal submission to
re-enable tcmalloc for pve-qemu.
Quick recap: librbd's I/O path allocates a lot of small, short-lived
objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
arena contention and cache-line bouncing show up clearly in perf
profiles. tcmalloc's per-thread fast path avoids both.
A bit of history for context: tcmalloc was tried in 2015 but dropped
after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
jemalloc replaced it but was dropped in 2020 because it didn't
release Rust-allocated memory (from proxmox-backup-qemu) back to the
OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
reclamation gap explicitly.
On Dietmar's two concerns from the RFC:
"Could ReleaseFreeMemory() halt the application?" -- No, and I
verified this directly. It walks tcmalloc's page heap free span
lists and calls madvise(MADV_DONTNEED) on each span. It does not
walk allocated memory or compact the heap. A standalone test
reclaimed 386 MB of 410 MB of cached memory (94%) in effectively zero
wall time. The call runs once at backup completion, in the same spot
where malloc_trim runs today.
"Wouldn't a pool allocator in librbd be the proper fix?" -- In
principle yes, but I audited librbd in Ceph squid and it does NOT
use a pool allocator -- all I/O path objects go through plain new.
Ceph's mempool is tracking-only, not actual pooling. Adding real
pooling would be a significant Ceph-side change (submission and
completion happen on different threads), and it's orthogonal to the
allocator choice here.
Also thanks to Alexandre for confirming the 2015 gperftools issues
are long resolved.
Test results
------------
Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
This is the worst case for showing allocator impact, since there's
no network latency for CPU savings to amortize against:
rbd bench --io-type read --io-size 4096 --io-threads 16 \
--io-pattern rand
Metric | glibc ptmalloc2 | tcmalloc | Delta
---------------+-----------------+-----------+--------
IOPS | 131,201 | 136,389 | +4.0%
CPU time | 1,556 ms | 1,439 ms | -7.5%
Cycles | 6.74B | 6.06B | -10.1%
Cache misses | 137.1M | 123.9M | -9.6%
perf report on the glibc run shows ~8% of CPU in allocator internals
(_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
symbols are barely visible with tcmalloc because the fast path is
just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
production clusters where network RTT dominates per-I/O latency --
the 10% CPU savings compound there since the host can push more I/O
into the pipeline during the same wall time.
The series is small:
1/2 adds the QEMU source patch (0048) with the CONFIG_TCMALLOC
meson define and the ReleaseFreeMemory() call in
pve-backup.c's cleanup path.
2/2 adds libgoogle-perftools-dev to Build-Depends and
--enable-malloc=tcmalloc to configure.
Runtime dep libgoogle-perftools4t64 (>= 2.16) is picked up
automatically by dh_shlibdeps.
[0]: https://lore.proxmox.com/pve-devel/DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com/
[1]: https://ceph.io/en/news/blog/2023/reef-freelist-bench/
Kefu Chai (2):
PVE: use tcmalloc as the memory allocator
d/rules: enable tcmalloc as the memory allocator
debian/control | 1 +
...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
debian/patches/series | 1 +
debian/rules | 1 +
4 files changed, 80 insertions(+)
create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch
--
2.47.3
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai @ 2026-04-10 4:30 ` Kefu Chai 2026-04-13 13:12 ` Fiona Ebner 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai ` (2 subsequent siblings) 3 siblings, 1 reply; 16+ messages in thread From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw) To: pve-devel Add allocator-aware memory release in the backup completion path: since tcmalloc does not provide glibc's malloc_trim(), use the tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead. This function walks tcmalloc's page heap free span lists and calls madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact the heap, so latency impact is negligible. Also adds a CONFIG_TCMALLOC meson define so the conditional compilation in pve-backup.c can detect the allocator choice. Signed-off-by: Kefu Chai <k.chai@proxmox.com> --- ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++ debian/patches/series | 1 + 2 files changed, 78 insertions(+) create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch new file mode 100644 index 0000000..719d522 --- /dev/null +++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch @@ -0,0 +1,77 @@ +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 +From: Kefu Chai <k.chai@proxmox.com> +Date: Thu, 9 Apr 2026 17:29:10 +0800 +Subject: [PATCH] PVE: use tcmalloc as the memory allocator + +Use tcmalloc (from gperftools) as the memory allocator for improved +performance with workloads that create many small, short-lived +allocations -- particularly Ceph/librbd I/O paths. 
+ +tcmalloc uses per-thread caches and size-class freelists that handle +this allocation pattern more efficiently than glibc's allocator. Ceph +benchmarks show ~50% IOPS improvement on 16KB random reads. + +Since tcmalloc does not provide glibc's malloc_trim(), use the +tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release +cached memory back to the OS after backup completion. This function +walks tcmalloc's page heap free span lists and calls +madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact +the heap, so latency impact is negligible. + +Historical context: +- tcmalloc was originally enabled in 2015 but removed due to + performance issues with gperftools 2.2's default settings (low + TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit + disabled). These issues were resolved in gperftools 2.4+. +- jemalloc replaced tcmalloc but was removed in 2020 because it didn't + release memory allocated from Rust (proxmox-backup-qemu) back to the + OS. The allocator-specific release API addresses this. +- PVE 9 ships gperftools 2.16, so the old tuning issues are moot. 
+ +Signed-off-by: Kefu Chai <k.chai@proxmox.com> +--- + meson.build | 1 + + pve-backup.c | 8 ++++++++ + 2 files changed, 9 insertions(+) + +diff --git a/meson.build b/meson.build +index 0b28d2ec39..c6de2464d6 100644 +--- a/meson.build ++++ b/meson.build +@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found()) + config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found()) + config_host_data.set('CONFIG_HOGWEED', hogweed.found()) + config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim) ++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc') + config_host_data.set('CONFIG_ZSTD', zstd.found()) + config_host_data.set('CONFIG_QPL', qpl.found()) + config_host_data.set('CONFIG_UADK', uadk.found()) +diff --git a/pve-backup.c b/pve-backup.c +index ad0f8668fd..d5556f152b 100644 +--- a/pve-backup.c ++++ b/pve-backup.c +@@ -19,6 +19,8 @@ + + #if defined(CONFIG_MALLOC_TRIM) + #include <malloc.h> ++#elif defined(CONFIG_TCMALLOC) ++#include <gperftools/malloc_extension_c.h> + #endif + + #include <proxmox-backup-qemu.h> +@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void) + * Won't happen by default if there is fragmentation. + */ + malloc_trim(4 * 1024 * 1024); ++#elif defined(CONFIG_TCMALLOC) ++ /* ++ * Release free memory from tcmalloc's page cache back to the OS. This is ++ * allocator-aware and efficiently returns cached spans via madvise(). 
++ */ ++ MallocExtension_ReleaseFreeMemory(); + #endif + } + +-- +2.47.3 + diff --git a/debian/patches/series b/debian/patches/series index 8ed0c52..468df6c 100644 --- a/debian/patches/series +++ b/debian/patches/series @@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch +pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch -- 2.47.3 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 1/2] PVE: use tcmalloc as the memory allocator 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai @ 2026-04-13 13:12 ` Fiona Ebner 0 siblings, 0 replies; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:12 UTC (permalink / raw) To: Kefu Chai, pve-devel It's preparation for using tcmalloc, not actually using it. I'd prefer a title like "add patch to support using tcmalloc as the memory allocator". Am 10.04.26 um 6:30 AM schrieb Kefu Chai: > Add allocator-aware memory release in the backup completion path: > since tcmalloc does not provide glibc's malloc_trim(), use the > tcmalloc-specific MallocExtension_ReleaseFreeMemory() API instead. > This function walks tcmalloc's page heap free span lists and calls > madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact > the heap, so latency impact is negligible. > > Also adds a CONFIG_TCMALLOC meson define so the conditional compilation > in pve-backup.c can detect the allocator choice. > > Signed-off-by: Kefu Chai <k.chai@proxmox.com> > --- > ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++ > debian/patches/series | 1 + > 2 files changed, 78 insertions(+) > create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > > diff --git a/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > new file mode 100644 > index 0000000..719d522 > --- /dev/null > +++ b/debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch > @@ -0,0 +1,77 @@ > +From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001 > +From: Kefu Chai <k.chai@proxmox.com> > +Date: Thu, 9 Apr 2026 17:29:10 +0800 > +Subject: [PATCH] PVE: use tcmalloc as the memory allocator Similar here. Also, it is specific to backup, I'd go for something like "PVE-Backup: support using tcmalloc as the memory allocator". 
> + > +Use tcmalloc (from gperftools) as the memory allocator for improved > +performance with workloads that create many small, short-lived > +allocations -- particularly Ceph/librbd I/O paths. > + > +tcmalloc uses per-thread caches and size-class freelists that handle > +this allocation pattern more efficiently than glibc's allocator. Ceph > +benchmarks show ~50% IOPS improvement on 16KB random reads. > + > +Since tcmalloc does not provide glibc's malloc_trim(), use the > +tcmalloc-specific MallocExtension_ReleaseFreeMemory() API to release > +cached memory back to the OS after backup completion. This function > +walks tcmalloc's page heap free span lists and calls > +madvise(MADV_DONTNEED) -- it does not walk allocated memory or compact > +the heap, so latency impact is negligible. > + > +Historical context: > +- tcmalloc was originally enabled in 2015 but removed due to > + performance issues with gperftools 2.2's default settings (low > + TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and aggressive decommit > + disabled). These issues were resolved in gperftools 2.4+. > +- jemalloc replaced tcmalloc but was removed in 2020 because it didn't > + release memory allocated from Rust (proxmox-backup-qemu) back to the > + OS. The allocator-specific release API addresses this. > +- PVE 9 ships gperftools 2.16, so the old tuning issues are moot. 
> + > +Signed-off-by: Kefu Chai <k.chai@proxmox.com> > +--- > + meson.build | 1 + > + pve-backup.c | 8 ++++++++ > + 2 files changed, 9 insertions(+) > + > +diff --git a/meson.build b/meson.build > +index 0b28d2ec39..c6de2464d6 100644 > +--- a/meson.build > ++++ b/meson.build > +@@ -2567,6 +2567,7 @@ config_host_data.set('CONFIG_CRYPTO_SM4', crypto_sm4.found()) > + config_host_data.set('CONFIG_CRYPTO_SM3', crypto_sm3.found()) > + config_host_data.set('CONFIG_HOGWEED', hogweed.found()) > + config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim) > ++config_host_data.set('CONFIG_TCMALLOC', get_option('malloc') == 'tcmalloc') > + config_host_data.set('CONFIG_ZSTD', zstd.found()) > + config_host_data.set('CONFIG_QPL', qpl.found()) > + config_host_data.set('CONFIG_UADK', uadk.found()) > +diff --git a/pve-backup.c b/pve-backup.c > +index ad0f8668fd..d5556f152b 100644 > +--- a/pve-backup.c > ++++ b/pve-backup.c > +@@ -19,6 +19,8 @@ > + > + #if defined(CONFIG_MALLOC_TRIM) > + #include <malloc.h> > ++#elif defined(CONFIG_TCMALLOC) > ++#include <gperftools/malloc_extension_c.h> > + #endif > + > + #include <proxmox-backup-qemu.h> > +@@ -303,6 +305,12 @@ static void coroutine_fn pvebackup_co_cleanup(void) > + * Won't happen by default if there is fragmentation. > + */ > + malloc_trim(4 * 1024 * 1024); > ++#elif defined(CONFIG_TCMALLOC) > ++ /* > ++ * Release free memory from tcmalloc's page cache back to the OS. This is > ++ * allocator-aware and efficiently returns cached spans via madvise(). 
> ++ */ > ++ MallocExtension_ReleaseFreeMemory(); > + #endif > + } > + > +-- > +2.47.3 > + > diff --git a/debian/patches/series b/debian/patches/series > index 8ed0c52..468df6c 100644 > --- a/debian/patches/series > +++ b/debian/patches/series > @@ -81,3 +81,4 @@ pve/0044-PVE-backup-get-device-info-allow-caller-to-specify-f.patch > pve/0045-PVE-backup-implement-backup-access-setup-and-teardow.patch > pve/0046-PVE-backup-prepare-for-the-switch-to-using-blockdev-.patch > pve/0047-savevm-async-reuse-migration-blocker-check-for-snaps.patch > +pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai @ 2026-04-10 4:30 ` Kefu Chai 2026-04-13 13:13 ` Fiona Ebner 2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner 2026-04-14 5:48 ` superseded: " Kefu Chai 3 siblings, 1 reply; 16+ messages in thread From: Kefu Chai @ 2026-04-10 4:30 UTC (permalink / raw) To: pve-devel Use tcmalloc (from gperftools) instead of glibc's allocator for improved performance with workloads that create many small, short-lived allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks show ~50% IOPS improvement on 16KB random reads. tcmalloc was originally used in 2015 but removed due to tuning issues with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues are long resolved. Signed-off-by: Kefu Chai <k.chai@proxmox.com> --- debian/control | 1 + debian/rules | 1 + 2 files changed, 2 insertions(+) diff --git a/debian/control b/debian/control index 81cc026..a3121e1 100644 --- a/debian/control +++ b/debian/control @@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13), libfuse3-dev, libgbm-dev, libgnutls28-dev, + libgoogle-perftools-dev, libiscsi-dev (>= 1.12.0), libjpeg-dev, libjson-perl, diff --git a/debian/rules b/debian/rules index c90db29..a63e3a5 100755 --- a/debian/rules +++ b/debian/rules @@ -70,6 +70,7 @@ endif --enable-libusb \ --enable-linux-aio \ --enable-linux-io-uring \ + --enable-malloc=tcmalloc \ --enable-numa \ --enable-opengl \ --enable-rbd \ -- 2.47.3 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai @ 2026-04-13 13:13 ` Fiona Ebner 2026-04-13 13:23 ` DERUMIER, Alexandre 0 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:13 UTC (permalink / raw) To: Kefu Chai, pve-devel Am 10.04.26 um 6:29 AM schrieb Kefu Chai: > Use tcmalloc (from gperftools) instead of glibc's allocator for > improved performance with workloads that create many small, short-lived > allocations -- particularly Ceph/librbd I/O paths. Ceph benchmarks > show ~50% IOPS improvement on 16KB random reads. > > tcmalloc was originally used in 2015 but removed due to tuning issues > with gperftools 2.2. PVE 9 ships gperftools 2.16, where those issues > are long resolved. > > Signed-off-by: Kefu Chai <k.chai@proxmox.com> > --- > debian/control | 1 + > debian/rules | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/debian/control b/debian/control > index 81cc026..a3121e1 100644 > --- a/debian/control > +++ b/debian/control > @@ -14,6 +14,7 @@ Build-Depends: debhelper-compat (= 13), > libfuse3-dev, > libgbm-dev, > libgnutls28-dev, > + libgoogle-perftools-dev, > libiscsi-dev (>= 1.12.0), > libjpeg-dev, > libjson-perl, Please also add the runtime dependency to Depends: > diff --git a/debian/rules b/debian/rules > index c90db29..a63e3a5 100755 > --- a/debian/rules > +++ b/debian/rules > @@ -70,6 +70,7 @@ endif > --enable-libusb \ > --enable-linux-aio \ > --enable-linux-io-uring \ > + --enable-malloc=tcmalloc \ > --enable-numa \ > --enable-opengl \ > --enable-rbd \ ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:13 ` Fiona Ebner @ 2026-04-13 13:23 ` DERUMIER, Alexandre 2026-04-13 13:40 ` Fabian Grünbichler 0 siblings, 1 reply; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-13 13:23 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai >>Please also add the runtime dependency to Depends: I was going to say the same, it should depend on "libgoogle-perftools4t64" ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:23 ` DERUMIER, Alexandre @ 2026-04-13 13:40 ` Fabian Grünbichler 2026-04-14 8:10 ` Fiona Ebner 0 siblings, 1 reply; 16+ messages in thread From: Fabian Grünbichler @ 2026-04-13 13:40 UTC (permalink / raw) To: DERUMIER, Alexandre; +Cc: pve-devel Quoting DERUMIER, Alexandre (2026-04-13 15:23:28) > >>Please also add the runtime dependency to Depends: > > I was going to say the same, > > it should depend on "libgoogle-perftools4t64" that is not needed. in general, for properly packaged libraries it is enough to build-depend on the -dev package, and have ${shlibs:Depends} in the binary package Depends (with this series applied and built): new Debian package, version 2.0. size 32266824 bytes: control archive=13756 bytes. 38 bytes, 2 lines conffiles 1732 bytes, 18 lines control 40920 bytes, 494 lines md5sums Package: pve-qemu-kvm Version: 10.2.1-1 Architecture: amd64 Maintainer: Proxmox Support Team <support@proxmox.com> Installed-Size: 422145 Depends: ceph-common (>= 0.48), fuse3, iproute2, libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16), libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), libusbredirparser1t64 (>= 0.8.0), libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 
1:1.2.0) Recommends: numactl Suggests: libgl1 Conflicts: kvm, pve-kvm, pve-qemu-kvm-2.6.18, qemu, qemu-kvm, qemu-system-arm, qemu-system-common, qemu-system-data, qemu-system-x86, qemu-utils Breaks: qemu-server (<= 8.0.6) Replaces: pve-kvm, pve-qemu-kvm-2.6.18, qemu-system-arm, qemu-system-x86, qemu-utils Provides: qemu-system-arm, qemu-system-x86, qemu-utils Section: admin Priority: optional Description: Full virtualization on x86 hardware Using KVM, one can run multiple virtual PCs, each running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc. if the library is packaged using symbol versioning, it even adds the proper lower version bound ;) in fact, with the following: diff --git a/debian/control b/debian/control index a3121e1..0a3ffbd 100644 --- a/debian/control +++ b/debian/control @@ -49,12 +49,6 @@ Architecture: any Depends: ceph-common (>= 0.48), fuse3, iproute2, - libiscsi4 (>= 1.12.0) | libiscsi7, - libjpeg62-turbo, - libspice-server1 (>= 0.14.0~), - libusb-1.0-0 (>= 1.0.17-1), - libusbredirparser1 (>= 0.6-2), - libuuid1, ${misc:Depends}, ${shlibs:Depends}, Recommends: numactl, we get even more accurate Depends: and nicer sorting, and less work in case one of those libraries goes through a transition in Forky: $ debdiff ... 
File lists identical (after any substitutions) Control files: lines which differ (wdiff format) ------------------------------------------------ Depends: ceph-common (>= 0.48), fuse3, iproute2, [-libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16),-] libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), {+libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1),+} libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), {+libspice-server1 (>= 0.14.2),+} libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), {+libusb-1.0-0 (>= 2:1.0.23~),+} libusbredirparser1t64 (>= 0.8.0), {+libuuid1 (>= 2.16),+} libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 2/2] d/rules: enable tcmalloc as the memory allocator 2026-04-13 13:40 ` Fabian Grünbichler @ 2026-04-14 8:10 ` Fiona Ebner 0 siblings, 0 replies; 16+ messages in thread From: Fiona Ebner @ 2026-04-14 8:10 UTC (permalink / raw) To: Fabian Grünbichler, DERUMIER, Alexandre; +Cc: pve-devel Am 13.04.26 um 3:39 PM schrieb Fabian Grünbichler: > Quoting DERUMIER, Alexandre (2026-04-13 15:23:28) >>>> Please also add the runtime dependency to Depends: >> >> I was going to say the same, >> >> it should depend on "libgoogle-perftools4t64" > > that is not needed. in general, for properly packaged libraries it is enough to > build-depend on the -dev package, and have ${shlibs:Depends} in the binary > package Depends (with this series applied and built): > > new Debian package, version 2.0. > size 32266824 bytes: control archive=13756 bytes. > 38 bytes, 2 lines conffiles > 1732 bytes, 18 lines control > 40920 bytes, 494 lines md5sums > Package: pve-qemu-kvm > Version: 10.2.1-1 > Architecture: amd64 > Maintainer: Proxmox Support Team <support@proxmox.com> > Installed-Size: 422145 > Depends: ceph-common (>= 0.48), fuse3, iproute2, libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16), libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), libusbredirparser1t64 (>= 0.8.0), libvirglrenderer1 (>= 1.0.0), 
libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) > Recommends: numactl > Suggests: libgl1 > Conflicts: kvm, pve-kvm, pve-qemu-kvm-2.6.18, qemu, qemu-kvm, qemu-system-arm, qemu-system-common, qemu-system-data, qemu-system-x86, qemu-utils > Breaks: qemu-server (<= 8.0.6) > Replaces: pve-kvm, pve-qemu-kvm-2.6.18, qemu-system-arm, qemu-system-x86, qemu-utils > Provides: qemu-system-arm, qemu-system-x86, qemu-utils > Section: admin > Priority: optional > Description: Full virtualization on x86 hardware > Using KVM, one can run multiple virtual PCs, each running unmodified Linux or > Windows images. Each virtual machine has private virtualized hardware: a > network card, disk, graphics adapter, etc. > > if the library is packaged using symbol versioning, it even adds the proper lower version bound ;) > > in fact, with the following: > > diff --git a/debian/control b/debian/control > index a3121e1..0a3ffbd 100644 > --- a/debian/control > +++ b/debian/control > @@ -49,12 +49,6 @@ Architecture: any > Depends: ceph-common (>= 0.48), > fuse3, > iproute2, > - libiscsi4 (>= 1.12.0) | libiscsi7, > - libjpeg62-turbo, > - libspice-server1 (>= 0.14.0~), > - libusb-1.0-0 (>= 1.0.17-1), > - libusbredirparser1 (>= 0.6-2), > - libuuid1, > ${misc:Depends}, > ${shlibs:Depends}, > Recommends: numactl, > > we get even more accurate Depends: and nicer sorting, and less work in case one > of those libraries goes through a transition in Forky: > > $ debdiff ... 
> File lists identical (after any substitutions) > > Control files: lines which differ (wdiff format) > ------------------------------------------------ > Depends: ceph-common (>= 0.48), fuse3, iproute2, [-libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1), libspice-server1 (>= 0.14.2), libusb-1.0-0 (>= 2:1.0.23~), libusbredirparser1 (>= 0.6-2), libuuid1 (>= 2.16),-] libaio1t64 (>= 0.3.93), libasound2t64 (>= 1.0.16), libc6 (>= 2.38), libcap-ng0 (>= 0.7.9), libcurl3t64-gnutls (>= 7.16.3), libepoxy0 (>= 1.5.2), libfdt1 (>= 1.7.2), libfuse3-4 (>= 3.17.2), libgbm1 (>= 12.0.0~0), libgcc-s1 (>= 3.0), libglib2.0-0t64 (>= 2.83.0), libgnutls30t64 (>= 3.8.6), libgoogle-perftools4t64 (>= 2.16), {+libiscsi7 (>= 1.18.0), libjpeg62-turbo (>= 1.3.1),+} libnuma1 (>= 2.0.15), libpixman-1-0 (>= 0.30.0), libproxmox-backup-qemu0 (>= 1.3.0), libpulse0 (>= 0.99.1), librados2 (>= 19.2.3), librbd1 (>= 19.2.3), libseccomp2 (>= 2.1.0), libselinux1 (>= 3.1~), libslirp0 (>= 4.7.0), libsndio7.0 (>= 1.8.1), {+libspice-server1 (>= 0.14.2),+} libsystemd0, libudev1 (>= 183), liburing2 (>= 2.3), {+libusb-1.0-0 (>= 2:1.0.23~),+} libusbredirparser1t64 (>= 0.8.0), {+libuuid1 (>= 2.16),+} libvirglrenderer1 (>= 1.0.0), libxkbcommon0 (>= 0.5.0), libzstd1 (>= 1.5.5), zlib1g (>= 1:1.2.0) Ah, good to know! Let's do that then :) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai 2026-04-10 4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai @ 2026-04-10 8:12 ` Fiona Ebner 2026-04-10 10:45 ` DERUMIER, Alexandre 2026-04-14 5:48 ` superseded: " Kefu Chai 3 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-10 8:12 UTC (permalink / raw) To: Kefu Chai, pve-devel Am 10.04.26 um 6:30 AM schrieb Kefu Chai: > Following up on the RFC thread [0], here's the formal submission to > re-enable tcmalloc for pve-qemu. > > Quick recap: librbd's I/O path allocates a lot of small, short-lived > objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.), > and glibc's ptmalloc2 handles this pattern poorly -- cross-thread > arena contention and cache-line bouncing show up clearly in perf > profiles. tcmalloc's per-thread fast path avoids both. > > A bit of history for context: tcmalloc was tried in 2015 but dropped > after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+). > jemalloc replaced it but was dropped in 2020 because it didn't > release Rust-allocated memory (from proxmox-backup-qemu) back to the > OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the > reclamation gap explicitly. > > On Dietmar's two concerns from the RFC: > > "Could ReleaseFreeMemory() halt the application?" -- No, and I > verified this directly. It walks tcmalloc's page heap free span > lists and calls madvise(MADV_DONTNEED) on each span. It does not > walk allocated memory or compact the heap. A standalone test > reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero > wall time. The call runs once at backup completion, same spot where > malloc_trim runs today. > > "Wouldn't a pool allocator in librbd be the proper fix?" 
-- In > principle yes, but I audited librbd in Ceph squid and it does NOT > use a pool allocator -- all I/O path objects go through plain new. > Ceph's mempool is tracking-only, not actual pooling. Adding real > pooling would be a significant Ceph-side change (submission and > completion happen on different threads), and it's orthogonal to the > allocator choice here. > > Also thanks to Alexandre for confirming the 2015 gperftools issues > are long resolved. > > Test results > ------------ > > Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe). > This is the worst case for showing allocator impact, since there's > no network latency for CPU savings to amortize against: > > rbd bench --io-type read --io-size 4096 --io-threads 16 \ > --io-pattern rand > > Metric | glibc ptmalloc2 | tcmalloc | Delta > ---------------+-----------------+-----------+-------- > IOPS | 131,201 | 136,389 | +4.0% > CPU time | 1,556 ms | 1,439 ms | -7.5% > Cycles | 6.74B | 6.06B | -10.1% > Cache misses | 137.1M | 123.9M | -9.6% > > perf report on the glibc run shows ~8% of CPU in allocator internals > (_int_malloc, cfree, malloc_consolidate, _int_free_*); the same > symbols are barely visible with tcmalloc because the fast path is > just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on > production clusters where network RTT dominates per-I/O latency -- > the 10% CPU savings compound there since the host can push more I/O > into the pipeline during the same wall time. How does the performance change when doing IO within a QEMU guest? How does this affect the performance for other storage types, like ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin, etc. and other workloads like saving VM state during snapshot, transfer during migration, maybe memory hotplug/ballooning, network performance for vNICs? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner @ 2026-04-10 10:45 ` DERUMIER, Alexandre 2026-04-13 8:14 ` Fiona Ebner 0 siblings, 1 reply; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-10 10:45 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai >>How does the performance change when doing IO within a QEMU guest? >> >>How does this affect the performance for other storage types, like >>ZFS, >>qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>thin, >>etc. and other workloads like saving VM state during snapshot, >>transfer >>during migration, maybe memory hotplug/ballooning, network >>performance >>for vNICs? Hi Fiona, I'm stil running in production (I have keeped tcmalloc after the removal some year ago from the pve build), and I didn't notice problem. (but I still don't use pbs). But I never have done bench with/without it since 5/6 year. Maybe vm memory should be checked too, I'm thinking about RSS memory with balloon free_page_reporting, to see if it's correcting freeing memory. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-10 10:45 ` DERUMIER, Alexandre @ 2026-04-13 8:14 ` Fiona Ebner 2026-04-13 11:13 ` Kefu Chai 0 siblings, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 8:14 UTC (permalink / raw) To: DERUMIER, Alexandre, pve-devel, k.chai Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: > >>> How does the performance change when doing IO within a QEMU guest? >>> >>> How does this affect the performance for other storage types, like >>> ZFS, >>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>> thin, >>> etc. and other workloads like saving VM state during snapshot, >>> transfer >>> during migration, maybe memory hotplug/ballooning, network >>> performance >>> for vNICs? > > Hi Fiona, > > I'm stil running in production (I have keeped tcmalloc after the > removal some year ago from the pve build), and I didn't notice problem. > (but I still don't use pbs). > > But I never have done bench with/without it since 5/6 year. > > Maybe vm memory should be checked too, I'm thinking about RSS memory > with balloon free_page_reporting, to see if it's correcting freeing > memory. Thanks! If it was running fine for you for this long, that is a very good data point :) Still, testing different scenarios/configurations would be nice to rule out that there is a (performance) regression somewhere else. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-13  8:14 ` Fiona Ebner
@ 2026-04-13 11:13   ` Kefu Chai
  2026-04-13 13:14     ` DERUMIER, Alexandre
  2026-04-13 13:18     ` Fiona Ebner
  0 siblings, 2 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-13 11:13 UTC (permalink / raw)
To: Fiona Ebner, DERUMIER, Alexandre, pve-devel

On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote:
> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre:
>>
>>>> How does the performance change when doing IO within a QEMU guest?
>>>>
>>>> How does this affect the performance for other storage types, like
>>>> ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM,
>>>> LVM-thin, etc. and other workloads like saving VM state during
>>>> snapshot, transfer during migration, maybe memory hotplug/ballooning,
>>>> network performance for vNICs?

Hi Fiona,

Thanks for the questions.

I traced QEMU's source code. It turns out that guest RAM is allocated
via direct mmap() calls, which completely bypass QEMU's C library
allocator. The path looks like:

-m 4G on the command line:

  memory_region_init_ram_flags_nomigrate()
    qemu_ram_alloc()
      qemu_ram_alloc_internal()
        g_malloc0(sizeof(*new_block))  <-- only the RAMBlock metadata
                                           struct, about 512 bytes
        ram_block_add()
          qemu_anon_ram_alloc()
            qemu_ram_mmap(-1, size, ...)
              mmap_reserve(total)
                mmap(0, size, PROT_NONE, ...)  <-- reserve address space
              mmap_activate(ptr, size, ...)
                mmap(ptr, size, PROT_RW, ...)  <-- actual guest RAM

So the gigabytes of guest memory go straight to the kernel via mmap()
and never touch malloc or tcmalloc. Only the small RAMBlock metadata
structure (~512 bytes per region) goes through g_malloc0().

In other words, tcmalloc's scope is limited to QEMU's own working
memory: block layer buffers, coroutine stacks, internal data
structures, and so on. It does not touch guest RAM at all.

The inflate/deflate path of the balloon code does involve glib malloc.
It's virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0()
to track partially-ballooned pages. But these bitmaps track only
metadata, their memory footprint is small, and they are only allocated
during infrequent operations.

Hopefully this addresses the balloon concern. If I've missed something
or misread the code, please point it out.

As for different storage types and workloads, I ran benchmarks covering
some of the scenarios you listed. All comparisons use the same binary
(pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4,
so the allocator is the only variable.

Storage backends (block layer, via qemu-img bench)

  4K reads, depth=32, io_uring, cache=none, 5M ops per run, best of 3.
  Host NVMe (ext4) or LVM-thin backed by NVMe.

  backend              glibc ops/s   tcmalloc ops/s   delta
  qcow2 on ext4          1,188,495        1,189,343   +0.1%
  qcow2 on ext4 write    1,036,914        1,036,699    0.0%
  raw on ext4            1,263,583        1,277,465   +1.1%
  raw on LVM-thin          433,727          433,576    0.0%

  raw/LVM-thin is slower because, I *think*, it actually hits the
  dm-thin layer rather than the page cache. The allocator delta is
  noise in all cases.

RBD, tested via a local vstart cluster (3 OSDs, bluestore on NVMe):

  path                        glibc   tcmalloc   delta
  qemu-img bench + librbd    19,111     19,156   +0.2%
  rbd bench (librbd direct)  35,329     36,622   +3.7%

I don't have ZFS configured on this (host) machine, but QEMU's file I/O
path to ZFS goes through the same code as ext4, so the difference is in
the kernel. I'd expect the same null result, though I could be wrong
and am happy to set up a test pool if you'd like to see the numbers.

Guest I/O (Debian 13 guest, virtio-blk)

  Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, qcow2
  on ext4, cache=none, aio=native) against a second virtio disk
  (/dev/vdb, 8 GB qcow2).
  workload                                glibc      tcmalloc    delta
  dd if=/dev/vdb bs=4M count=1024        15.3 GB/s   15.5 GB/s    +1.3%
    iflag=direct (sequential)
  8x parallel dd if=/dev/vdb bs=4k        787 MB/s    870 MB/s   +10.5%
    count=100k iflag=direct
    (each starting at a different offset)

Migration and savevm

  Tested with a 4 GB guest where about 2 GB of RAM was dirtied (filled
  /dev/shm with urandom, 8x256 MB) before triggering each operation.

  scenario               glibc     tcmalloc   delta
  migrate (exec: URI)    0.622 s    0.622 s    0.0%
  savevm (qcow2 snap)    0.503 s    0.504 s   +0.2%

What I didn't measure

I didn't test ZFS/vNIC throughput, as the hot path lives on the kernel
side while QEMU only handles the control plane. Hence I'd expect very
little allocator impact there, though I could be wrong.

I also didn't test memory hotplug or ballooning cases because, as
explained above, these are rare one-shot operations, and as the source
code trace shows, malloc is not involved. But again, happy to look into
it if it's still a concern.

>>
>> Hi Fiona,
>>
>> I'm stil running in production (I have keeped tcmalloc after the
>> removal some year ago from the pve build), and I didn't notice problem.
>> (but I still don't use pbs).
>>
>> But I never have done bench with/without it since 5/6 year.

And thanks Alexandre for sharing your production experience, that's
very valuable context.

>>
>> Maybe vm memory should be checked too, I'm thinking about RSS memory
>> with balloon free_page_reporting, to see if it's correcting freeing
>> memory.

For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path is:

  virtio_balloon_handle_report()
    ram_block_discard_range()
      madvise(host_startaddr, length, QEMU_MADV_DONTNEED)

This operates directly on the mmap'ed guest RAM region. Still no malloc
involvement anywhere in this path.

In short, the tests above show a consistent pattern: tcmalloc helps
where allocation pressure is high and is neutral everywhere else.
Please let me know if you'd like more data on any specific workload, or
if there is anything I overlooked.
I appreciate the careful review and insights.

> Thanks! If it was running fine for you for this long, that is a very
> good data point :) Still, testing different scenarios/configurations
> would be nice to rule out that there is a (performance) regression
> somewhere else.

cheers,
Kefu

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 11:13 ` Kefu Chai @ 2026-04-13 13:14 ` DERUMIER, Alexandre 2026-04-13 13:18 ` Fiona Ebner 1 sibling, 0 replies; 16+ messages in thread From: DERUMIER, Alexandre @ 2026-04-13 13:14 UTC (permalink / raw) To: pve-devel, f.ebner, k.chai Hi, I have done a small test on a free cluster (this was with old xeon v3 with 18 ssd). For 4k randwrite, I don't see difference (but I think I'm write limited by the ssd), I'm around 40k~50k iops for 4k randread, I see 30% improvement (~70k iops -> ~100kiops) test: virtio-scsi + iothread + cache=none without tcmalloc: # fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=64 --ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][20%][r=273MiB/s][r=69.9k IOPS][eta 05m:12s] # fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --iodepth=64 --ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][13.6%][r=273MiB/s][r=69.9k IOPS][eta 05m:12s] fio: terminating on signal 2 test: (groupid=0, jobs=1): err= 0: pid=929282: Mon Apr 13 15:01:33 2026 read: IOPS=72.7k, BW=284MiB/s (298MB/s)(13.7GiB/49316msec) slat (usec): min=3, max=917, avg= 6.00, stdev= 3.66 clat (usec): min=169, max=117369, avg=873.83, stdev=521.46 lat (usec): min=180, max=117377, avg=879.83, stdev=521.44 clat percentiles (usec): | 1.00th=[ 408], 5.00th=[ 510], 10.00th=[ 578], 20.00th=[ 668], | 30.00th=[ 734], 40.00th=[ 799], 50.00th=[ 848], 60.00th=[ 906], | 70.00th=[ 971], 80.00th=[ 1057], 90.00th=[ 1172], 95.00th=[ 1303], | 99.00th=[ 1598], 99.50th=[ 1762], 99.90th=[ 2311], 99.95th=[ 2638], | 99.99th=[ 4490] bw ( KiB/s): min=228520, max=315464, per=100.00%, avg=291059.43, stdev=14202.45, samples=98 iops : min=57130, 
max=78866, avg=72764.90, stdev=3550.61, samples=98 lat (usec) : 250=0.01%, 500=4.32%, 750=27.98%, 1000=41.61% lat (msec) : 2=25.85%, 4=0.21%, 10=0.01%, 100=0.01%, 250=0.01% cpu : usr=15.15%, sys=50.34%, ctx=226303, majf=0, minf=4301 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=3584067,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=284MiB/s (298MB/s), 284MiB/s-284MiB/s (298MB/s-298MB/s), io=13.7GiB (14.7GB), run=49316-49316msec Disk stats (read/write): sdc: ios=3571207/0, merge=0/0, ticks=2868237/0, in_queue=2868237, util=99.90% with tcmalloc ------------- fio --filename=/dev/sda --direct=1 --rw=randread --bs=4k --iodepth=64 - -ioengine=libaio --name test test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.33 Starting 1 process ^Cbs: 1 (f=1): [r(1)][19.0%][r=383MiB/s][r=98.0k IOPS][eta 03m:42s] test: (groupid=0, jobs=1): err= 0: pid=1293: Mon Apr 13 15:05:04 2026 read: IOPS=95.7k, BW=374MiB/s (392MB/s)(19.1GiB/52256msec) slat (usec): min=2, max=1605, avg= 5.94, stdev= 3.93 clat (usec): min=150, max=7426, avg=661.83, stdev=204.22 lat (usec): min=157, max=7433, avg=667.77, stdev=204.42 clat percentiles (usec): | 1.00th=[ 347], 5.00th=[ 412], 10.00th=[ 453], 20.00th=[ 506], | 30.00th=[ 553], 40.00th=[ 586], 50.00th=[ 627], 60.00th=[ 668], | 70.00th=[ 717], 80.00th=[ 791], 90.00th=[ 906], 95.00th=[ 1020], | 99.00th=[ 1336], 99.50th=[ 1500], 99.90th=[ 1909], 99.95th=[ 2114], | 99.99th=[ 3359] bw ( KiB/s): min=324264, max=442120, per=100.00%, avg=383405.95, stdev=20263.50, samples=104 iops : min=81066, max=110530, avg=95851.60, stdev=5065.87, samples=104 lat (usec) : 250=0.02%, 500=18.47%, 750=56.25%, 1000=19.54% lat (msec) : 2=5.64%, 
4=0.06%, 10=0.01% cpu : usr=19.61%, sys=65.54%, ctx=262129, majf=0, minf=4296 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=5002932,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=374MiB/s (392MB/s), 374MiB/s-374MiB/s (392MB/s-392MB/s), io=19.1GiB (20.5GB), run=52256-52256msec Disk stats (read/write): sda: ios=4990801/0, merge=0/0, ticks=2862837/0, in_queue=2862837, util=99.91% -------- Message initial -------- De: Kefu Chai <k.chai@proxmox.com> À: Fiona Ebner <f.ebner@proxmox.com>, "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>, pve-devel@lists.proxmox.com <pve-devel@lists.proxmox.com> Objet: Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Date: 13/04/2026 13:13:12 On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: > Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: > > > > > > How does the performance change when doing IO within a QEMU > > > > guest? > > > > > > > > How does this affect the performance for other storage types, > > > > like > > > > ZFS, > > > > qcow2 on top of directory-based storages, qcow2 on top of LVM, > > > > LVM- > > > > thin, > > > > etc. and other workloads like saving VM state during snapshot, > > > > transfer > > > > during migration, maybe memory hotplug/ballooning, network > > > > performance > > > > for vNICs? Hi Fiona, Thanks for the questions. I traced QEMU's source code. It turns out that guest's RAM is allocated via direct mmap() calls, which completely bypassing QEMU's C library allocator. 
The path looks like: -m 4G on command line: memory_region_init_ram_flags_nomigrate() qemu_ram_alloc() qemu_ram_alloc_internal() g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata struct, about 512 bytes ram_block_add() qemu_anon_ram_alloc() qemu_ram_mmap(-1, size, ...) mmap_reserve(total) mmap(0, size, PROT_NONE, ...) <-- reserve address space mmap_activate(ptr, size, ...) mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM So the gigabytes of guest memory go straight to the kernel via mmap() and never touch malloc or tcmalloc. Only the small RAMBlock metadata structure (~512 bytes per region) goes through g_malloc0(). In other words, tcmalloc's scope is limited to QEMU's own working memory, among other things, block layer buffers, coroutine stacks, internal data structs. It does not touch gues RAM at all. The inflat/deflate path of balloon code does involve glib malloc. It's virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to track partially-ballooned pages. But these bitmaps tracks only metadata, whose memory footprints are relatively small. And this only happens in infrequent operations. Hopefully, this addresses the balloon concern. If I've missed something or misread the code, please help point it out. When it comes to different storage types and workloads, I ran benchmarks covering some scenarios you listed. All comparison use the same binary (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so the allocator is the only variable. Storage backends (block layer, via qemu-img bench) 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. backend glibc ops/s tcmalloc ops/s delta qcow2 on ext4 1,188,495 1,189,343 +0.1% qcow2 on ext4 write 1,036,914 1,036,699 0.0% raw on ext4 1,263,583 1,277,465 +1.1% raw on LVM-thin 433,727 433,576 0.0% The reason why raw/LVM-thin is slower, is that, I *think* it actually hits the dm-thin layer rather than page cache. 
The allocator delta is noise in all cases. RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): path glibc tcmalloc delta qemu-img bench + librbd 19,111 19,156 +0.2% rbd bench (librbd direct) 35,329 36,622 +3.7% I don't have ZFS configured on this (host) machine, but QEMU's file I/O path to ZFS goes through the same code as ext4, so the difference is in the kernel. I'd expect the same null result, though I could be wrong and am happy to set up a test pool if you'd like to see the numbers. Guest I/O (Debian 13 guest, virtio-blk) Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, qcow2 on ext4, cache=none, aio=native) against a second virtio disk (/dev/vdb, 8 GB qcow2). workload glibc tcmalloc delta dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% iflag=direct (sequential) 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% count=100k iflag=direct (each starting at a different offset) Migration and savevm Tested with a 4 GB guest where about 2 GB of RAM was dirtied (filled /dev/shm with urandom, 8x256 MB) before triggering each operation. scenario glibc tcmalloc delta migrate (exec: URI) 0.622 s 0.622 s 0.0% savevm (qcow2 snap) 0.503 s 0.504 s +0.2% What I didn't measure I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel side, while QEMU only handles the control plane. Hence I'd expect very little allocator impact there, though I could be wrong. I also didn't test memory hotplug on ballooning cases, because, as explained above, these are rare one-shot operations. and as the source code trace shows malloc is not involved. But again, happy to look into it, if it's still a concern. > > > > Hi Fiona, > > > > I'm stil running in production (I have keeped tcmalloc after the > > removal some year ago from the pve build), and I didn't notice > > problem. > > (but I still don't use pbs). > > > > But I never have done bench with/without it since 5/6 year. 
And thanks Alexandre for sharing your production experience, that's very valuable context. > > > > Maybe vm memory should be checked too, I'm thinking about RSS > > memory > > with balloon free_page_reporting, to see if it's correcting freeing > > memory. For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path: virtual_balloon_handle_report() ram_block_discard_range() madvise(host_startaddr, kebngth, QEMU_MADV_DONTNEED) This operates directly on the mmap'ed guest RAM region. Still no malloc involvement anywhere in this path. In short, the tests above reveal a consistent pattern: tcmalloc helps where allocation pressure is high, and is neutral everywhere else. Please let me know if you'd like more data on any specific workload, or if there is anything I overlooked. I appreciate the careful review and insights. > > Thanks! If it was running fine for you for this long, that is a very > good data point :) Still, testing different scenarios/configurations > would be nice to rule out that there is a (performance) regression > somewhere else. cheers, Kefu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 11:13 ` Kefu Chai 2026-04-13 13:14 ` DERUMIER, Alexandre @ 2026-04-13 13:18 ` Fiona Ebner 2026-04-14 5:41 ` Kefu Chai 1 sibling, 1 reply; 16+ messages in thread From: Fiona Ebner @ 2026-04-13 13:18 UTC (permalink / raw) To: Kefu Chai, DERUMIER, Alexandre, pve-devel Am 13.04.26 um 1:12 PM schrieb Kefu Chai: > On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: >> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: >>> >>>>> How does the performance change when doing IO within a QEMU guest? >>>>> >>>>> How does this affect the performance for other storage types, like >>>>> ZFS, >>>>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>>>> thin, >>>>> etc. and other workloads like saving VM state during snapshot, >>>>> transfer >>>>> during migration, maybe memory hotplug/ballooning, network >>>>> performance >>>>> for vNICs? > > Hi Fiona, > > Thanks for the questions. > > I traced QEMU's source code. It turns out that guest's RAM is allocated > via direct mmap() calls, which completely bypassing QEMU's C library > allocator. The path looks like: > > -m 4G on command line: > > memory_region_init_ram_flags_nomigrate() > qemu_ram_alloc() > qemu_ram_alloc_internal() > g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata > struct, about 512 bytes > ram_block_add() > qemu_anon_ram_alloc() > qemu_ram_mmap(-1, size, ...) > mmap_reserve(total) > mmap(0, size, PROT_NONE, ...) <-- reserve address space > mmap_activate(ptr, size, ...) > mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM > > So the gigabytes of guest memory go straight to the kernel via mmap() > and never touch malloc or tcmalloc. Only the small RAMBlock metadata > structure (~512 bytes per region) goes through g_malloc0(). > > In other words, tcmalloc's scope is limited to QEMU's own working > memory, among other things, block layer buffers, coroutine stacks, > internal data structs. 
It does not touch gues RAM at all. > > The inflat/deflate path of balloon code does involve glib malloc. It's > virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to > track partially-ballooned pages. But these bitmaps tracks only metadata, > whose memory footprints are relatively small. And this only happens in > infrequent operations. > > Hopefully, this addresses the balloon concern. If I've missed something > or misread the code, please help point it out. > > When it comes to different storage types and workloads, I ran benchmarks > covering some scenarios you listed. All comparison use the same binary > (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so > the allocator is the only variable. > > > Storage backends (block layer, via qemu-img bench) > > 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best > of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. > > backend glibc ops/s tcmalloc ops/s delta > qcow2 on ext4 1,188,495 1,189,343 +0.1% > qcow2 on ext4 write 1,036,914 1,036,699 0.0% > raw on ext4 1,263,583 1,277,465 +1.1% > raw on LVM-thin 433,727 433,576 0.0% > > The reason why raw/LVM-thin is slower, is that, I *think* it actually > hits the dm-thin layer rather than page cache. The allocator delta is > noise in all cases. > > RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): > > path glibc tcmalloc delta > qemu-img bench + librbd 19,111 19,156 +0.2% > rbd bench (librbd direct) 35,329 36,622 +3.7% > > I don't have ZFS configured on this (host) machine, but QEMU's file I/O > path to ZFS goes through the same code as ext4, so the difference > is in the kernel. I'd expect the same null result, > though I could be wrong and am happy to set up a test pool if > you'd like to see the numbers. > > Guest I/O (Debian 13 guest, virtio-blk) > > Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, > qcow2 on ext4, cache=none, aio=native) against a second virtio > disk (/dev/vdb, 8 GB qcow2). 
> > workload glibc tcmalloc delta > dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% > iflag=direct (sequential) > 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% > count=100k iflag=direct > (each starting at a different offset) > > Migration and savevm > > Tested with a 4 GB guest where about 2 GB of RAM was dirtied > (filled /dev/shm with urandom, 8x256 MB) before triggering each > operation. > > scenario glibc tcmalloc delta > migrate (exec: URI) 0.622 s 0.622 s 0.0% > savevm (qcow2 snap) 0.503 s 0.504 s +0.2% > > What I didn't measure > > I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel > side, while QEMU only handles the control plane. Hence I'd expect very > little allocator impact there, though I could be wrong. > > I also didn't test memory hotplug on ballooning cases, because, as > explained above, these are rare one-shot operations. and as the source > code trace shows malloc is not involved. But again, happy to look into > it, if it's still a concern. > Thank you for testing and the analysis! I'd still appreciate if somebody could test these, just to be sure. It is a rather core change after all. I'll also try to find some time to test around, but more eyes cannot hurt here! That said, it looks okay in general, so I'm giving a preliminary: Acked-by: Fiona Ebner <f.ebner@proxmox.com> >>> >>> Hi Fiona, >>> >>> I'm stil running in production (I have keeped tcmalloc after the >>> removal some year ago from the pve build), and I didn't notice problem. >>> (but I still don't use pbs). >>> >>> But I never have done bench with/without it since 5/6 year. > > And thanks Alexandre for sharing your production experience, that's very > valuable context. > >>> >>> Maybe vm memory should be checked too, I'm thinking about RSS memory >>> with balloon free_page_reporting, to see if it's correcting freeing >>> memory. 
> > For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path: > > virtual_balloon_handle_report() > ram_block_discard_range() > madvise(host_startaddr, kebngth, QEMU_MADV_DONTNEED) > > This operates directly on the mmap'ed guest RAM region. Still no malloc > involvement anywhere in this path. > > In short, the tests above reveal a consistent pattern: tcmalloc helps > where allocation pressure is high, and is neutral everywhere else. > Please let me know if you'd like more data on any specific workload, or > if there is anything I overlooked. I appreciate the careful review and > insights. > > >> >> Thanks! If it was running fine for you for this long, that is a very >> good data point :) Still, testing different scenarios/configurations >> would be nice to rule out that there is a (performance) regression >> somewhere else. > > cheers, > Kefu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator 2026-04-13 13:18 ` Fiona Ebner @ 2026-04-14 5:41 ` Kefu Chai 0 siblings, 0 replies; 16+ messages in thread From: Kefu Chai @ 2026-04-14 5:41 UTC (permalink / raw) To: Fiona Ebner, DERUMIER, Alexandre, pve-devel On Mon Apr 13, 2026 at 9:18 PM CST, Fiona Ebner wrote: > Am 13.04.26 um 1:12 PM schrieb Kefu Chai: >> On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote: >>> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre: >>>> >>>>>> How does the performance change when doing IO within a QEMU guest? >>>>>> >>>>>> How does this affect the performance for other storage types, like >>>>>> ZFS, >>>>>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM- >>>>>> thin, >>>>>> etc. and other workloads like saving VM state during snapshot, >>>>>> transfer >>>>>> during migration, maybe memory hotplug/ballooning, network >>>>>> performance >>>>>> for vNICs? >> >> Hi Fiona, >> >> Thanks for the questions. >> >> I traced QEMU's source code. It turns out that guest's RAM is allocated >> via direct mmap() calls, which completely bypassing QEMU's C library >> allocator. The path looks like: >> >> -m 4G on command line: >> >> memory_region_init_ram_flags_nomigrate() >> qemu_ram_alloc() >> qemu_ram_alloc_internal() >> g_malloc0(sizeof(*new_block)) <-- only the RAMBlock metadata >> struct, about 512 bytes >> ram_block_add() >> qemu_anon_ram_alloc() >> qemu_ram_mmap(-1, size, ...) >> mmap_reserve(total) >> mmap(0, size, PROT_NONE, ...) <-- reserve address space >> mmap_activate(ptr, size, ...) >> mmap(ptr, size, PROT_RW, ...) <-- actual guest RAM >> >> So the gigabytes of guest memory go straight to the kernel via mmap() >> and never touch malloc or tcmalloc. Only the small RAMBlock metadata >> structure (~512 bytes per region) goes through g_malloc0(). 
>> >> In other words, tcmalloc's scope is limited to QEMU's own working >> memory, among other things, block layer buffers, coroutine stacks, >> internal data structs. It does not touch gues RAM at all. >> >> The inflat/deflate path of balloon code does involve glib malloc. It's >> virtio_balloon_pbp_alloc(), which calls bitmap_new() via g_new0() to >> track partially-ballooned pages. But these bitmaps tracks only metadata, >> whose memory footprints are relatively small. And this only happens in >> infrequent operations. >> >> Hopefully, this addresses the balloon concern. If I've missed something >> or misread the code, please help point it out. >> >> When it comes to different storage types and workloads, I ran benchmarks >> covering some scenarios you listed. All comparison use the same binary >> (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4, so >> the allocator is the only variable. >> >> >> Storage backends (block layer, via qemu-img bench) >> >> 4K reads, depth=32, io_uring, cache=none, 5M ops per run, best >> of 3. Host NVMe (ext4) or LVM-thin backed by NVMe. >> >> backend glibc ops/s tcmalloc ops/s delta >> qcow2 on ext4 1,188,495 1,189,343 +0.1% >> qcow2 on ext4 write 1,036,914 1,036,699 0.0% >> raw on ext4 1,263,583 1,277,465 +1.1% >> raw on LVM-thin 433,727 433,576 0.0% >> >> The reason why raw/LVM-thin is slower, is that, I *think* it actually >> hits the dm-thin layer rather than page cache. The allocator delta is >> noise in all cases. >> >> RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe): >> >> path glibc tcmalloc delta >> qemu-img bench + librbd 19,111 19,156 +0.2% >> rbd bench (librbd direct) 35,329 36,622 +3.7% >> >> I don't have ZFS configured on this (host) machine, but QEMU's file I/O >> path to ZFS goes through the same code as ext4, so the difference >> is in the kernel. I'd expect the same null result, >> though I could be wrong and am happy to set up a test pool if >> you'd like to see the numbers. 
>> >> Guest I/O (Debian 13 guest, virtio-blk) >> >> Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, >> qcow2 on ext4, cache=none, aio=native) against a second virtio >> disk (/dev/vdb, 8 GB qcow2). >> >> workload glibc tcmalloc delta >> dd if=/dev/vdb bs=4M count=1024 15.3 GB/s 15.5 GB/s +1.3% >> iflag=direct (sequential) >> 8x parallel dd if=/dev/vdb bs=4k 787 MB/s 870 MB/s +10.5% >> count=100k iflag=direct >> (each starting at a different offset) >> >> Migration and savevm >> >> Tested with a 4 GB guest where about 2 GB of RAM was dirtied >> (filled /dev/shm with urandom, 8x256 MB) before triggering each >> operation. >> >> scenario glibc tcmalloc delta >> migrate (exec: URI) 0.622 s 0.622 s 0.0% >> savevm (qcow2 snap) 0.503 s 0.504 s +0.2% >> >> What I didn't measure >> >> I did'nt test ZFS/vNIC throughput, as the host path lives in the kernel >> side, while QEMU only handles the control plane. Hence I'd expect very >> little allocator impact there, though I could be wrong. >> >> I also didn't test memory hotplug on ballooning cases, because, as >> explained above, these are rare one-shot operations. and as the source >> code trace shows malloc is not involved. But again, happy to look into >> it, if it's still a concern. >> > > Thank you for testing and the analysis! > > I'd still appreciate if somebody could test these, just to be sure. It > is a rather core change after all. I'll also try to find some time to > test around, but more eyes cannot hurt here! > > That said, it looks okay in general, so I'm giving a preliminary: > > Acked-by: Fiona Ebner <f.ebner@proxmox.com> > Thanks for the ack, Fiona! I agree more testing can't hurt for the critical software in our stack, so I went ahead and ran through the remaining scenarios, ZFS, balloon, vNIC and snapshot, using PVE's qm tooling this time instead of bare QEMU. 
To avoid the QEMU version difference muddying the numbers, I built two
packages from the same 10.2.1 source tree: one with
--enable-malloc=tcmalloc (plus the proposed patch) and one without.
Same compiler, same patches 0001-0047; as yesterday, the allocator is
the only variable. The test VM is a Debian 13 cloud image guest (4
vCPU, 4 GB RAM, virtio-scsi-single + iothread on LVM-thin, Samsung
9100 PRO NVMe).

Guest I/O on LVM-thin (dd inside guest, best of 3)

  workload                   glibc       tcmalloc    delta
  seq 4M read (1024 x dd)    16.6 GB/s   16.6 GB/s    0.0%
  par 4K direct (8x dd)       578 MB/s    576 MB/s   -0.3%

Snapshot (qm snapshot, 2 GB of dirty guest RAM)

  glibc:    3.502 s
  tcmalloc: 3.544 s (+1.2%, noise)

Balloon (QMP balloon 4G -> 2G)

  Both correctly deflated:

  glibc:    RSS 2,462 MB -> 2,328 MB (freed 135 MB)
  tcmalloc: RSS 2,496 MB ->   707 MB (freed 1,789 MB)

  The difference in freed amount comes from different initial page
  residency, not the allocator.

Storage backends (tcmalloc, absolute numbers)

  I moved the VM disk between backends to cover ZFS and
  qcow2-on-directory. Everything worked without issues:

  backend                     seq read    par 4K read
  LVM-thin (raw, NVMe)        17.5 GB/s   670 MB/s
  ZFS (zvol, file-backed)     11.3 GB/s   696 MB/s
  qcow2 on ext4 (directory)   19.6 GB/s   763 MB/s

  ZFS is slower because the pool is file-backed (double indirection),
  not because of the allocator. The block-layer A/B from my earlier
  mail already showed the allocator is neutral across backends, so I
  didn't repeat the full A/B for each one.

vNIC (iperf3 over vmbr0, 20 rounds each)

  This one needed more rounds to get a stable picture -- early attempts
  with 5 rounds looked misleadingly noisy.

                        glibc        tcmalloc     delta
  host->guest mean      132.2 Gbps   132.9 Gbps   +0.6%
  host->guest stddev    17.4         20.6
  guest->host mean       69.6 Gbps    72.0 Gbps   +3.5%
  guest->host stddev    17.1         15.0

  Welch's t-test: t=0.12 (h2g), t=0.48 (g2h), neither significant at
  p=0.05. The stddev (~17 Gbps) dwarfs the mean difference, so there's
  no detectable allocator effect here.
The hot path is in vhost-net on the kernel side, so this is pretty
much what I'd expect.

Overall:

workload           delta       method
guest seq I/O      0.0%        same-version A/B
guest par 4K I/O   -0.3%       same-version A/B (noise)
qm snapshot        +1.2%       same-version A/B (noise)
balloon            OK          both deflate correctly
vNIC               +0.6/+3.5%  20 rounds, not significant
block layer (all)  0 to +1%    LD_PRELOAD, same binary
RBD (librbd)       +3.7%       LD_PRELOAD, same binary

No regressions on any tested path. Together with Alexandre's
production fio numbers (+31.6% on 4K randread on a real cluster), I
think we have a more solid picture now.

Please let me know if there's anything else I should look into, or if
you'd like me to re-run something with different parameters!

cheers,
Kefu

>>>>
>>>> Hi Fiona,
>>>>
>>>> I'm still running it in production (I have kept tcmalloc after its
>>>> removal from the pve build some years ago), and I didn't notice any
>>>> problem (but I still don't use pbs).
>>>>
>>>> But I haven't done a bench with/without it in the last 5-6 years.
>>
>> And thanks Alexandre for sharing your production experience, that's
>> very valuable context.
>>
>>>>
>>>> Maybe vm memory should be checked too, I'm thinking about RSS memory
>>>> with balloon free_page_reporting, to see if it's correctly freeing
>>>> memory.
>>
>> For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path:
>>
>>   virtio_balloon_handle_report()
>>     ram_block_discard_range()
>>       madvise(host_startaddr, length, QEMU_MADV_DONTNEED)
>>
>> This operates directly on the mmap'ed guest RAM region. Still no
>> malloc involvement anywhere in this path.
>>
>> In short, the tests above reveal a consistent pattern: tcmalloc
>> helps where allocation pressure is high, and is neutral everywhere
>> else. Please let me know if you'd like more data on any specific
>> workload, or if there is anything I overlooked. I appreciate the
>> careful review and insights.
>>
>>
>>>
>>> Thanks!
>>> If it was running fine for you for this long, that is a very
>>> good data point :) Still, testing different scenarios/configurations
>>> would be nice to rule out that there is a (performance) regression
>>> somewhere else.
>>
>> cheers,
>> Kefu

^ permalink raw reply	[flat|nested] 16+ messages in thread
* superseded: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
  2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
                   ` (2 preceding siblings ...)
  2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
@ 2026-04-14  5:48 ` Kefu Chai
  3 siblings, 0 replies; 16+ messages in thread
From: Kefu Chai @ 2026-04-14 5:48 UTC (permalink / raw)
To: pve-devel

superseded by
https://lore.proxmox.com/pve-devel/20260414054645.405151-1-k.chai@proxmox.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2026-04-14 8:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10  4:30 [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator Kefu Chai
2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
2026-04-13 13:12 ` Fiona Ebner
2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
2026-04-13 13:13 ` Fiona Ebner
2026-04-13 13:23 ` DERUMIER, Alexandre
2026-04-13 13:40 ` Fabian Grünbichler
2026-04-14  8:10 ` Fiona Ebner
2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2026-04-10 10:45 ` DERUMIER, Alexandre
2026-04-13  8:14 ` Fiona Ebner
2026-04-13 11:13 ` Kefu Chai
2026-04-13 13:14 ` DERUMIER, Alexandre
2026-04-13 13:18 ` Fiona Ebner
2026-04-14  5:41 ` Kefu Chai
2026-04-14  5:48 ` superseded: " Kefu Chai