From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 0665E1FF140
	for <pve-devel@lists.proxmox.com>; Fri, 10 Apr 2026 06:30:24 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 3A1B312E60;
	Fri, 10 Apr 2026 06:31:08 +0200 (CEST)
From: Kefu Chai <k.chai@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
Date: Fri, 10 Apr 2026 12:30:25 +0800
Message-ID: <20260410043027.3621673-1-k.chai@proxmox.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
List-Id: Proxmox VE development discussion

Following up on the RFC thread [0], here's the formal submission to
re-enable tcmalloc for pve-qemu.

Quick recap: librbd's I/O path allocates a lot of small, short-lived
objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.), and
glibc's ptmalloc2 handles this pattern poorly -- cross-thread arena
contention and cache-line bouncing show up clearly in perf profiles.
tcmalloc's per-thread fast path avoids both.

A bit of history for context: tcmalloc was tried in 2015 but dropped
after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
jemalloc replaced it but was dropped in 2020 because it did not release
Rust-allocated memory (from proxmox-backup-qemu) back to the OS. PVE 9
ships gperftools 2.16, and patch 1/2 addresses the reclamation gap
explicitly.

On Dietmar's two concerns from the RFC:

"Could ReleaseFreeMemory() halt the application?"
-- No, and I verified this directly. It walks tcmalloc's page heap free
span lists and calls madvise(MADV_DONTNEED) on each span. It does not
walk allocated memory or compact the heap. A standalone test reclaimed
386 MB of 410 MB of cached memory (94%) in effectively zero wall time.
The call runs once at backup completion, in the same spot where
malloc_trim runs today.

"Wouldn't a pool allocator in librbd be the proper fix?"
-- In principle yes, but I audited librbd in Ceph Squid and it does NOT
use a pool allocator -- all I/O-path objects go through plain new.
Ceph's mempool is tracking-only, not actual pooling. Adding real pooling
would be a significant Ceph-side change (submission and completion
happen on different threads), and it is orthogonal to the allocator
choice here.

Also thanks to Alexandre for confirming the 2015 gperftools issues are
long resolved.

Test results
------------

Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe). This
is the worst case for showing allocator impact, since there is no
network latency for the CPU savings to amortize against:

    rbd bench --io-type read --io-size 4096 --io-threads 16 \
        --io-pattern rand

    Metric        | glibc ptmalloc2 | tcmalloc  | Delta
    --------------+-----------------+-----------+--------
    IOPS          | 131,201         | 136,389   | +4.0%
    CPU time      | 1,556 ms        | 1,439 ms  | -7.5%
    Cycles        | 6.74B           | 6.06B     | -10.1%
    Cache misses  | 137.1M          | 123.9M    | -9.6%

perf report on the glibc run shows ~8% of CPU time in allocator
internals (_int_malloc, cfree, malloc_consolidate, _int_free_*); the
same symbols are barely visible with tcmalloc because its fast path is
just a pointer bump. The Ceph blog [1] reports ~50% IOPS gains on
production clusters where network RTT dominates per-I/O latency -- the
10% CPU savings compound there, since the host can push more I/O into
the pipeline during the same wall time.

The series is small:

1/2 adds the QEMU source patch (0048) with the CONFIG_TCMALLOC meson
define and the ReleaseFreeMemory() call in pve-backup.c's cleanup path.

2/2 adds libgoogle-perftools-dev to Build-Depends and
--enable-malloc=tcmalloc to configure. The runtime dependency
libgoogle-perftools4t64 (>= 2.16) is picked up automatically by
dh_shlibdeps.
[0]: https://lore.proxmox.com/pve-devel/DHCDIFA0P8QP.2CTY4G4EEGKQ0@proxmox.com/
[1]: https://ceph.io/en/news/blog/2023/reef-freelist-bench/

Kefu Chai (2):
  PVE: use tcmalloc as the memory allocator
  d/rules: enable tcmalloc as the memory allocator

 debian/control                                |  1 +
 ...use-tcmalloc-as-the-memory-allocator.patch | 77 +++++++++++++++++++
 debian/patches/series                         |  1 +
 debian/rules                                  |  1 +
 4 files changed, 80 insertions(+)
 create mode 100644 debian/patches/pve/0048-PVE-use-tcmalloc-as-the-memory-allocator.patch

-- 
2.47.3