From: "Kefu Chai" <k.chai@proxmox.com>
Date: Thu, 26 Mar 2026 10:54:41 +0800
Subject: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
List-Id: Proxmox VE development discussion

Hi all,

I'd like to revisit the question of using an alternative memory allocator in pve-qemu.
The topic came up in a recent forum thread [1] about Ceph read performance on multi-AZ clusters, where both read-replica localization and tcmalloc were discussed as complementary optimizations. After digging into the history and source code, I think there may be a path to re-enabling tcmalloc (or jemalloc) that addresses the issues that led to their removal.

## Background: benchmark numbers

The Ceph blog post "QEMU/KVM Tuning" [2] reports a ~50% improvement on 16KB random reads (53.5k -> 80k IOPS) when using LD_PRELOAD="/usr/lib64/libtcmalloc.so" with QEMU+librbd. spirit's pve-devel patch [3] reported similar numbers on 4K randread (60k -> 90k IOPS, also +50%). The mechanism is well understood: librbd creates many small temporary objects per I/O operation, and glibc's allocator is slow under that fragmentation pattern.

## Background: malloc_trim interaction

On top of the allocation overhead, QEMU's RCU call thread (util/rcu.c:call_rcu_thread) calls malloc_trim(4 MiB) each time the RCU callback queue drains. Under I/O load this can happen several times per second. malloc_trim walks the glibc heap looking for pages to return to the OS via madvise(MADV_DONTNEED), which adds CPU overhead.

QEMU upstream handles this correctly: when built with --enable-malloc=tcmalloc (or jemalloc), meson.build sets has_malloc_trim = false, which compiles out all malloc_trim calls. Bonzini's commit aa087962 even makes --enable-malloc-trim together with --enable-malloc=tcmalloc a fatal build error; the two are explicitly mutually exclusive.

## History in pve-qemu

pve-qemu has tried both allocators before:

1. tcmalloc was enabled in 2.3-2 (Jun 10, 2015) and removed 8 days later in 2.3-4 (Jun 18, 2015). The changelog gives no reason.

2. jemalloc replaced it in 2.3-5 (Jun 19, 2015) and lasted until 5.1.0-8 (Dec 2020), when Stefan Reiter removed it in commit 3d785ea with the explanation: "jemalloc does not play nice with our Rust library (proxmox-backup-qemu), specifically it never releases memory allocated from Rust to the OS. This leads to a problem with larger caches (e.g. for the PBS block driver)." Stefan referenced jemalloc#1398 [4]. The upstream fix was background_thread:true in MALLOC_CONF, which Stefan described as "weirdly hacky."

3. In Aug 2023, Fiona Ebner added an explicit malloc_trim(4 MiB) call to the backup completion path in commit 5f9cb29, because without it, QEMU's RSS ballooned from ~370 MB to ~2.1 GB after a PBS backup and never came back. The forum thread that motivated this was [5].

## The core problem and a possible solution

The reason both allocators were removed boils down to this: after a PBS backup, the Rust-based proxmox-backup-qemu library frees a large amount of memory, but neither jemalloc nor tcmalloc returns it to the OS promptly:

- jemalloc's time-based decay requires ongoing allocation activity to make progress; when the application goes idle after a backup, nothing triggers the decay.

- gperftools tcmalloc's IncrementalScavenge (page_heap.cc) is likewise driven by page-level deallocations, not by time. With the default TCMALLOC_RELEASE_RATE=1.0, it releases 1 page per 1000 pages deallocated. If the application goes idle, no scavenging happens.

But both allocators provide explicit "release everything now" APIs:

- gperftools: MallocExtension::instance()->ReleaseFreeMemory() (which calls ReleaseToSystem(LONG_MAX) internally)

- jemalloc: mallctl("arena.<i>.purge", ...)
The current pve-backup.c code already has the right structure:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #endif

This could be extended to:

    #if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
    #elif defined(CONFIG_TCMALLOC)
        MallocExtension_ReleaseFreeMemory();
    #elif defined(CONFIG_JEMALLOC)
        /* call mallctl to purge all arenas */
    #endif

QEMU's meson build doesn't currently define CONFIG_TCMALLOC or CONFIG_JEMALLOC, but the value of the malloc meson option is available there, so adding them would be straightforward. The RCU thread's malloc_trim is already handled: --enable-malloc=tcmalloc compiles it out, and the alternative allocator handles steady-state memory management on its own.

## Questions for the team

1. Does anyone remember why tcmalloc was removed after only 8 days in 2015 (2.3-4)? The changelog just says "remove tcmalloc" with no explanation.

2. Fiona: given the backup memory reclamation work in 5f9cb29, would you see a problem with replacing malloc_trim with the allocator-specific release API in the backup completion path?

3. Has the proxmox-backup-qemu library's allocation pattern changed since 2020 in ways that might affect this? (e.g. different buffer management, Rust allocator changes)

4. Would anyone be willing to test a pve-qemu build with --enable-malloc=tcmalloc plus the allocator-aware release patch against the PBS backup workload? The key metric is VmRSS in /proc/<pid>/status before and after a backup cycle.

I'm happy to prepare a patch series if there's interest, but wanted to get input first given the history.

[1] https://forum.proxmox.com/threads/181751/
[2] https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
[3] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056815.html
[4] https://github.com/jemalloc/jemalloc/issues/1398
[5] https://forum.proxmox.com/threads/131339/

Regards,
Kefu