From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 14 Apr 2026 13:41:37 +0800
From: "Kefu Chai"
To: "Fiona Ebner" , "DERUMIER, Alexandre" , "pve-devel@lists.proxmox.com"
Subject: Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
X-Mailer: aerc 0.20.0
References: <20260410043027.3621673-1-k.chai@proxmox.com> <4628fcc1c283bc4ae80f19e6fe8ae922c0968af9.camel@groupe-cyllene.com> <9db86c5e-d382-4eed-a1fd-905e44a259e1@proxmox.com> <88ad2c07-ef25-40e0-b921-9d715284788e@proxmox.com>
In-Reply-To: <88ad2c07-ef25-40e0-b921-9d715284788e@proxmox.com>
List-Id: Proxmox VE development discussion

On Mon Apr 13, 2026 at 9:18 PM CST, Fiona Ebner wrote:
> On 13.04.26 at 1:12 PM, Kefu Chai wrote:
>> On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote:
>>> On 10.04.26 at 12:44 PM, DERUMIER, Alexandre wrote:
>>>>
>>>>>> How does the performance change when doing IO within a QEMU guest?
>>>>>>
>>>>>> How does this affect the performance for other storage types, like
>>>>>> ZFS, qcow2 on top of directory-based storages, qcow2 on top of LVM,
>>>>>> LVM-thin, etc. and other workloads like saving VM state during
>>>>>> snapshot, transfer during migration, maybe memory
>>>>>> hotplug/ballooning, network performance for vNICs?
>>
>> Hi Fiona,
>>
>> Thanks for the questions.
>>
>> I traced QEMU's source code. It turns out that the guest's RAM is
>> allocated via direct mmap() calls, completely bypassing QEMU's C
>> library allocator. The path looks like:
>>
>> -m 4G on the command line:
>>
>>   memory_region_init_ram_flags_nomigrate()
>>     qemu_ram_alloc()
>>       qemu_ram_alloc_internal()
>>         g_malloc0(sizeof(*new_block))  <-- only the RAMBlock metadata
>>                                            struct, about 512 bytes
>>         ram_block_add()
>>           qemu_anon_ram_alloc()
>>             qemu_ram_mmap(-1, size, ...)
>>               mmap_reserve(total)
>>                 mmap(0, size, PROT_NONE, ...)  <-- reserve address space
>>               mmap_activate(ptr, size, ...)
>>                 mmap(ptr, size, PROT_RW, ...)
>>                                              <-- actual guest RAM
>>
>> So the gigabytes of guest memory go straight to the kernel via mmap()
>> and never touch malloc or tcmalloc. Only the small RAMBlock metadata
>> structure (~512 bytes per region) goes through g_malloc0().
>>
>> In other words, tcmalloc's scope is limited to QEMU's own working
>> memory: block-layer buffers, coroutine stacks, internal data
>> structures, and so on. It does not touch guest RAM at all.
>>
>> The inflate/deflate path of the balloon code does involve glib malloc:
>> virtio_balloon_pbp_alloc() calls bitmap_new() (via g_new0()) to track
>> partially-ballooned pages. But these bitmaps track only metadata,
>> their memory footprint is small, and they are allocated only during
>> infrequent operations.
>>
>> Hopefully this addresses the balloon concern. If I've missed something
>> or misread the code, please point it out.
>>
>> As for different storage types and workloads, I ran benchmarks
>> covering some of the scenarios you listed. All comparisons use the
>> same binary (pve-qemu-kvm 10.1.2-7) with and without
>> LD_PRELOAD=libtcmalloc.so.4, so the allocator is the only variable.
>>
>> Storage backends (block layer, via qemu-img bench)
>>
>>   4K reads, depth=32, io_uring, cache=none, 5M ops per run, best of 3.
>>   Host NVMe (ext4) or LVM-thin backed by NVMe.
>>
>>   backend               glibc ops/s   tcmalloc ops/s   delta
>>   qcow2 on ext4           1,188,495        1,189,343   +0.1%
>>   qcow2 on ext4, write    1,036,914        1,036,699    0.0%
>>   raw on ext4             1,263,583        1,277,465   +1.1%
>>   raw on LVM-thin           433,727          433,576    0.0%
>>
>> raw/LVM-thin is slower because, I *think*, it actually hits the
>> dm-thin layer rather than the page cache. The allocator delta is noise
>> in all cases.
>>
>> RBD, tested via a local vstart cluster (3 OSDs, bluestore on NVMe):
>>
>>   path                        glibc   tcmalloc   delta
>>   qemu-img bench + librbd    19,111     19,156   +0.2%
>>   rbd bench (librbd direct)  35,329     36,622   +3.7%
>>
>> I don't have ZFS configured on this (host) machine, but QEMU's file
>> I/O path to ZFS goes through the same code as ext4, so any difference
>> would be in the kernel. I'd expect the same null result, though I
>> could be wrong and am happy to set up a test pool if you'd like to see
>> the numbers.
>>
>> Guest I/O (Debian 13 guest, virtio-blk)
>>
>>   Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM, qcow2
>>   on ext4, cache=none, aio=native) against a second virtio disk
>>   (/dev/vdb, 8 GB qcow2).
>>
>>   workload                               glibc      tcmalloc   delta
>>   dd if=/dev/vdb bs=4M count=1024        15.3 GB/s  15.5 GB/s  +1.3%
>>     iflag=direct (sequential)
>>   8x parallel dd if=/dev/vdb bs=4k       787 MB/s   870 MB/s   +10.5%
>>     count=100k iflag=direct
>>     (each starting at a different offset)
>>
>> Migration and savevm
>>
>>   Tested with a 4 GB guest where about 2 GB of RAM was dirtied (filled
>>   /dev/shm with urandom, 8x256 MB) before triggering each operation.
>>
>>   scenario              glibc    tcmalloc   delta
>>   migrate (exec: URI)   0.622 s  0.622 s     0.0%
>>   savevm (qcow2 snap)   0.503 s  0.504 s    +0.2%
>>
>> What I didn't measure
>>
>> I didn't test ZFS/vNIC throughput, as the host-side path lives in the
>> kernel, while QEMU only handles the control plane. Hence I'd expect
>> very little allocator impact there, though I could be wrong.
>>
>> I also didn't test memory hotplug or ballooning, because, as explained
>> above, these are rare, one-shot operations, and as the source-code
>> trace shows, malloc is not involved. But again, I'm happy to look into
>> it if it's still a concern.
>>
>
> Thank you for testing and the analysis!
>
> I'd still appreciate it if somebody could test these, just to be sure.
> It is a rather core change after all.
> I'll also try to find some time to test around, but more eyes cannot
> hurt here!
>
> That said, it looks okay in general, so I'm giving a preliminary:
>
> Acked-by: Fiona Ebner
>

Thanks for the ack, Fiona!

I agree more testing can't hurt for such critical software in our stack,
so I went ahead and ran through the remaining scenarios (ZFS, balloon,
vNIC and snapshot), using PVE's qm tooling this time instead of bare
QEMU.

To keep QEMU version differences from muddying the numbers, I built two
packages from the same 10.2.1 source tree: one with
--enable-malloc=tcmalloc (plus the proposed patch) and one without. Same
compiler, same patches 0001-0047; the allocator is the only variable, as
in yesterday's runs.

Test VM: a Debian 13 cloud image guest (4 vCPU, 4 GB RAM,
virtio-scsi-single + iothread on LVM-thin, Samsung 9100 PRO NVMe).

Guest I/O on LVM-thin (dd inside guest, best of 3)

  workload                 glibc      tcmalloc   delta
  seq 4M read (1024 x dd)  16.6 GB/s  16.6 GB/s   0.0%
  par 4K direct (8x dd)    578 MB/s   576 MB/s   -0.3%

Snapshot (qm snapshot, 2 GB of dirty guest RAM)

  glibc:    3.502 s
  tcmalloc: 3.544 s  (+1.2%, noise)

Balloon (QMP balloon 4G -> 2G)

  Both correctly deflated:

  glibc:    RSS 2,462 MB -> 2,328 MB  (freed   135 MB)
  tcmalloc: RSS 2,496 MB ->   707 MB  (freed 1,789 MB)

  The difference in the freed amount comes from different initial page
  residency, not from the allocator.

Storage backends (tcmalloc, absolute numbers)

  I moved the VM disk between backends to cover ZFS and
  qcow2-on-directory. Everything worked without issues:

  backend                    seq read   par 4K read
  LVM-thin (raw, NVMe)       17.5 GB/s  670 MB/s
  ZFS (zvol, file-backed)    11.3 GB/s  696 MB/s
  qcow2 on ext4 (directory)  19.6 GB/s  763 MB/s

  ZFS is slower because the pool is file-backed (double indirection),
  not because of the allocator. The block-layer A/B from my earlier mail
  already showed the allocator is neutral across backends, so I didn't
  repeat the full A/B for each one.
vNIC (iperf3 over vmbr0, 20 rounds each)

  This one needed more rounds to get a stable picture; early attempts
  with 5 rounds looked misleadingly noisy.

                        glibc       tcmalloc    delta
  host->guest mean      132.2 Gbps  132.9 Gbps  +0.6%
  host->guest stddev    17.4        20.6
  guest->host mean      69.6 Gbps   72.0 Gbps   +3.5%
  guest->host stddev    17.1        15.0

  Welch's t-test: t=0.12 (h2g), t=0.48 (g2h); neither is significant at
  p=0.05. The stddev (~17 Gbps) dwarfs the mean difference, so there is
  no detectable allocator effect here. The hot path is in vhost-net on
  the kernel side, so this is pretty much what I'd expect.

Overall:

  workload           delta        method
  guest seq I/O       0.0%        same-version A/B
  guest par 4K I/O   -0.3%        same-version A/B (noise)
  qm snapshot        +1.2%        same-version A/B (noise)
  balloon            OK           both deflate correctly
  vNIC               +0.6/+3.5%   20 rounds, not significant
  block layer (all)  0 to +1%     LD_PRELOAD, same binary
  RBD (librbd)       +3.7%        LD_PRELOAD, same binary

No regressions on any tested path. Together with Alexandre's production
fio numbers (+31.6% on 4K randread on a real cluster), I think we have a
more solid picture now. Please let me know if there's anything else I
should look into, or if you'd like me to re-run something with different
parameters!

cheers,
Kefu

>>>>
>>>> Hi Fiona,
>>>>
>>>> I'm still running it in production (I have kept tcmalloc after its
>>>> removal from the pve build some years ago), and I didn't notice any
>>>> problems (but I still don't use pbs).
>>>>
>>>> But I haven't benchmarked with/without it in the last 5-6 years.
>>
>> And thanks, Alexandre, for sharing your production experience; that's
>> very valuable context.
>>
>>>>
>>>> Maybe VM memory should be checked too. I'm thinking about RSS memory
>>>> with balloon free_page_reporting, to see if it's correctly freeing
>>>> memory.
>>
>> For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path is:
>>
>>   virtio_balloon_handle_report()
>>     ram_block_discard_range()
>>       madvise(host_startaddr, length, QEMU_MADV_DONTNEED)
>>
>> This operates directly on the mmap'ed guest RAM region. Still no
>> malloc involvement anywhere in this path.
>>
>> In short, the tests above show a consistent pattern: tcmalloc helps
>> where allocation pressure is high, and is neutral everywhere else.
>> Please let me know if you'd like more data on any specific workload,
>> or if there is anything I overlooked. I appreciate the careful review
>> and insights.
>>
>>
>>>
>>> Thanks! If it has been running fine for you for this long, that is a
>>> very good data point :) Still, testing different
>>> scenarios/configurations would be nice, to rule out a (performance)
>>> regression somewhere else.
>>
>> cheers,
>> Kefu