From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
From: Fiona Ebner <f.ebner@proxmox.com>
To: Kefu Chai <k.chai@proxmox.com>, pve-devel@lists.proxmox.com
Date: Fri, 10 Apr 2026 10:12:45 +0200
In-Reply-To: <20260410043027.3621673-1-k.chai@proxmox.com>
References: <20260410043027.3621673-1-k.chai@proxmox.com>
List-Id: Proxmox VE development discussion
Content-Type: text/plain; charset=UTF-8

On 10.04.26 at 06:30, Kefu Chai wrote:
> Following up on the RFC thread [0], here's the formal submission to
> re-enable tcmalloc for pve-qemu.
>
> Quick recap: librbd's I/O path allocates a lot of small, short-lived
> objects with plain new/malloc (ObjectReadRequest, bufferlist, etc.),
> and glibc's ptmalloc2 handles this pattern poorly -- cross-thread
> arena contention and cache-line bouncing show up clearly in perf
> profiles. tcmalloc's per-thread fast path avoids both.
>
> A bit of history for context: tcmalloc was tried in 2015 but dropped
> after 8 days due to gperftools 2.2 tuning issues (fixed in 2.4+).
> jemalloc replaced it but was dropped in 2020 because it didn't
> release Rust-allocated memory (from proxmox-backup-qemu) back to the
> OS. PVE 9 ships gperftools 2.16, and patch 1/2 addresses the
> reclamation gap explicitly.
>
> On Dietmar's two concerns from the RFC:
>
> "Could ReleaseFreeMemory() halt the application?" -- No, and I
> verified this directly. It walks tcmalloc's page heap free span
> lists and calls madvise(MADV_DONTNEED) on each span. It does not
> walk allocated memory or compact the heap. A standalone test
> reclaimed 386 MB of 410 MB cached memory (94%) in effectively zero
> wall time. The call runs once at backup completion, in the same
> spot where malloc_trim runs today.
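>
> For illustration only -- this is a sketch, not the literal patch 1/2;
> it assumes gperftools' C shim (<gperftools/malloc_extension_c.h>) and
> its "tcmalloc.pageheap_free_bytes" stat, and the helper name is made
> up -- the hook at backup completion boils down to:
>
>     #include <stdio.h>
>     #include <gperftools/malloc_extension_c.h>
>
>     /* Hypothetical helper: hand tcmalloc's cached free spans back to
>      * the kernel, in the same spot where malloc_trim() runs today. */
>     static void qemu_tcmalloc_release(void)
>     {
>         size_t before = 0, after = 0;
>
>         /* bytes parked in the page heap free lists */
>         MallocExtension_GetNumericProperty("tcmalloc.pageheap_free_bytes",
>                                            &before);
>
>         /* madvise(MADV_DONTNEED) each free span; allocated memory is
>          * never touched, so in-flight I/O is not stalled */
>         MallocExtension_ReleaseFreeMemory();
>
>         MallocExtension_GetNumericProperty("tcmalloc.pageheap_free_bytes",
>                                            &after);
>         fprintf(stderr, "tcmalloc released %zu bytes\n", before - after);
>     }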
>
> "Wouldn't a pool allocator in librbd be the proper fix?" -- In
> principle yes, but I audited librbd in Ceph squid and it does NOT
> use a pool allocator -- all I/O path objects go through plain new.
> Ceph's mempool is tracking-only, not actual pooling. Adding real
> pooling would be a significant Ceph-side change (submission and
> completion happen on different threads), and it's orthogonal to the
> allocator choice here.
>
> Also thanks to Alexandre for confirming the 2015 gperftools issues
> are long resolved.
>
> Test results
> ------------
>
> Benchmarked on a local vstart Ceph cluster (3 OSDs on local NVMe).
> This is the worst case for showing allocator impact, since there's
> no network latency for CPU savings to amortize against:
>
>     rbd bench --io-type read --io-size 4096 --io-threads 16 \
>         --io-pattern rand
>
> Metric         | glibc ptmalloc2 | tcmalloc  | Delta
> ---------------+-----------------+-----------+--------
> IOPS           | 131,201         | 136,389   | +4.0%
> CPU time       | 1,556 ms        | 1,439 ms  | -7.5%
> Cycles         | 6.74B           | 6.06B     | -10.1%
> Cache misses   | 137.1M          | 123.9M    | -9.6%
>
> perf report on the glibc run shows ~8% of CPU in allocator internals
> (_int_malloc, cfree, malloc_consolidate, _int_free_*); the same
> symbols are barely visible with tcmalloc because the fast path is
> just a pointer bump. The Ceph blog [1] reports ~50% IOPS gain on
> production clusters where network RTT dominates per-I/O latency --
> the 10% CPU savings compound there since the host can push more I/O
> into the pipeline during the same wall time.

How does the performance change when doing IO within a QEMU guest?

How does this affect the performance for other storage types, like ZFS,
qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-thin,
etc., and for other workloads like saving VM state during snapshot, the
state transfer during migration, maybe memory hotplug/ballooning, and
network performance for vNICs?
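
For the in-guest case, I mean something along the lines of a 4k random
read run with fio against a disk on the RBD storage, e.g. (device path
and parameters are just an example):

    fio --name=randread --rw=randread --bs=4k --iodepth=16 \
        --ioengine=libaio --direct=1 --filename=/dev/vdb \
        --runtime=60 --time_based

so the numbers are comparable to the rbd bench results above.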