Date: Thu, 26 Mar 2026 15:48:02 +0800
From: "Kefu Chai" <k.chai@proxmox.com>
To: "Dietmar Maurer", "DERUMIER, Alexandre"
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
References: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
In-Reply-To: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
List-Id: Proxmox VE development discussion

Thanks Dietmar and Alexandre for the replies.

> Such things are usually easy to fix by using a custom pool
> allocator, freeing all allocated objects at once. But I don't know
> the Ceph code, so I am not sure if it is possible to apply such
> pattern to the Ceph code ...

Good question — I went and audited librbd's allocation patterns in the
Ceph squid source to check exactly this.

Short answer: librbd does NOT use a pool allocator. All I/O-path
allocations go directly to the system malloc/new.

Ceph does have a `mempool` system (src/include/mempool.h), but it's
tracking-only — it provides memory accounting and observability, not
actual pooling. The allocate/deallocate paths are just:

    // mempool.h:375
    T* r = reinterpret_cast<T*>(new char[total]);
    // mempool.h:392
    delete[] reinterpret_cast<char*>(p);

And there is no "rbd" or "librbd" mempool defined at all — the existing
pools are for bluestore, osd internals, and buffer management.

The per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
custom allocator, no boost::pool, no arena, no slab. Buffer data (the
actual read/write payloads) uses malloc/posix_memalign, again with
mempool tracking counters but no pooling.
So you're right that a pool allocator would be the proper upstream fix.
But it would be a significant Ceph-side change — the I/O dispatch layer
touches many files across src/librbd/io/ and the
allocation/deallocation is spread across different threads (the
submitter thread allocates, the OSD callback thread deallocates), which
makes arena-per-request tricky.

This is why the system allocator choice has such an outsized impact:
with no application-level pooling, every I/O operation generates a
burst of small allocations that hit the system allocator directly.
tcmalloc and jemalloc handle this pattern well (per-thread caches,
size-class freelists); glibc's ptmalloc2 does not.

> The problem with that is that it can halt/delay the whole application
> while freeing memory?
> We need to make sure that there is no noticeable delay.

For gperftools tcmalloc, `ReleaseFreeMemory()` calls
`ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
`PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
round-robins through the page heap's free span lists (`free_[]` normal
lists and `large_normal_`) and calls `ReleaseSpan()` →
`madvise(MADV_DONTNEED)` on each free span. It does NOT walk allocated
memory or compact the heap — it only touches spans that are already
free in tcmalloc's page cache.

It does hold the page heap spinlock during the walk, so concurrent
allocations on other threads would briefly contend on that lock. That
said, we'd only call this once at backup completion (the same place the
current malloc_trim(4 MiB) call runs), not on any hot path.

For comparison, glibc's malloc_trim iterates over all arenas (locking
each in turn) and trims free space at the top of each heap segment.
Under fragmentation with many arenas, this can be more expensive than
tcmalloc's walk of its free span lists.
> If I remember, it was because I had lower performance than jemalloc
> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
> also the tcmalloc 2.2 at this time. (it was resolved later with
> tcmalloc 2.4)

Thanks for the context and the link! That explains the 8-day turnaround.

I checked the gperftools repo — interestingly, the thread cache default
(kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
(8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
between 2.2 and 2.4 is commit 7da5bd0, which flipped
TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
tcmalloc would hold onto freed pages indefinitely unless explicitly
told to decommit, leading to higher RSS — that, combined with the
2.2-era thread cache behavior, likely explains the poor showing vs
jemalloc. PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so
all of this is long resolved.

So to summarize where the two historical concerns stand today:

1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
   2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
   false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
   gperftools 2.16 — this is a non-issue.

2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
   backup memory not returned to the OS. Addressable by calling the
   allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
   the backup completion path, right where malloc_trim(4 MiB) already
   sits today.

Both blockers are resolved. I think the pragmatic path is:

1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
   build (this also compiles out malloc_trim, which is correct)

2. Add allocator-aware memory release in the backup completion path
   (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)

3. Test the PBS backup memory reclamation to confirm VmRSS comes back
   down after backup

I'll prepare a patch series if this approach sounds reasonable.