public inbox for pve-devel@lists.proxmox.com
From: "Kefu Chai" <k.chai@proxmox.com>
To: "Kefu Chai" <k.chai@proxmox.com>,
	"Dietmar Maurer" <dietmar@proxmox.com>,
	"DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
	<pve-devel@lists.proxmox.com>
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Fri, 10 Apr 2026 12:33:27 +0800	[thread overview]
Message-ID: <DHP707TCJLG8.XKRTTK452S55@proxmox.com> (raw)
In-Reply-To: <DHCJR18OV951.1UFQTRLNWT77Z@proxmox.com>

patchset sent at https://lore.proxmox.com/pve-devel/20260410043027.3621673-1-k.chai@proxmox.com/T/#t

On Thu Mar 26, 2026 at 3:48 PM CST, Kefu Chai wrote:
> Thanks Dietmar and Alexandre for the replies.
>
>> Such things are usually easy to fix by using a custom pool
>> allocator, freeing all allocated objects at once. But I don't know
>> the Ceph code, so I am not sure if it is possible to apply such
>> pattern to the Ceph code ...
>
> Good question — I went and audited librbd's allocation patterns in the
> Ceph squid source to check exactly this.
>
> Short answer: librbd does NOT use a pool allocator. All I/O-path
> allocations go directly to the system malloc/new.
>
> Ceph does have a `mempool` system (src/include/mempool.h), but it's
> tracking-only — it provides memory accounting and observability, not
> actual pooling. The allocate/deallocate paths are just:
>
>     // mempool.h:375
>     T* r = reinterpret_cast<T*>(new char[total]);
>     // mempool.h:392
>     delete[] reinterpret_cast<char*>(p);
>
> And there is no "rbd" or "librbd" mempool defined at all — the existing
> pools are for bluestore, osd internals, and buffer management. The
> per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
> ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
> custom allocator, no boost::pool, no arena, no slab.
>
> Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
> again with mempool tracking counters but no pooling.
>
> So you're right that a pool allocator would be the proper upstream fix.
> But it would be a significant Ceph-side change — the I/O dispatch layer
> touches many files across src/librbd/io/ and the allocation/deallocation
> is spread across different threads (submitter thread allocates, OSD
> callback thread deallocates), which makes arena-per-request tricky.
>
> This is why the system allocator choice has such an outsized impact:
> with no application-level pooling, every I/O operation generates a
> burst of small allocations that hit the system allocator directly.
> tcmalloc and jemalloc handle this pattern well (per-thread caches,
> size-class freelists); glibc's ptmalloc2 does not.
>
>> The problem with that is that it can halt/delay the whole application
>> while freeing memory?
>> We need to make sure that there is no noticeable delay.
>
> For gperftools tcmalloc, `ReleaseFreeMemory()` calls
> `ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
> `PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
> round-robins through the page heap's free span lists (`free_[]` normal
> lists and `large_normal_`) and calls `ReleaseSpan()` → `madvise(MADV_DONTNEED)`
> on each free span. It does NOT walk allocated memory or compact the heap
> — it only touches spans that are already free in tcmalloc's page cache.
> It does hold the page heap spinlock during the walk, so concurrent
> allocations on other threads would briefly contend on that lock.
>
> That said, we'd only call this once at backup completion (same place the
> current malloc_trim(4 MiB) runs), not on any hot path. For comparison,
> glibc's malloc_trim iterates all arenas (locking each in turn) and
> trims free space at the top of each heap segment. Under fragmentation
> with many arenas, this can be more expensive than tcmalloc's walk of
> its free span lists.
>
>> If I remember, it was because I had lower performance than jemalloc
>> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
>> also the tcmalloc 2.2 at this time. (it was resolved later with
>> tcmalloc 2.4)
>
> Thanks for the context and the link! That explains the 8-day turnaround.
>
> I checked the gperftools repo — interestingly, the thread cache default
> (kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
> (8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
> between 2.2 and 2.4 is commit 7da5bd0, which flipped
> TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
> tcmalloc would hold onto freed pages indefinitely unless explicitly
> told to decommit, leading to higher RSS — that combined with the 2.2-era
> thread cache behavior likely explains the poor showing vs jemalloc.
>
> PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
> this is long resolved.
>
> So to summarize where the two historical concerns stand today:
>
> 1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
>    2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
>    false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
>    gperftools 2.16 — this is a non-issue.
>
> 2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
>    backup memory not returned to OS. Addressable by calling the
>    allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
>    the backup completion path, right where malloc_trim(4 MiB) already
>    sits today.
>
> Both blockers are resolved. I think the pragmatic path is:
>
> 1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
>    build (this also compiles out malloc_trim, which is correct)
> 2. Add allocator-aware memory release in the backup completion path
>    (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
> 3. Test the PBS backup memory reclamation to confirm VmRSS comes back
>    down after backup
>
> I'll prepare a patch series if this approach sounds reasonable.





Thread overview: 5+ messages
2026-03-26  2:54 Kefu Chai
2026-03-26  6:09 ` DERUMIER, Alexandre
2026-03-26  6:16 ` Dietmar Maurer
2026-03-26  7:48   ` Kefu Chai
2026-04-10  4:33     ` Kefu Chai [this message]
