From: "Kefu Chai" <k.chai@proxmox.com>
To: "Kefu Chai" <k.chai@proxmox.com>,
	"Dietmar Maurer" <dietmar@proxmox.com>,
	"DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
	<pve-devel@lists.proxmox.com>
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Fri, 10 Apr 2026 12:33:27 +0800	[thread overview]
Message-ID: <DHP707TCJLG8.XKRTTK452S55@proxmox.com> (raw)
In-Reply-To: <DHCJR18OV951.1UFQTRLNWT77Z@proxmox.com>

patchset sent at https://lore.proxmox.com/pve-devel/20260410043027.3621673-1-k.chai@proxmox.com/T/#t

On Thu Mar 26, 2026 at 3:48 PM CST, Kefu Chai wrote:
> Thanks Dietmar and Alexandre for the replies.
>
>> Such things are usually easy to fix by using a custom pool
>> allocator, freeing all allocated objects at once. But I don't know
>> the Ceph code, so I am not sure if it is possible to apply such
>> pattern to the Ceph code ...
>
> Good question — I went and audited librbd's allocation patterns in the
> Ceph squid source to check exactly this.
>
> Short answer: librbd does NOT use a pool allocator. All I/O-path
> allocations go directly to the system malloc/new.
>
> Ceph does have a `mempool` system (src/include/mempool.h), but it's
> tracking-only — it provides memory accounting and observability, not
> actual pooling. The allocate/deallocate paths are just:
>
>     // mempool.h:375
>     T* r = reinterpret_cast<T*>(new char[total]);
>     // mempool.h:392
>     delete[] reinterpret_cast<char*>(p);
>
> And there is no "rbd" or "librbd" mempool defined at all — the existing
> pools are for bluestore, osd internals, and buffer management. The
> per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
> ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
> custom allocator, no boost::pool, no arena, no slab.
>
> Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
> again with mempool tracking counters but no pooling.
>
> So you're right that a pool allocator would be the proper upstream fix.
> But it would be a significant Ceph-side change — the I/O dispatch layer
> touches many files across src/librbd/io/ and the allocation/deallocation
> is spread across different threads (submitter thread allocates, OSD
> callback thread deallocates), which makes arena-per-request tricky.
>
> This is why the system allocator choice has such an outsized impact:
> with no application-level pooling, every I/O operation generates a
> burst of small allocations that hit the system allocator directly.
> tcmalloc and jemalloc handle this pattern well (per-thread caches,
> size-class freelists); glibc's ptmalloc2 does not.
>
>> The problem with that is that it can halt/delay the whole application
>> while freeing memory?
>> We need to make sure that there is no noticeable delay.
>
> For gperftools tcmalloc, `ReleaseFreeMemory()` calls
> `ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
> `PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
> round-robins through the page heap's free span lists (`free_[]` normal
> lists and `large_normal_`) and calls `ReleaseSpan()` → `madvise(MADV_DONTNEED)`
> on each free span. It does NOT walk allocated memory or compact the heap
> — it only touches spans that are already free in tcmalloc's page cache.
> It does hold the page heap spinlock during the walk, so concurrent
> allocations on other threads would briefly contend on that lock.
>
> That said, we'd only call this once at backup completion (same place the
> current malloc_trim(4 MiB) runs), not on any hot path. For comparison,
> glibc's malloc_trim iterates all arenas (locking each in turn) and
> trims free space at the top of each heap segment. Under fragmentation
> with many arenas, this can be more expensive than tcmalloc's walk of
> its free span lists.
>
>> If I remember, it was because I had lower performance than jemalloc
>> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
>> also the tcmalloc 2.2 at this time. (it was resolved later with
>> tcmalloc 2.4)
>
> Thanks for the context and the link! That explains the 8-day turnaround.
>
> I checked the gperftools repo — interestingly, the thread cache default
> (kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
> (8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
> between 2.2 and 2.4 is commit 7da5bd0 which flipped
> TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
> tcmalloc would hold onto freed pages indefinitely unless explicitly
> told to decommit, leading to higher RSS — that combined with the 2.2-era
> thread cache behavior likely explains the poor showing vs jemalloc.
>
> PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
> this is long resolved.
>
> So to summarize where the two historical concerns stand today:
>
> 1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
>    2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
>    false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
>    gperftools 2.16 — this is a non-issue.
>
> 2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
>    backup memory not returned to OS. Addressable by calling the
>    allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
>    the backup completion path, right where malloc_trim(4 MiB) already
>    sits today.
>
> Both blockers are resolved. I think the pragmatic path is:
>
> 1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
>    build (this also compiles out malloc_trim, which is correct)
> 2. Add allocator-aware memory release in the backup completion path
>    (ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
> 3. Test the PBS backup memory reclamation to confirm VmRSS comes back
>    down after backup
>
> I'll prepare a patch series if this approach sounds reasonable.

Thread overview: 5+ messages
2026-03-26  2:54 Kefu Chai
2026-03-26  6:09 ` DERUMIER, Alexandre
2026-03-26  6:16 ` Dietmar Maurer
2026-03-26  7:48   ` Kefu Chai
2026-04-10  4:33     ` Kefu Chai [this message]
