From: "Kefu Chai" <k.chai@proxmox.com>
To: "Dietmar Maurer" <dietmar@proxmox.com>,
"DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
<pve-devel@lists.proxmox.com>
Subject: Re: [RFC] Re-enable alternative memory allocator (tcmalloc) for pve-qemu?
Date: Thu, 26 Mar 2026 15:48:02 +0800
Message-ID: <DHCJR18OV951.1UFQTRLNWT77Z@proxmox.com>
In-Reply-To: <03fc2a07-34c8-4029-bbed-f1c764d4562e@proxmox.com>
Thanks Dietmar and Alexandre for the replies.
> Such things are usually easy to fix by using a custom pool
> allocator, freeing all allocated objects at once. But I don't know
> the Ceph code, so I am not sure if it is possible to apply such
> pattern to the Ceph code ...
Good question — I went and audited librbd's allocation patterns in the
Ceph squid source to check exactly this.
Short answer: librbd does NOT use a pool allocator. All I/O-path
allocations go directly to the system malloc/new.
Ceph does have a `mempool` system (src/include/mempool.h), but it's
tracking-only — it provides memory accounting and observability, not
actual pooling. The allocate/deallocate paths are just:
    // mempool.h:375
    T* r = reinterpret_cast<T*>(new char[total]);

    // mempool.h:392
    delete[] reinterpret_cast<char*>(p);
And there is no "rbd" or "librbd" mempool defined at all — the existing
pools are for bluestore, osd internals, and buffer management. The
per-I/O request objects (ObjectReadRequest, ObjectWriteRequest,
ObjectDispatchSpec, etc.) are all allocated via plain `new` with no
custom allocator, no boost::pool, no arena, no slab.
Buffer data (the actual read/write payloads) uses malloc/posix_memalign,
again with mempool tracking counters but no pooling.
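To make the "tracking-only" distinction concrete, here is a minimal
sketch of the pattern (my simplified model, not Ceph's actual code): a
shared counter gives observability, but every request is still forwarded
to the global new/delete, so the system allocator sees each allocation
individually.

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Simplified model of a tracking-only allocator in the spirit of
// Ceph's mempool: a counter for accounting, but allocation still goes
// straight to the system heap -- no pooling, no reuse.
struct TrackedHeap {
    std::atomic<std::size_t> allocated_bytes{0};

    void* allocate(std::size_t n) {
        allocated_bytes += n;
        return ::operator new(n);   // every request hits the system allocator
    }
    void deallocate(void* p, std::size_t n) {
        allocated_bytes -= n;
        ::operator delete(p);       // and every free goes straight back
    }
};
```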
So you're right that a pool allocator would be the proper upstream fix.
But it would be a significant Ceph-side change — the I/O dispatch layer
touches many files across src/librbd/io/ and the allocation/deallocation
is spread across different threads (submitter thread allocates, OSD
callback thread deallocates), which makes arena-per-request tricky.
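For reference, the pattern Dietmar describes would look roughly like
this (a hypothetical sketch, not code from Ceph): each request carries
an arena that hands out chunks by bumping a pointer, and the whole
arena is released in one shot when the request completes. The catch in
librbd is that completion happens on a different thread than
submission, so the arena's lifetime would have to be handed across
threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-request arena: allocations are bump-pointer cheap,
// and everything is freed at once when the request is done.
// (Ignores alignment and growth for brevity.)
class RequestArena {
    std::vector<char> buf_;
    std::size_t used_ = 0;
public:
    explicit RequestArena(std::size_t capacity) : buf_(capacity) {}

    void* alloc(std::size_t n) {
        if (used_ + n > buf_.size()) return nullptr;
        void* p = buf_.data() + used_;
        used_ += n;
        return p;
    }
    std::size_t used() const { return used_; }

    // One bulk release instead of many individual frees. In librbd
    // this would have to run on the OSD callback thread, not the
    // submitter thread that created the arena.
    void release_all() { used_ = 0; }
};
```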
This is why the system allocator choice has such an outsized impact:
with no application-level pooling, every I/O operation generates a
burst of small allocations that hit the system allocator directly.
tcmalloc and jemalloc handle this pattern well (per-thread caches,
size-class freelists); glibc's ptmalloc2 does not.
> The problem with that is that it can halt/delay the whole application
> while freeing memory?
> We need to make sure that there is no noticeable delay.
For gperftools tcmalloc, `ReleaseFreeMemory()` calls
`ReleaseToSystem(SIZE_T_MAX)` (malloc_extension.cc:127), which calls
`PageHeap::ReleaseAtLeastNPages()` (page_heap.cc:599). That function
round-robins through the page heap's free span lists (`free_[]` normal
lists and `large_normal_`) and calls `ReleaseSpan()` → `madvise(MADV_DONTNEED)`
on each free span. It does NOT walk allocated memory or compact the heap
— it only touches spans that are already free in tcmalloc's page cache.
It does hold the page heap spinlock during the walk, so concurrent
allocations on other threads would briefly contend on that lock.
That said, we'd only call this once at backup completion (same place the
current malloc_trim(4 MiB) runs), not on any hot path. For comparison,
glibc's malloc_trim iterates all arenas (locking each in turn),
returning the top of each heap segment to the kernel and, since glibc
2.8, also madvise()ing away whole free pages inside the arenas. Under
fragmentation with many arenas, this can be more expensive than
tcmalloc's walk of its free span lists.
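As a sketch of how such a completion hook could stay allocator-agnostic
(the function name and shape here are my assumption, not existing
pve-qemu code): gperftools exports a C entry point,
MallocExtension_ReleaseFreeMemory() (gperftools/malloc_extension_c.h),
which can be probed at runtime, with glibc's malloc_trim() as the
fallback when tcmalloc isn't linked in.

```cpp
#include <dlfcn.h>
#include <malloc.h>   // glibc malloc_trim()

// Hypothetical backup-completion hook: prefer tcmalloc's release API
// if its symbol is present in the process, otherwise fall back to
// glibc. Returns which path was taken, for logging.
const char* release_heap_to_os() {
    using release_fn = void (*)();
    auto release = reinterpret_cast<release_fn>(
        dlsym(RTLD_DEFAULT, "MallocExtension_ReleaseFreeMemory"));
    if (release) {
        release();      // madvise(MADV_DONTNEED) on tcmalloc's free spans
        return "tcmalloc";
    }
    malloc_trim(0);     // glibc: trim arena tops, release free pages
    return "glibc";
}
```

In the actual pve-qemu build we would more likely select the path at
compile time (the build already knows which allocator it links), but the
runtime probe shows the idea without a hard dependency.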
> If I remember, it was because I had lower performance than jemalloc
> because of too low default TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES &&
> also the tcmalloc 2.2 at this time. (it was resolved later with
> tcmalloc 2.4)
Thanks for the context and the link! That explains the 8-day turnaround.
I checked the gperftools repo — interestingly, the thread cache default
(kDefaultOverallThreadCacheSize) is the same 32 MB in both 2.2 and 2.4
(8 × kMaxThreadCacheSize = 8 × 4 MB; src/common.h). What did change
between 2.2 and 2.4 is commit 7da5bd0 which flipped
TCMALLOC_AGGRESSIVE_DECOMMIT from false to true by default. In 2.2,
tcmalloc would hold onto freed pages indefinitely unless explicitly
told to decommit, leading to higher RSS — that combined with the 2.2-era
thread cache behavior likely explains the poor showing vs jemalloc.
PVE 9 ships gperftools 2.16 (via libgoogle-perftools4t64), so all of
this is long resolved.
So to summarize where the two historical concerns stand today:
1. 2015 tcmalloc removal (Alexandre's point): caused by gperftools
2.2-era issues — notably TCMALLOC_AGGRESSIVE_DECOMMIT defaulting to
false (commit 7da5bd0 flipped it to true in 2.4). PVE 9 ships
gperftools 2.16 — this is a non-issue.
2. 2020 jemalloc removal (Stefan Reiter's commit 3d785ea): post-PBS-
backup memory not returned to OS. Addressable by calling the
allocator-specific release API (ReleaseFreeMemory for tcmalloc) in
the backup completion path, right where malloc_trim(4 MiB) already
sits today.
Both blockers are resolved. I think the pragmatic path is:
1. Re-enable tcmalloc via --enable-malloc=tcmalloc in the pve-qemu
build (this also compiles out malloc_trim, which is correct)
2. Add allocator-aware memory release in the backup completion path
(ReleaseFreeMemory for tcmalloc, arena purge for jemalloc)
3. Test the PBS backup memory reclamation to confirm VmRSS comes back
down after backup
I'll prepare a patch series if this approach sounds reasonable.
Thread overview: 4+ messages
2026-03-26 2:54 Kefu Chai
2026-03-26 6:09 ` DERUMIER, Alexandre
2026-03-26 6:16 ` Dietmar Maurer
2026-03-26 7:48 ` Kefu Chai [this message]