public inbox for pve-devel@lists.proxmox.com
From: "Kefu Chai" <k.chai@proxmox.com>
To: "Fiona Ebner" <f.ebner@proxmox.com>,
	"DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>,
	"pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
Date: Mon, 13 Apr 2026 19:13:12 +0800	[thread overview]
Message-ID: <DHRZDXD9PXPG.1C6HPUMULADZT@proxmox.com> (raw)
In-Reply-To: <9db86c5e-d382-4eed-a1fd-905e44a259e1@proxmox.com>

On Mon Apr 13, 2026 at 4:14 PM CST, Fiona Ebner wrote:
> Am 10.04.26 um 12:44 PM schrieb DERUMIER, Alexandre:
>> 
>>>> How does the performance change when doing IO within a QEMU guest?
>>>>
>>>> How does this affect the performance for other storage types, like
>>>> ZFS,
>>>> qcow2 on top of directory-based storages, qcow2 on top of LVM, LVM-
>>>> thin,
>>>> etc. and other workloads like saving VM state during snapshot,
>>>> transfer
>>>> during migration, maybe memory hotplug/ballooning, network
>>>> performance
>>>> for vNICs?

Hi Fiona,

Thanks for the questions.

I traced QEMU's source code. It turns out that guest RAM is allocated
via direct mmap() calls, which completely bypass QEMU's C library
allocator. The path looks like:

-m 4G on command line:

memory_region_init_ram_flags_nomigrate()   
  qemu_ram_alloc()
    qemu_ram_alloc_internal()
      g_malloc0(sizeof(*new_block))  <-- only the RAMBlock metadata
                                         struct, about 512 bytes
      ram_block_add()
        qemu_anon_ram_alloc()
          qemu_ram_mmap(-1, size, ...)
            mmap_reserve(total)
              mmap(0, size, PROT_NONE, ...)  <-- reserve address space
            mmap_activate(ptr, size, ...)
              mmap(ptr, size, PROT_RW, ...)  <-- actual guest RAM

So the gigabytes of guest memory go straight to the kernel via mmap()
and never touch malloc or tcmalloc. Only the small RAMBlock metadata
structure (~512 bytes per region) goes through g_malloc0().

In other words, tcmalloc's scope is limited to QEMU's own working
memory, such as block layer buffers, coroutine stacks, and internal
data structures. It does not touch guest RAM at all.

The inflate/deflate path of the balloon code does involve glib malloc:
virtio_balloon_pbp_alloc() calls bitmap_new(), which uses g_new0(), to
track partially-ballooned pages. But these bitmaps track only metadata,
so their memory footprint is small, and they are only allocated during
infrequent balloon operations.

Hopefully this addresses the balloon concern. If I've missed something
or misread the code, please point it out.

When it comes to different storage types and workloads, I ran benchmarks
covering some of the scenarios you listed. All comparisons use the same
binary (pve-qemu-kvm 10.1.2-7) with and without LD_PRELOAD=libtcmalloc.so.4,
so the allocator is the only variable.


Storage backends (block layer, via qemu-img bench)

  4K reads, depth=32, io_uring, cache=none, 5M ops per run, best
  of 3. Host NVMe (ext4) or LVM-thin backed by NVMe.

  backend              glibc ops/s   tcmalloc ops/s   delta
  qcow2 on ext4        1,188,495     1,189,343        +0.1%
  qcow2 on ext4 write  1,036,914     1,036,699         0.0%
  raw on ext4          1,263,583     1,277,465        +1.1%
  raw on LVM-thin        433,727       433,576         0.0%

  Raw/LVM-thin is slower because, I *think*, it actually hits the
  dm-thin layer rather than the page cache. The allocator delta is
  noise in all cases.

  RBD tested via a local vstart cluster (3 OSDs, bluestore on NVMe):

  path                        glibc       tcmalloc    delta
  qemu-img bench + librbd     19,111      19,156      +0.2%
  rbd bench (librbd direct)   35,329      36,622      +3.7%

  I don't have ZFS configured on this (host) machine, but QEMU's file
  I/O path to ZFS goes through the same code as for ext4, so any
  difference would come from the kernel side. I'd expect the same null
  result, though I could be wrong and am happy to set up a test pool if
  you'd like to see the numbers.

Guest I/O (Debian 13 guest, virtio-blk)

  Ran dd inside a Debian 13 cloud image guest (4 vCPU, 2 GB RAM,
  qcow2 on ext4, cache=none, aio=native) against a second virtio
  disk (/dev/vdb, 8 GB qcow2).

  workload                              glibc     tcmalloc   delta
  dd if=/dev/vdb bs=4M count=1024       15.3 GB/s 15.5 GB/s  +1.3%
    iflag=direct (sequential)
  8x parallel dd if=/dev/vdb bs=4k      787 MB/s  870 MB/s  +10.5%
    count=100k iflag=direct
    (each starting at a different offset)

Migration and savevm

  Tested with a 4 GB guest where about 2 GB of RAM was dirtied
  (filled /dev/shm with urandom, 8x256 MB) before triggering each
  operation.

  scenario                glibc      tcmalloc   delta
  migrate (exec: URI)     0.622 s    0.622 s    0.0%
  savevm (qcow2 snap)     0.503 s    0.504 s    +0.2%

What I didn't measure

  I didn't test ZFS or vNIC throughput, as the host-side path lives in
  the kernel while QEMU handles only the control plane. Hence I'd
  expect very little allocator impact there, though I could be wrong.

  I also didn't test memory hotplug or ballooning cases because, as
  explained above, these are rare one-shot operations, and the source
  code trace shows that malloc is not involved. But again, I'm happy to
  look into it if it's still a concern.

>> 
>> Hi Fiona,
>> 
>> I'm still running it in production (I have kept tcmalloc after the
>> removal some years ago from the pve build), and I didn't notice
>> problems. (but I still don't use pbs).
>> 
>> But I have never benchmarked with/without it in the last 5-6 years.

And thanks Alexandre for sharing your production experience, that's very
valuable context.

>> 
>> Maybe vm memory should be checked too, I'm thinking about RSS memory
>> with balloon free_page_reporting, to see if it's correctly freeing
>> memory.

For free_page_reporting, the VIRTIO_BALLOON_F_REPORTING path:

virtio_balloon_handle_report()
  ram_block_discard_range()
    madvise(host_startaddr, length, QEMU_MADV_DONTNEED)

This operates directly on the mmap'ed guest RAM region. Still no malloc
involvement anywhere in this path.

In short, the tests above reveal a consistent pattern: tcmalloc helps
where allocation pressure is high, and is neutral everywhere else.
Please let me know if you'd like more data on any specific workload, or
if there is anything I overlooked. I appreciate the careful review and
insights.


>
> Thanks! If it was running fine for you for this long, that is a very
> good data point :) Still, testing different scenarios/configurations
> would be nice to rule out that there is a (performance) regression
> somewhere else.

cheers,
Kefu





Thread overview: 16+ messages
2026-04-10  4:30 Kefu Chai
2026-04-10  4:30 ` [PATCH pve-qemu 1/2] PVE: use " Kefu Chai
2026-04-13 13:12   ` Fiona Ebner
2026-04-10  4:30 ` [PATCH pve-qemu 2/2] d/rules: enable " Kefu Chai
2026-04-13 13:13   ` Fiona Ebner
2026-04-13 13:23     ` DERUMIER, Alexandre
2026-04-13 13:40       ` Fabian Grünbichler
2026-04-14  8:10         ` Fiona Ebner
2026-04-10  8:12 ` [PATCH pve-qemu 0/2] Re-enable " Fiona Ebner
2026-04-10 10:45   ` DERUMIER, Alexandre
2026-04-13  8:14     ` Fiona Ebner
2026-04-13 11:13       ` Kefu Chai [this message]
2026-04-13 13:14         ` DERUMIER, Alexandre
2026-04-13 13:18         ` Fiona Ebner
2026-04-14  5:41           ` Kefu Chai
2026-04-14  5:48 ` superseded: " Kefu Chai
