From mboxrd@z Thu Jan 1 00:00:00 1970
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Date: Wed, 15 Apr 2026 18:22:53 +0800
Subject: Re: [PATCH v2 pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
From: "Kefu Chai"
To: "Fiona Ebner"
X-Mailer: aerc 0.20.0
References: <20260414054645.405151-1-k.chai@proxmox.com>
List-Id: Proxmox VE development discussion

On Tue Apr 14, 2026 at 11:36 PM CST, Fiona Ebner wrote:
> On 14.04.26 at 1:08 PM, Fiona Ebner wrote:
>> Note that I did play around with memory hotplug and ballooning before as
>> well, not sure if related.
>>
>> Unfortunately, I don't have the debug symbols for librbd.so.1 right now:
>>
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  0x00007ea8da6442d0 in tc_memalign () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
>>> [Current thread is 1 (Thread 0x7ea8ca66a6c0 (LWP 109157))]

Hi Fiona,

Thank you for the backtrace. I dug into the segfault, but was not able
to reproduce it locally even after performing over 3300 snapshot
operations on two RBD drives, including concurrent and batched
operations. I also searched the internet to see whether we are alone,
and here is what I found:

The crash site (SLL_Next in linked_list.h) is a known pattern in
gperftools: it is what happens when a *prior* operation corrupts a
freed block's embedded freelist pointer, and a later allocation follows
the garbage pointer and segfaults. Essentially, tc_memalign() is the
victim, not the culprit. RHBZ #1430223 [1] and gperftools issues #1036
[2] and #1096 [3] all describe the same crash pattern with Ceph.
RHBZ #1494309 [4] is also worth noting -- tcmalloc didn't intercept
aligned_alloc() until gperftools 2.6.1-5, causing a mixed-allocator
situation where glibc allocated but tcmalloc freed. That one has long
been fixed in our 2.16, but it shows that this corner of the allocator
has had real bugs before.

If it happens again, probably the best way to catch the actual
corruption at its source would be:

  LD_PRELOAD=libtcmalloc_debug.so.4 qemu-system-x86_64 ...

This adds guard words around allocations and checks them on free, so it
would point straight at whatever is doing the corrupting write. It
comes with 2-5x overhead, but I guess that's fine for debugging.

If you manage to reproduce it, I am more than happy to debug it with
your reproducer.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1430223
[2] https://github.com/gperftools/gperftools/issues/1036
[3] https://github.com/gperftools/gperftools/issues/1096
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1494309

> I had added malloc_stats(); calls around
> MallocExtension_ReleaseFreeMemory(); to better see the effects, which
> also requires including malloc.h in pve-backup.c when building for
> tcmalloc. I also did a few backups before, so I can't rule out that it's
> related to that. I did a build of librbd1 and librados2 with the debug
> symbols now, but haven't been able to reproduce the issue yet. Will try
> more tomorrow.