From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 17 Apr 2026 14:33:18 +0200
From: Fiona Ebner
Subject: Re: [PATCH v2 pve-qemu 0/2] Re-enable tcmalloc as the memory allocator
To: Kefu Chai, pve-devel@lists.proxmox.com
References: <20260414054645.405151-1-k.chai@proxmox.com>
Content-Type: text/plain; charset=UTF-8
List-Id: Proxmox VE development discussion

On 15.04.26 12:21 PM, Kefu Chai wrote:
> On Tue Apr 14, 2026 at 11:36 PM CST, Fiona Ebner wrote:
>> On 14.04.26 1:08 PM, Fiona Ebner wrote:
>>> Note that I did play around with memory hotplug and ballooning before
>>> as well, not sure if related.
>>>
>>> Unfortunately, I don't have the debug symbols for librbd.so.1 right now:
>>>
>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>> #0  0x00007ea8da6442d0 in tc_memalign () from /lib/x86_64-linux-gnu/libtcmalloc.so.4
>>>> [Current thread is 1 (Thread 0x7ea8ca66a6c0 (LWP 109157))]
>
> Hi Fiona,
>
> Thank you for the backtrace.
>
> I dug into the segfault, but was not able to reproduce it locally after
> performing over 3300 snapshot ops on two RBD drives, including
> concurrent and batch ops.
>
> I also searched the internet to see whether we are alone. Here is what
> I found:
>
> The crash site (SLL_Next in linked_list.h) is a known pattern in

How do you know that the crash site is there? The trace only shows
tc_memalign(). Judging from the past issues you found, it could be, but
I wouldn't jump to conclusions.

> gperftools. It's what happens when a *prior* operation corrupts a freed
> block's embedded freelist pointer, and a later allocation follows the
> garbage and segfaults. Essentially, tc_memalign() is the victim, not the
> culprit.
> RHBZ #1430223 [1] and gperftools issues #1036 [2] and #1096 [3]
> all describe the same crash pattern with Ceph. RHBZ #1494309 [4] is also
> worth noting -- tcmalloc didn't intercept aligned_alloc() until
> gperftools 2.6.1-5, causing a mixed-allocator situation where glibc
> allocated but tcmalloc freed. That one has long been fixed in our 2.16,
> but it shows this corner of the allocator has had real bugs before.
>
> If it happens again, probably the way to catch the actual corruption
> at its source would be:
>
> LD_PRELOAD=libtcmalloc_debug.so.4 qemu-system-x86_64 ...

This doesn't work, unfortunately:

LD_PRELOAD=libtcmalloc_debug.so.4 /usr/bin/kvm \

[I] root@pve9a1 ~# ~/start-vm.sh
Check failed: !internal_init_start_has_run: Heap-check constructor called
twice. Perhaps you both linked in the heap checker, and also used
LD_PRELOAD to load it?
Aborted (core dumped)

I also tried

ln -s libtcmalloc_debug.so.4 libtcmalloc.so.4

but then my VMs wouldn't start, even with 'qm start ID --timeout 900'.
Not sure if it was just too slow or another issue.

> This adds guard words around allocations and checks them on free,
> so it'd point straight at whatever is doing the corrupting write.
> This comes with 2-5x overhead, but I guess that's fine for debugging.
>
> If you manage to reproduce it, I am more than happy to debug it with
> your reproducer.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1430223
> [2] https://github.com/gperftools/gperftools/issues/1036
> [3] https://github.com/gperftools/gperftools/issues/1096
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1494309
>
>> I had added malloc_stats(); calls around
>> MallocExtension_ReleaseFreeMemory(); to better see the effects, which
>> also requires including malloc.h in pve-backup.c when building for
>> tcmalloc. I also did a few backups before, so I can't rule out that it's
>> related to that. I did a build of librbd1 and librados2 with the debug
>> symbols now, but haven't been able to reproduce the issue yet.
>> Will try more tomorrow.

I was sick for 2 days, so I only got around to doing more testing today.
I have not been able to trigger any segfaults since that initial one, so
let's hope this was a one-off issue and/or caused by my additional
modification with malloc_stats().