From: "Max R. Carrara" <m.carrara@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Tue, 9 Sep 2025 19:05:10 +0200
Message-ID: <20250909170515.606422-1-m.carrara@proxmox.com>
Subject: [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults

Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
Debian Trixie if a client requests the removal of an RBD image from
the RBD trash (#6635 [0]).

After a lot of investigation, the cause of this still isn't clear to
me; the most likely culprit is a set of internal changes to Python
sub-interpreters made between Python versions 3.12 and 3.13.

What leads me to this conclusion is the following:

 1. A user on our forum noted [1] that the issue disappeared as soon
    as they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm
    ships Python 3.11, which predates any substantial changes to
    sub-interpreters [2][3].

 2. There is an upstream issue [4] regarding another segfault during
    MGR startup. The author concluded that this problem is related to
    sub-interpreters and opened another issue [5] on Python's issue
    tracker that goes into more detail.

    Even though this concerns a completely different code path, it
    shows that issues related to sub-interpreters are popping up
    elsewhere, at the very least.
 3. The segfault happens *inside* the Python interpreter:

    #0  0x000078e04d89e95c __pthread_kill_implementation (libc.so.6 + 0x9495c)
    #1  0x000078e04d849cc2 __GI_raise (libc.so.6 + 0x3fcc2)
    #2  0x00005ab95de92658 reraise_fatal (/usr/bin/ceph-mgr + 0x32d658)
    #3  0x000078e04d849df0 __restore_rt (libc.so.6 + 0x3fdf0)
    #4  0x000078e04ef598b0 _Py_dict_lookup (libpython3.13.so.1.0 + 0x1598b0)
    #5  0x000078e04efa1843 _PyDict_GetItemRef_KnownHash (libpython3.13.so.1.0 + 0x1a1843)
    #6  0x000078e04efa1af5 _PyType_LookupRef (libpython3.13.so.1.0 + 0x1a1af5)
    #7  0x000078e04efa216b _Py_type_getattro_impl (libpython3.13.so.1.0 + 0x1a216b)
    #8  0x000078e04ef6f60d PyObject_GetAttr (libpython3.13.so.1.0 + 0x16f60d)
    #9  0x000078e04f043f20 _PyEval_EvalFrameDefault (libpython3.13.so.1.0 + 0x243f20)
    #10 0x000078e04ef109dd _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x1109dd)
    #11 0x000078e04f1d3442 _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x3d3442)
    #12 0x000078e03b74ffed __pyx_f_3rbd_progress_callback (rbd.cpython-313-x86_64-linux-gnu.so + 0xacfed)
    #13 0x000078e03afcc8af _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE13start_next_opEv (librbd.so.1 + 0x3cc8af)
    #14 0x000078e03afccfed _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE9start_opsEm (librbd.so.1 + 0x3ccfed)
    #15 0x000078e03afafec6 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_remove_objectsEv (librbd.so.1 + 0x3afec6)
    #16 0x000078e03afb0560 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_copyup_objectsEv (librbd.so.1 + 0x3b0560)
    #17 0x000078e03afb2e16 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE15should_completeEi (librbd.so.1 + 0x3b2e16)
    #18 0x000078e03afae379 _ZN6librbd12AsyncRequestINS_8ImageCtxEE8completeEi (librbd.so.1 + 0x3ae379)
    #19 0x000078e03ada8c70 _ZN7Context8completeEi (librbd.so.1 + 0x1a8c70)
    #20 0x000078e03afcdb1e _ZN7Context8completeEi (librbd.so.1 + 0x3cdb1e)
    #21 0x000078e04d6e4716 _ZN8librados14CB_AioCompleteclEv (librados.so.2 + 0xd2716)
    #22 0x000078e04d6e5705 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xd3705)
    #23 0x000078e04d6e5f8a _ZN5boost4asio19asio_handler_invokeINS0_6detail23strand_executor_service7invokerIKNS0_10io_context19basic_executor_typeISaIvELm0EEEvEEEEvRT_z (librados.so.2 + 0xd3f8a)
    #24 0x000078e04d6fc598 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xea598)
    #25 0x000078e04d6e9a71 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (librados.so.2 + 0xd7a71)
    #26 0x000078e04d6fff63 _ZN5boost4asio10io_context3runEv (librados.so.2 + 0xedf63)
    #27 0x000078e04dae1224 n/a (libstdc++.so.6 + 0xe1224)
    #28 0x000078e04d89cb7b start_thread (libc.so.6 + 0x92b7b)
    #29 0x000078e04d91a7b8 __clone3 (libc.so.6 + 0x1107b8)

    Note that in #12, a "progress callback" is being called by
    librbd. This callback is a plain Python function that is passed
    down via Ceph's Python/C++ bindings for librbd [6]. (I'd provide
    the stack traces of the other threads here as well, but they're
    rather massive.)

    From #11 down to #4, execution happens entirely within the Python
    interpreter: this is just the callback being executed. The
    segfault occurs at #4 in _Py_dict_lookup(), a private function
    inside the Python interpreter that looks up a key in a `dict`
    [7]. A function this fundamental should never fail; yet it does,
    which suggests that some internal interpreter state is corrupted
    at that point. (A sketch of the kind of client call that reaches
    the trampoline in #12 follows right after this list.)
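For illustration, here is a minimal, hedged sketch of the client-side
request that walks the code path above. The pool name "k8s-rbd" is
taken from the test setup in the notes below; the rest uses only the
public rbd/rados binding API. Note that the segfault only manifests
when this path runs inside the MGR's sub-interpreter, so a standalone
script like this is not expected to crash:

    import rados
    import rbd

    def on_progress(offset, total):
        # On an unpatched rbd.pyx, this plain Python function is what
        # the Cython trampoline in frame #12 ends up invoking from a
        # librbd worker thread.
        print(f"trash remove: {offset}/{total}")
        return 0

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        ioctx = cluster.open_ioctx("k8s-rbd")
        # Remove every trashed image and request progress updates;
        # this is essentially what the CSI driver's cleanup request
        # boils down to.
        for entry in rbd.RBD().trash_list(ioctx):
            rbd.RBD().trash_remove(ioctx, entry["id"],
                                   on_progress=on_progress)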
Since it's incredibly hard to debug and actually figure out what the
*real* underlying issue is, simply disable the on_progress callback
instead. I just hope that this doesn't move the problem somewhere
else.

Unless I'm mistaken, there aren't any other callbacks that get passed
through C/C++ via Cython [8] like this, so this should hopefully
prevent any further SIGSEGVs until this is fixed upstream (somehow).

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6635
[1]: https://forum.proxmox.com/threads/ceph-managers-seg-faulting-post-upgrade-8-9-upgrade.169363/post-796315
[2]: https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil
[3]: https://github.com/python/cpython/issues/117953
[4]: https://tracker.ceph.com/issues/67696
[5]: https://github.com/python/cpython/issues/138045
[6]: https://github.com/ceph/ceph/blob/c92aebb279828e9c3c1f5d24613efca272649e62/src/pybind/rbd/rbd.pyx#L878-L907
[7]: https://github.com/python/cpython/blob/282bd0fe98bf1c3432fd5a079ecf65f165a52587/Objects/dictobject.c#L1262-L1278
[8]: https://cython.org/

Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
---

Additional Notes:

I tested this with my local Kubernetes cluster on top of Debian
Trixie:

    # kubectl version
    Client Version: v1.33.4
    Kustomize Version: v5.6.0
    Server Version: v1.33.4

The version of the Ceph CSI drivers is v3.15.0. Posting the entire
testing config here would be way too much, but essentially it boiled
down to:

### 1. Configuring a storage class for each driver:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cephfs.csi.ceph.com
allowVolumeExpansion: true
parameters:
  clusterID:
  fsName: cephfs-main
  pool: cephfs-main_data
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-operator-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-operator-system
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-operator-system
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
allowVolumeExpansion: true
parameters:
  clusterID:
  pool: k8s-rbd
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-operator-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-operator-system
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-operator-system
reclaimPolicy: Delete
---

### 2. Configuring a persistent volume claim, one for each storage class:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: cephfs-ssd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-rbd
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: rbd-ssd
---
### 3. Tossing the persistent volume claims into a Debian Trixie pod that runs in an infinite loop:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-0
spec:
  containers:
  - env:
    image: docker.io/library/debian:13
    imagePullPolicy: IfNotPresent
    name: test-trixie
    command: ["bash", "-c", "while sleep 1; do date; done"]
    volumeMounts:
    - mountPath: /data
      name: data
    - mountPath: /data-rbd
      name: data-rbd
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
  - name: data-rbd
    persistentVolumeClaim:
      claimName: test-pvc-rbd
---

After an apply-delete-apply cycle, the MGRs usually segfault, because
the CSI driver requests the removal of the old PVC's RBD image from
the RBD trash. With this patch, this doesn't happen anymore; the old
RBD images get trashed and removed, and the new RBD images are
created as expected (see the sketch at the end of these notes for
what API consumers observe). The CephFS CSI driver keeps chugging
along fine as well; it was only the RBD stuff that caused issues,
AFAIA.

Let's hope I don't have to play SIGSEGV whack-a-mole and that this
workaround is enough to prevent this from happening again.
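As a closing sanity check, here is a hedged sketch of the observable
effect the patch has on consumers of the Python bindings; the image
name is hypothetical, and flatten() is one of the affected calls (see
the last hunk of the patch). With the callback wiring removed, librbd
is handed the C-level no_op_progress_callback instead:

    import rados
    import rbd

    calls = []

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        ioctx = cluster.open_ioctx("k8s-rbd")
        # 'some-clone' stands in for any cloned image; flattening it
        # used to report progress through the Python callback.
        with rbd.Image(ioctx, "some-clone") as image:
            image.flatten(on_progress=lambda offset, total:
                          calls.append((offset, total)) or 0)

    # With the patch applied, the operation still succeeds, but no
    # progress events ever reach Python:
    assert calls == []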
 ...nd-rbd-disable-on_progress-callbacks.patch | 163 ++++++++++++++++++
 patches/series                                |   1 +
 2 files changed, 164 insertions(+)
 create mode 100644 patches/0055-pybind-rbd-disable-on_progress-callbacks.patch

diff --git a/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch b/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch
new file mode 100644
index 0000000000..41e786e5b8
--- /dev/null
+++ b/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch
@@ -0,0 +1,163 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: "Max R. Carrara" <m.carrara@proxmox.com>
+Date: Tue, 9 Sep 2025 16:52:42 +0200
+Subject: [PATCH] pybind/rbd: disable on_progress callbacks
+
+Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
+Debian Trixie if a client requests the removal of an RBD image from
+the RBD trash (#6635 [0]).
+
+After a lot of investigation, the cause of this still isn't clear to
+me; the most likely culprit is a set of internal changes to Python
+sub-interpreters made between Python versions 3.12 and 3.13.
+
+What leads me to this conclusion is the following:
+ 1. A user on our forum noted [1] that the issue disappeared as soon
+    as they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm
+    ships Python 3.11, which predates any substantial changes to
+    sub-interpreters [2][3].
+
+ 2. There is an upstream issue [4] regarding another segfault during
+    MGR startup. The author concluded that this problem is related to
+    sub-interpreters and opened another issue [5] on Python's issue
+    tracker that goes into more detail.
+
+    Even though this concerns a completely different code path, it
+    shows that issues related to sub-interpreters are popping up
+    elsewhere, at the very least.
+
+ 3. The segfault happens *inside* the Python interpreter:
+
+    #0  0x000078e04d89e95c __pthread_kill_implementation (libc.so.6 + 0x9495c)
+    #1  0x000078e04d849cc2 __GI_raise (libc.so.6 + 0x3fcc2)
+    #2  0x00005ab95de92658 reraise_fatal (/usr/bin/ceph-mgr + 0x32d658)
+    #3  0x000078e04d849df0 __restore_rt (libc.so.6 + 0x3fdf0)
+    #4  0x000078e04ef598b0 _Py_dict_lookup (libpython3.13.so.1.0 + 0x1598b0)
+    #5  0x000078e04efa1843 _PyDict_GetItemRef_KnownHash (libpython3.13.so.1.0 + 0x1a1843)
+    #6  0x000078e04efa1af5 _PyType_LookupRef (libpython3.13.so.1.0 + 0x1a1af5)
+    #7  0x000078e04efa216b _Py_type_getattro_impl (libpython3.13.so.1.0 + 0x1a216b)
+    #8  0x000078e04ef6f60d PyObject_GetAttr (libpython3.13.so.1.0 + 0x16f60d)
+    #9  0x000078e04f043f20 _PyEval_EvalFrameDefault (libpython3.13.so.1.0 + 0x243f20)
+    #10 0x000078e04ef109dd _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x1109dd)
+    #11 0x000078e04f1d3442 _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x3d3442)
+    #12 0x000078e03b74ffed __pyx_f_3rbd_progress_callback (rbd.cpython-313-x86_64-linux-gnu.so + 0xacfed)
+    #13 0x000078e03afcc8af _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE13start_next_opEv (librbd.so.1 + 0x3cc8af)
+    #14 0x000078e03afccfed _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE9start_opsEm (librbd.so.1 + 0x3ccfed)
+    #15 0x000078e03afafec6 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_remove_objectsEv (librbd.so.1 + 0x3afec6)
+    #16 0x000078e03afb0560 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_copyup_objectsEv (librbd.so.1 + 0x3b0560)
+    #17 0x000078e03afb2e16 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE15should_completeEi (librbd.so.1 + 0x3b2e16)
+    #18 0x000078e03afae379 _ZN6librbd12AsyncRequestINS_8ImageCtxEE8completeEi (librbd.so.1 + 0x3ae379)
+    #19 0x000078e03ada8c70 _ZN7Context8completeEi (librbd.so.1 + 0x1a8c70)
+    #20 0x000078e03afcdb1e _ZN7Context8completeEi (librbd.so.1 + 0x3cdb1e)
+    #21 0x000078e04d6e4716 _ZN8librados14CB_AioCompleteclEv (librados.so.2 + 0xd2716)
+    #22 0x000078e04d6e5705 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xd3705)
+    #23 0x000078e04d6e5f8a _ZN5boost4asio19asio_handler_invokeINS0_6detail23strand_executor_service7invokerIKNS0_10io_context19basic_executor_typeISaIvELm0EEEvEEEEvRT_z (librados.so.2 + 0xd3f8a)
+    #24 0x000078e04d6fc598 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xea598)
+    #25 0x000078e04d6e9a71 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (librados.so.2 + 0xd7a71)
+    #26 0x000078e04d6fff63 _ZN5boost4asio10io_context3runEv (librados.so.2 + 0xedf63)
+    #27 0x000078e04dae1224 n/a (libstdc++.so.6 + 0xe1224)
+    #28 0x000078e04d89cb7b start_thread (libc.so.6 + 0x92b7b)
+    #29 0x000078e04d91a7b8 __clone3 (libc.so.6 + 0x1107b8)
+
+    Note that in #12, a "progress callback" is being called by
+    librbd. This callback is a plain Python function that is passed
+    down via Ceph's Python/C++ bindings for librbd [6]. (I'd provide
+    the stack traces of the other threads here as well, but they're
+    rather massive.)
+
+    From #11 down to #4, execution happens entirely within the Python
+    interpreter: this is just the callback being executed. The
+    segfault occurs at #4 in _Py_dict_lookup(), a private function
+    inside the Python interpreter that looks up a key in a `dict`
+    [7].
+    A function this fundamental should never fail; yet it does,
+    which suggests that some internal interpreter state is corrupted
+    at that point.
+
+Since it's incredibly hard to debug and actually figure out what the
+*real* underlying issue is, simply disable the on_progress callback
+instead. I just hope that this doesn't move the problem somewhere
+else.
+
+Unless I'm mistaken, there aren't any other callbacks that get passed
+through C/C++ via Cython [8] like this, so this should hopefully
+prevent any further SIGSEGVs until this is fixed upstream (somehow).
+
+[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6635
+[1]: https://forum.proxmox.com/threads/ceph-managers-seg-faulting-post-upgrade-8-9-upgrade.169363/post-796315
+[2]: https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil
+[3]: https://github.com/python/cpython/issues/117953
+[4]: https://tracker.ceph.com/issues/67696
+[5]: https://github.com/python/cpython/issues/138045
+[6]: https://github.com/ceph/ceph/blob/c92aebb279828e9c3c1f5d24613efca272649e62/src/pybind/rbd/rbd.pyx#L878-L907
+[7]: https://github.com/python/cpython/blob/282bd0fe98bf1c3432fd5a079ecf65f165a52587/Objects/dictobject.c#L1262-L1278
+[8]: https://cython.org/
+
+Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
+---
+ src/pybind/rbd/rbd.pyx | 18 ++++++------------
+ 1 file changed, 6 insertions(+), 12 deletions(-)
+
+diff --git a/src/pybind/rbd/rbd.pyx b/src/pybind/rbd/rbd.pyx
+index f206e78ed1d..921c150803b 100644
+--- a/src/pybind/rbd/rbd.pyx
++++ b/src/pybind/rbd/rbd.pyx
+@@ -790,8 +790,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_remove_with_progress(_ioctx, _name, _prog_cb, _prog_arg)
+         if ret != 0:
+@@ -898,8 +897,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_trash_remove_with_progress(_ioctx, _image_id, _force,
+                                                  _prog_cb, _prog_arg)
+@@ -1137,8 +1135,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_execute_with_progress(_ioctx, _image_name,
+                                                       _prog_cb, _prog_arg)
+@@ -1164,8 +1161,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_commit_with_progress(_ioctx, _image_name,
+                                                      _prog_cb, _prog_arg)
+@@ -1191,8 +1187,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_abort_with_progress(_ioctx, _image_name,
+                                                     _prog_cb, _prog_arg)
+@@ -4189,8 +4184,7 @@ written." % (self.name, ret, length))
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_flatten_with_progress(self.image, _prog_cb, _prog_arg)
+         if ret < 0:
diff --git a/patches/series b/patches/series
index fa95ce05d4..6bc974acc1 100644
--- a/patches/series
+++ b/patches/series
@@ -52,3 +52,4 @@
 0052-mgr-osd_perf_query-fix-ivalid-escape-sequence.patch
 0053-mgr-zabbix-fix-invalid-escape-sequences.patch
 0054-client-prohibit-unprivileged-users-from-setting-sgid.patch
+0055-pybind-rbd-disable-on_progress-callbacks.patch
-- 
2.47.3


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel