* [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults
@ 2025-09-09 17:05 Max R. Carrara
  2025-09-10  7:00 ` Fabian Grünbichler
  2025-09-10  8:54 ` [pve-devel] superseded: " Max R. Carrara
  0 siblings, 2 replies; 4+ messages in thread
From: Max R. Carrara @ 2025-09-09 17:05 UTC (permalink / raw)
  To: pve-devel

Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
Debian Trixie if a client requests the removal of an RBD image from
the RBD trash (#6635 [0]).

After a lot of investigation, the cause of this still isn't clear to
me; the most likely culprits are some internal changes to Python
sub-interpreters that happened between Python versions 3.12 and 3.13.

What leads me to this conclusion is the following:
 1. A user on our forum noted [1] that the issue disappeared as soon as
    they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm has
    Python version 3.11, before any substantial changes to
    sub-interpreters [2][3] were made.

 2. There is an upstream issue [4] regarding another segfault during
    MGR startup. The author concluded that this problem is related to
    sub-interpreters and opened another issue [5] on Python's issue
    tracker that goes into more detail.

    Even though this is for a completely different code path, it shows
    that issues related to sub-interpreters are popping up elsewhere
    at the very least.

 3. The segfault happens *inside* the Python interpreter:
    #0  0x000078e04d89e95c __pthread_kill_implementation (libc.so.6 + 0x9495c)
    #1  0x000078e04d849cc2 __GI_raise (libc.so.6 + 0x3fcc2)
    #2  0x00005ab95de92658 reraise_fatal (/usr/bin/ceph-mgr + 0x32d658)
    #3  0x000078e04d849df0 __restore_rt (libc.so.6 + 0x3fdf0)
    #4  0x000078e04ef598b0 _Py_dict_lookup (libpython3.13.so.1.0 + 0x1598b0)
    #5  0x000078e04efa1843 _PyDict_GetItemRef_KnownHash (libpython3.13.so.1.0 + 0x1a1843)
    #6  0x000078e04efa1af5 _PyType_LookupRef (libpython3.13.so.1.0 + 0x1a1af5)
    #7  0x000078e04efa216b _Py_type_getattro_impl (libpython3.13.so.1.0 + 0x1a216b)
    #8  0x000078e04ef6f60d PyObject_GetAttr (libpython3.13.so.1.0 + 0x16f60d)
    #9  0x000078e04f043f20 _PyEval_EvalFrameDefault (libpython3.13.so.1.0 + 0x243f20)
    #10 0x000078e04ef109dd _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x1109dd)
    #11 0x000078e04f1d3442 _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x3d3442)
    #12 0x000078e03b74ffed __pyx_f_3rbd_progress_callback (rbd.cpython-313-x86_64-linux-gnu.so + 0xacfed)
    #13 0x000078e03afcc8af _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE13start_next_opEv (librbd.so.1 + 0x3cc8af)
    #14 0x000078e03afccfed _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE9start_opsEm (librbd.so.1 + 0x3ccfed)
    #15 0x000078e03afafec6 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_remove_objectsEv (librbd.so.1 + 0x3afec6)
    #16 0x000078e03afb0560 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_copyup_objectsEv (librbd.so.1 + 0x3b0560)
    #17 0x000078e03afb2e16 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE15should_completeEi (librbd.so.1 + 0x3b2e16)
    #18 0x000078e03afae379 _ZN6librbd12AsyncRequestINS_8ImageCtxEE8completeEi (librbd.so.1 + 0x3ae379)
    #19 0x000078e03ada8c70 _ZN7Context8completeEi (librbd.so.1 + 0x1a8c70)
    #20 0x000078e03afcdb1e _ZN7Context8completeEi (librbd.so.1 + 0x3cdb1e)
    #21 0x000078e04d6e4716 _ZN8librados14CB_AioCompleteclEv (librados.so.2 + 0xd2716)
    #22 0x000078e04d6e5705 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xd3705)
    #23 0x000078e04d6e5f8a _ZN5boost4asio19asio_handler_invokeINS0_6detail23strand_executor_service7invokerIKNS0_10io_context19basic_executor_typeISaIvELm0EEEvEEEEvRT_z (librados.so.2 + 0xd3f8a)
    #24 0x000078e04d6fc598 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xea598)
    #25 0x000078e04d6e9a71 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (librados.so.2 + 0xd7a71)
    #26 0x000078e04d6fff63 _ZN5boost4asio10io_context3runEv (librados.so.2 + 0xedf63)
    #27 0x000078e04dae1224 n/a (libstdc++.so.6 + 0xe1224)
    #28 0x000078e04d89cb7b start_thread (libc.so.6 + 0x92b7b)
    #29 0x000078e04d91a7b8 __clone3 (libc.so.6 + 0x1107b8)

    Note that in #12, you can see that a "progress callback" is being
    called by librbd. This callback is a plain Python function that is
    passed down via Ceph's Python/C++ bindings for librbd [6].
    (I'd provide more stack traces for the other threads here, but
    they're rather massive.)

    Then, from #11 down to #4, execution happens entirely within the
    Python interpreter: this is just the callback being executed.
    The segfault happens at #4 during _Py_dict_lookup(), a private
    function inside the Python interpreter that performs lookups in a
    `dict` [7]. A function this fundamental should never fail, yet it
    does, which suggests that some internal interpreter state is
    corrupted at that point. A small sketch of how such a callback is
    handed to the bindings follows below.

Since it's incredibly hard to debug and actually figure out what the
*real* underlying issue is, simply disable that on_progress callback
instead. I just hope that this doesn't move the problem somewhere
else.

Unless I'm mistaken, there aren't any other callbacks that get passed
through C/C++ via Cython [8] like this, so this should hopefully
prevent any further SIGSEGVs until this is fixed upstream (somehow).

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6635
[1]: https://forum.proxmox.com/threads/ceph-managers-seg-faulting-post-upgrade-8-9-upgrade.169363/post-796315
[2]: https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil
[3]: https://github.com/python/cpython/issues/117953
[4]: https://tracker.ceph.com/issues/67696
[5]: https://github.com/python/cpython/issues/138045
[6]: https://github.com/ceph/ceph/blob/c92aebb279828e9c3c1f5d24613efca272649e62/src/pybind/rbd/rbd.pyx#L878-L907
[7]: https://github.com/python/cpython/blob/282bd0fe98bf1c3432fd5a079ecf65f165a52587/Objects/dictobject.c#L1262-L1278
[8]: https://cython.org/

Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
---
Additional Notes:

I tested this with my local Kubernetes cluster on top of Debian Trixie:

# kubectl version
Client Version: v1.33.4
Kustomize Version: v5.6.0
Server Version: v1.33.4

The version of the Ceph CSI drivers is v3.15.0.

Posting the entire testing config here would be way too much, but
essentially it boiled down to:

### 1. Configuring a storage class for each driver:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cephfs.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  clusterID: <REDACTED>
  fsName: cephfs-main
  pool: cephfs-main_data

  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-operator-system

  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-operator-system

  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-operator-system

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  clusterID: <REDACTED>
  pool: k8s-rbd

  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-operator-system

  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi-operator-system

  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi-operator-system

---

### 2. Configuring a persistent volume claim, one for each storage class:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: cephfs-ssd

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-rbd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: rbd-ssd

---

### 3. Tossing the persistent volume claims into a Debian Trixie pod
       that runs in an infinite loop:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-0
spec:
  containers:
  - env:
    image: docker.io/library/debian:13
    imagePullPolicy: IfNotPresent
    name: test-trixie
    command: ["bash", "-c", "while sleep 1; do date; done"]
    volumeMounts:
      - mountPath: /data
        name: data
      - mountPath: /data-rbd
        name: data-rbd
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
    - name: data-rbd
      persistentVolumeClaim:
        claimName: test-pvc-rbd

---

After an apply-delete-apply cycle, the MGRs usually segfault, because
the CSI driver requests the removal of the old PVC's RBD image from the
RBD trash.
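
For reference, the same binding path can also be exercised directly with
the Python rados/rbd modules, without going through Kubernetes. This is
only a sketch: it runs in a regular interpreter rather than inside a
ceph-mgr sub-interpreter, so it mainly verifies the trash-remove path
(and that a passed on_progress callback is tolerated) rather than
reproducing the MGR crash itself. The image name is made up; the pool is
the one from the test setup above:

    import rados
    import rbd

    def on_progress(offset, total):
        # With the patched bindings installed, on_progress is accepted
        # but never invoked; without the patch, this callback re-enters
        # the interpreter from a librbd thread.
        print(f"progress: {offset}/{total}")
        return 0

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("k8s-rbd")
    try:
        r = rbd.RBD()
        r.create(ioctx, "trash-test", 1 << 30)  # 1 GiB scratch image
        r.trash_move(ioctx, "trash-test", 0)    # move it to the trash
        for entry in r.trash_list(ioctx):
            if entry["name"] == "trash-test":
                r.trash_remove(ioctx, entry["id"],
                               on_progress=on_progress)
    finally:
        ioctx.close()
        cluster.shutdown()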

With this patch, this doesn't happen anymore; the old RBD images get
trashed and removed, and the new RBD images are created as expected.

The CephFS CSI driver keeps chugging along fine as well; as far as I'm
aware, it was only the RBD side that caused issues.

Let's hope that I don't have to play SIGSEGV-whac-a-mole and that this
workaround is enough to prevent this from happening again.


 ...nd-rbd-disable-on_progress-callbacks.patch | 163 ++++++++++++++++++
 patches/series                                |   1 +
 2 files changed, 164 insertions(+)
 create mode 100644 patches/0055-pybind-rbd-disable-on_progress-callbacks.patch

diff --git a/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch b/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch
new file mode 100644
index 0000000000..41e786e5b8
--- /dev/null
+++ b/patches/0055-pybind-rbd-disable-on_progress-callbacks.patch
@@ -0,0 +1,163 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: "Max R. Carrara" <m.carrara@proxmox.com>
+Date: Tue, 9 Sep 2025 16:52:42 +0200
+Subject: [PATCH] pybind/rbd: disable on_progress callbacks
+
+Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
+Debian Trixie if a client requests the removal of an RBD image from
+the RBD trash (#6635 [0]).
+
+After a lot of investigation, the cause of this still isn't clear to
+me; the most likely culprits are some internal changes to Python
+sub-interpreters that happened between Python versions 3.12 and 3.13.
+
+What leads me to this conclusion is the following:
+ 1. A user on our forum noted [1] that the issue disappeared as soon as
+    they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm has
+    Python version 3.11, before any substantial changes to
+    sub-interpreters [2][3] were made.
+
+ 2. There is an upstream issue [4] regarding another segfault during
+    MGR startup. The author concluded that this problem is related to
+    sub-interpreters and opened another issue [5] on Python's issue
+    tracker that goes into more detail.
+
+    Even though this is for a completely different code path, it shows
+    that issues related to sub-interpreters are popping up elsewhere
+    at the very least.
+
+ 3. The segfault happens *inside* the Python interpreter:
+    #0  0x000078e04d89e95c __pthread_kill_implementation (libc.so.6 + 0x9495c)
+    #1  0x000078e04d849cc2 __GI_raise (libc.so.6 + 0x3fcc2)
+    #2  0x00005ab95de92658 reraise_fatal (/usr/bin/ceph-mgr + 0x32d658)
+    #3  0x000078e04d849df0 __restore_rt (libc.so.6 + 0x3fdf0)
+    #4  0x000078e04ef598b0 _Py_dict_lookup (libpython3.13.so.1.0 + 0x1598b0)
+    #5  0x000078e04efa1843 _PyDict_GetItemRef_KnownHash (libpython3.13.so.1.0 + 0x1a1843)
+    #6  0x000078e04efa1af5 _PyType_LookupRef (libpython3.13.so.1.0 + 0x1a1af5)
+    #7  0x000078e04efa216b _Py_type_getattro_impl (libpython3.13.so.1.0 + 0x1a216b)
+    #8  0x000078e04ef6f60d PyObject_GetAttr (libpython3.13.so.1.0 + 0x16f60d)
+    #9  0x000078e04f043f20 _PyEval_EvalFrameDefault (libpython3.13.so.1.0 + 0x243f20)
+    #10 0x000078e04ef109dd _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x1109dd)
+    #11 0x000078e04f1d3442 _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x3d3442)
+    #12 0x000078e03b74ffed __pyx_f_3rbd_progress_callback (rbd.cpython-313-x86_64-linux-gnu.so + 0xacfed)
+    #13 0x000078e03afcc8af _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE13start_next_opEv (librbd.so.1 + 0x3cc8af)
+    #14 0x000078e03afccfed _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE9start_opsEm (librbd.so.1 + 0x3ccfed)
+    #15 0x000078e03afafec6 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_remove_objectsEv (librbd.so.1 + 0x3afec6)
+    #16 0x000078e03afb0560 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_copyup_objectsEv (librbd.so.1 + 0x3b0560)
+    #17 0x000078e03afb2e16 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE15should_completeEi (librbd.so.1 + 0x3b2e16)
+    #18 0x000078e03afae379 _ZN6librbd12AsyncRequestINS_8ImageCtxEE8completeEi (librbd.so.1 + 0x3ae379)
+    #19 0x000078e03ada8c70 _ZN7Context8completeEi (librbd.so.1 + 0x1a8c70)
+    #20 0x000078e03afcdb1e _ZN7Context8completeEi (librbd.so.1 + 0x3cdb1e)
+    #21 0x000078e04d6e4716 _ZN8librados14CB_AioCompleteclEv (librados.so.2 + 0xd2716)
+    #22 0x000078e04d6e5705 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xd3705)
+    #23 0x000078e04d6e5f8a _ZN5boost4asio19asio_handler_invokeINS0_6detail23strand_executor_service7invokerIKNS0_10io_context19basic_executor_typeISaIvELm0EEEvEEEEvRT_z (librados.so.2 + 0xd3f8a)
+    #24 0x000078e04d6fc598 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xea598)
+    #25 0x000078e04d6e9a71 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (librados.so.2 + 0xd7a71)
+    #26 0x000078e04d6fff63 _ZN5boost4asio10io_context3runEv (librados.so.2 + 0xedf63)
+    #27 0x000078e04dae1224 n/a (libstdc++.so.6 + 0xe1224)
+    #28 0x000078e04d89cb7b start_thread (libc.so.6 + 0x92b7b)
+    #29 0x000078e04d91a7b8 __clone3 (libc.so.6 + 0x1107b8)
+
+    Note that in #12, you can see that a "progress callback" is being
+    called by librbd. This callback is a plain Python function that is
+    passed down via Ceph's Python/C++ bindings for librbd [6].
+    (I'd provide more stack traces for the other threads here, but
+    they're rather massive.)
+
+    Then, from #11 to #4 the entire execution happens within the
+    Python interpreter: This is just the callback being executed.
+    The segfault happens at #4 during _Py_dict_lookup(), which is a
+    private function inside the Python interpreter to look something
+    up in a `dict` [7]. As this function is so fundamental, it
+    shouldn't ever fail, yet it does, which suggests that
+    some internal interpreter state is most likely corrupted at that
+    point.
+
+Since it's incredibly hard to debug and actually figure out what the
+*real* underlying issue is, simply disable that on_progress callback
+instead. I just hope that this doesn't move the problem somewhere
+else.
+
+Unless I'm mistaken, there aren't any other callbacks that get passed
+through C/C++ via Cython [8] like this, so this should hopefully
+prevent any further SIGSEGVs until this is fixed upstream (somehow).
+
+[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6635
+[1]: https://forum.proxmox.com/threads/ceph-managers-seg-faulting-post-upgrade-8-9-upgrade.169363/post-796315
+[2]: https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil
+[3]: https://github.com/python/cpython/issues/117953
+[4]: https://tracker.ceph.com/issues/67696
+[5]: https://github.com/python/cpython/issues/138045
+[6]: https://github.com/ceph/ceph/blob/c92aebb279828e9c3c1f5d24613efca272649e62/src/pybind/rbd/rbd.pyx#L878-L907
+[7]: https://github.com/python/cpython/blob/282bd0fe98bf1c3432fd5a079ecf65f165a52587/Objects/dictobject.c#L1262-L1278
+[8]: https://cython.org/
+
+Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
+---
+ src/pybind/rbd/rbd.pyx | 18 ++++++------------
+ 1 file changed, 6 insertions(+), 12 deletions(-)
+
+diff --git a/src/pybind/rbd/rbd.pyx b/src/pybind/rbd/rbd.pyx
+index f206e78ed1d..921c150803b 100644
+--- a/src/pybind/rbd/rbd.pyx
++++ b/src/pybind/rbd/rbd.pyx
+@@ -790,8 +790,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_remove_with_progress(_ioctx, _name, _prog_cb, _prog_arg)
+         if ret != 0:
+@@ -898,8 +897,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_trash_remove_with_progress(_ioctx, _image_id, _force,
+                                                  _prog_cb, _prog_arg)
+@@ -1137,8 +1135,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_execute_with_progress(_ioctx, _image_name,
+                                                       _prog_cb, _prog_arg)
+@@ -1164,8 +1161,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_commit_with_progress(_ioctx, _image_name,
+                                                      _prog_cb, _prog_arg)
+@@ -1191,8 +1187,7 @@ class RBD(object):
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_migration_abort_with_progress(_ioctx, _image_name,
+                                                     _prog_cb, _prog_arg)
+@@ -4189,8 +4184,7 @@ written." % (self.name, ret, length))
+             librbd_progress_fn_t _prog_cb = &no_op_progress_callback
+             void *_prog_arg = NULL
+         if on_progress:
+-            _prog_cb = &progress_callback
+-            _prog_arg = <void *>on_progress
++            pass
+         with nogil:
+             ret = rbd_flatten_with_progress(self.image, _prog_cb, _prog_arg)
+         if ret < 0:
diff --git a/patches/series b/patches/series
index fa95ce05d4..6bc974acc1 100644
--- a/patches/series
+++ b/patches/series
@@ -52,3 +52,4 @@
 0052-mgr-osd_perf_query-fix-ivalid-escape-sequence.patch
 0053-mgr-zabbix-fix-invalid-escape-sequences.patch
 0054-client-prohibit-unprivileged-users-from-setting-sgid.patch
+0055-pybind-rbd-disable-on_progress-callbacks.patch
-- 
2.47.3




* Re: [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults
  2025-09-09 17:05 [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults Max R. Carrara
@ 2025-09-10  7:00 ` Fabian Grünbichler
  2025-09-10  7:54   ` Max R. Carrara
  2025-09-10  8:54 ` [pve-devel] superseded: " Max R. Carrara
  1 sibling, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2025-09-10  7:00 UTC (permalink / raw)
  To: Proxmox VE development discussion

On September 9, 2025 7:05 pm, Max R. Carrara wrote:
> Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
> Debian Trixie if a client requests the removal of an RBD image from
> the RBD trash (#6635 [0]).
> 
> After a lot of investigation, the cause of this still isn't clear to
> me; the most likely culprits are some internal changes to Python
> sub-interpreters that happened between Python versions 3.12 and 3.13.
> 
> What leads me to this conclusion is the following:
>  1. A user on our forum noted [1] that the issue disappeared as soon as
>     they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm has
>     Python version 3.11, before any substantial changes to
>     sub-interpreters [2][3] were made.

did you try with stock Debian Trixie packages (the Ceph version there is
still 18.2, which might help narrow it down)?

in any case, it would be good to bring this issue to upstream's
attention as well!
 
>  2. There is an upstream issue [4] regarding another segfault during
>     MGR startup. The author concluded that this problem is related to
>     sub-interpreters and opened another issue [5] on Python's issue
>     tracker that goes into more detail.
> 
>     Even though this is for a completely different code path, it shows
>     that issues related to sub-interpreters are popping up elsewhere
>     at the very least.

did you try reproducing that one? it seems it requires an optional
ceph-mgr plugin that we have packaged as well, so it should be fairly
straightforward..



* Re: [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults
  2025-09-10  7:00 ` Fabian Grünbichler
@ 2025-09-10  7:54   ` Max R. Carrara
  0 siblings, 0 replies; 4+ messages in thread
From: Max R. Carrara @ 2025-09-10  7:54 UTC (permalink / raw)
  To: Proxmox VE development discussion

On Wed Sep 10, 2025 at 9:00 AM CEST, Fabian Grünbichler wrote:
> On September 9, 2025 7:05 pm, Max R. Carrara wrote:
> > Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
> > Debian Trixie if a client requests the removal of an RBD image from
> > the RBD trash (#6635 [0]).
> > 
> > After a lot of investigation, the cause of this still isn't clear to
> > me; the most likely culprits are some internal changes to Python
> > sub-interpreters that happened between Python versions 3.12 and 3.13.
> > 
> > What leads me to this conclusion is the following:
> >  1. A user on our forum noted [1] that the issue disappeared as soon as
> >     they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm has
> >     Python version 3.11, before any substantial changes to
> >     sub-interpreters [2][3] were made.
>
> did you try with stock Debian Trixie packages (the Ceph version is still
> 18.2 there, which might help narrowing it down)?

Not yet, but I'm going to eventually. Will just take a while to set
everything up.

>
> in any case, it would be good to bring this issue to upstream's
> attention as well!

Already done ;)

https://tracker.ceph.com/issues/72713

>  
> >  2. There is an upstream issue [4] regarding another segfault during
> >     MGR startup. The author concluded that this problem is related to
> >     sub-interpreters and opened another issue [5] on Python's issue
> >     tracker that goes into more detail.
> > 
> >     Even though this is for a completely different code path, it shows
> >     that issues related to sub-interpreters are popping up elsewhere
> >     at the very least.
>
> did you try reproducing that one? it seems it requires an optional
> ceph-mgr plugin that we have packaged as well, so should be fairly
> straight-forward..

Not yet, since the root cause of the bug was already found, it seems:
https://github.com/python/cpython/issues/138045


* [pve-devel] superseded: [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults
  2025-09-09 17:05 [pve-devel] [PATCH ceph master v1] pybind/rbd: disable on_progress callbacks to prevent MGR segfaults Max R. Carrara
  2025-09-10  7:00 ` Fabian Grünbichler
@ 2025-09-10  8:54 ` Max R. Carrara
  1 sibling, 0 replies; 4+ messages in thread
From: Max R. Carrara @ 2025-09-10  8:54 UTC (permalink / raw)
  To: Proxmox VE development discussion

On Tue Sep 9, 2025 at 7:05 PM CEST, Max R. Carrara wrote:

superseded by: https://lore.proxmox.com/pve-devel/20250910085244.123467-1-m.carrara@proxmox.com/

