From: Alexander Zeidler <a.zeidler@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [pve-devel] [PATCH docs v2 3/6] ceph: troubleshooting: revise and add frequently needed information
Date: Wed,  5 Feb 2025 11:08:47 +0100	[thread overview]
Message-ID: <20250205100850.3-3-a.zeidler@proxmox.com> (raw)
In-Reply-To: <20250205100850.3-1-a.zeidler@proxmox.com>

Existing information is slightly modified and retained.

Add information:
* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for a
  healthy cluster
* Briefly describe the common problem "OSDs down/crashed"
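
The crash handling described above can be sketched as a small shell helper. The `ceph crash ls` output below is a simplified, assumed format used only for illustration; on a real cluster you would pipe the actual command output instead of the here-string, and run the commands in the trailing comments directly:

```shell
# Sample output in the style of `ceph crash ls` (assumed format, not
# verbatim Ceph output). Crashes flagged "*" in the NEW column are
# unacknowledged.
sample_output='ID                                                                ENTITY    NEW
2025-02-01T10:00:00.000000Z_1b2c3d4e-aaaa-bbbb-cccc-1234567890ab  osd.3     *
2025-02-02T11:30:00.000000Z_5f6e7d8c-dddd-eeee-ffff-0987654321ba  mon.pve1'

# Extract the IDs of crashes still marked as new (skip the header line,
# keep rows whose last column is "*").
new_crashes=$(printf '%s\n' "$sample_output" | awk 'NR > 1 && $NF == "*" { print $1 }')
printf '%s\n' "$new_crashes"

# For each listed ID you would then inspect details with:
#   ceph crash info <crash_id>
# and acknowledge all new crashes with:
#   ceph crash archive-all
```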

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
---
v2:
* implement all comments from Max Carrara
** using longer link texts
** fix build errors by adding two missing anchors in patch:
   "ceph: add anchors for use in troubleshooting"

 pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/pveceph.adoc b/pveceph.adoc
index 90bb975..7401d2b 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
 below will also give you an overview of the current events and actions to take.
+To stop their execution, press CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (no further updates)
+# and continuously append status event lines
+pve# ceph --watch
 ----
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section includes frequently used troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on Affected Node
+
+* xref:disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+  `journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
+Ceph service crashes can be listed and viewed in detail by running
+`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
+new can be acknowledged by running, for example,
+`ceph crash archive-all`.
+
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be
 adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
 
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
-
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut-down
+interface or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the
+xref:pvecm_cluster_network[corosync cluster network] and on the
+xref:pve_ceph_install_wizard[Ceph public and cluster network].
+
+* Disks or connection components which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or
+xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
+
+* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
+a healthy Ceph cluster.
+
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+ ::
+
+OSDs `down`/crashed:::
+A faulty OSD is reported as `down` and is usually marked `out`
+automatically 10 minutes later. Depending on the cause, it may also
+become `up` and `in` again automatically. To try a manual activation
+via the web interface, go to __Any node -> Ceph -> OSD__, select the
+OSD and click on **Start**, **In** and **Reload**. When using the
+shell, run `ceph-volume lvm activate --all` on the affected node.
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely reboot] the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
-- 
2.39.5




