Re: [pve-devel] [PATCH docs 3/6] ceph: troubleshooting: revise and add frequently needed information

From: "Max Carrara" <m.carrara@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH docs 3/6] ceph: troubleshooting: revise and add frequently needed information
Date: Mon, 03 Feb 2025 17:19:26 +0100	[thread overview]
Message-ID: <D7IY3YVAE5LR.FMAM13UIHG08@proxmox.com> (raw)
In-Reply-To: <20250203142801.3-3-a.zeidler@proxmox.com>

On Mon Feb 3, 2025 at 3:27 PM CET, Alexander Zeidler wrote:
> Existing information is slightly modified and retained.
>
> Add information:
> * List which logs are usually helpful for troubleshooting
> * Explain how to acknowledge listed Ceph crashes and view details
> * List common causes of Ceph problems and link to recommendations for a
>   healthy cluster
> * Briefly describe the common problem "OSDs down/crashed"
>
> Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
> ---
>  pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/pveceph.adoc b/pveceph.adoc
> index 90bb975..4e1c1e2 100644
> --- a/pveceph.adoc
> +++ b/pveceph.adoc
> @@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
>  ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
>  ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
>  below will also give you an overview of the current events and actions to take.
> +To stop their execution, press CTRL-C.
>  
>  ----
> -# single time output
> -pve# ceph -s
> -# continuously output status changes (press CTRL+C to stop)
> -pve# ceph -w
> +# Continuously watch the cluster status
> +pve# watch ceph --status
> +
> +# Print the cluster status once (not being updated)
> +# and continuously append lines of status events
> +pve# ceph --watch
>  ----
>  
> +[[pve_ceph_ts]]
> +Troubleshooting
> +~~~~~~~~~~~~~~~
> +
> +This section includes frequently used troubleshooting information.
> +More information can be found on the official Ceph website under
> +Troubleshooting
> +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
> +
> +[[pve_ceph_ts_logs]]
> +.Relevant Logs on Affected Node
> +
> +* xref:_disk_health_monitoring[Disk Health Monitoring]

For some reason, the "_disk_health_monitoring" anchor above breaks
building the docs for me -- "make update" exits with an error,
complaining that it can't find the anchor. The one-page docs
("pve-admin-guide.html") seems to build just fine, though. The anchor
works there too, so I'm not sure what's going wrong there exactly.

> +* __System -> System Log__ (or, for example,
> +  `journalctl --since "2 days ago"`)
> +* IPMI and RAID controller logs
> +
> +Ceph service crashes can be listed and viewed in detail by running
> +`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
> +new can be acknowledged by running, for example,
> +`ceph crash archive-all`.
> +
>  To get a more detailed view, every Ceph service has a log file under
>  `/var/log/ceph/`. If more detail is required, the log level can be
>  adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
>  
> -You can find more information about troubleshooting
> -footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
> -a Ceph cluster on the official website.
> -
> +[[pve_ceph_ts_causes]]
> +.Common Causes of Ceph Problems
> +
> +* Network problems like congestion, a faulty switch, a shut down
> +interface or a blocking firewall. Check whether all {pve} nodes are
> +reliably reachable on the xref:_cluster_network[corosync] network and

Would personally prefer "xref:_cluster_network[corosync network]" above,
but no hard opinions there.

> +on the xref:pve_ceph_install_wizard[configured] Ceph public and
> +cluster network.

Would also prefer [configured Ceph public and cluster network] as a
whole here.

> +
> +* Disk or connection parts which are:
> +** defective
> +** not firmly mounted
> +** lacking I/O performance under higher load (e.g. when using HDDs,
> +consumer hardware or xref:pve_ceph_recommendation_raid[inadvisable]
> +RAID controllers)

Same here; I would prefer to highlight [inadvisable RAID controllers] as
a whole.

> +
> +* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
> +a healthy Ceph cluster.
> +
> +[[pve_ceph_ts_problems]]
> +.Common Ceph Problems
> + ::
> +
> +OSDs `down`/crashed:::
> +A faulty OSD will be reported as `down` and mostly (auto) `out` 10
> +minutes later. Depending on the cause, it can also automatically
> +become `up` and `in` again. To try a manual activation via web
> +interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
> +on **Start**, **In** and **Reload**. When using the shell, run on the
> +affected node `ceph-volume lvm activate --all`.
> ++
> +To activate a failed OSD, it may be necessary to
> +xref:ha_manager_node_maintenance[safely] reboot the respective node

And again here: Would personally prefer [safely reboot] in the anchor
ref.

> +or, as a last resort, to
> +xref:pve_ceph_osd_replace[recreate or replace] the OSD.
>  
>  ifdef::manvolnum[]
>  include::pve-copyright.adoc[]

Note: The only thing that really stood out to me was the
"_disk_health_monitoring" refusing to build on my system; the other
comments here are just tiny style suggestions. If you disagree with
them, no hard feelings at all! :P

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel