From: "Max Carrara" <m.carrara@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH docs 3/6] ceph: troubleshooting: revise and add frequently needed information
Date: Mon, 03 Feb 2025 17:19:26 +0100
Message-ID: <D7IY3YVAE5LR.FMAM13UIHG08@proxmox.com>
In-Reply-To: <20250203142801.3-3-a.zeidler@proxmox.com>

On Mon Feb 3, 2025 at 3:27 PM CET, Alexander Zeidler wrote:
> Existing information is slightly modified and retained.
>
> Add information:
> * List which logs are usually helpful for troubleshooting
> * Explain how to acknowledge listed Ceph crashes and view details
> * List common causes of Ceph problems and link to recommendations for a
>   healthy cluster
> * Briefly describe the common problem "OSDs down/crashed"
>
> Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
> ---
>  pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/pveceph.adoc b/pveceph.adoc
> index 90bb975..4e1c1e2 100644
> --- a/pveceph.adoc
> +++ b/pveceph.adoc
> @@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
>  ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
>  ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
>  below will also give you an overview of the current events and actions to take.
> +To stop their execution, press CTRL-C.
>  
>  ----
> -# single time output
> -pve# ceph -s
> -# continuously output status changes (press CTRL+C to stop)
> -pve# ceph -w
> +# Continuously watch the cluster status
> +pve# watch ceph --status
> +
> +# Print the cluster status once (not being updated)
> +# and continuously append lines of status events
> +pve# ceph --watch
>  ----
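
Small thought, not blocking at all: since `ceph -s` is dropped from the
example, it might be nice to also keep a one-shot variant for readers
who only want a single snapshot, plus `ceph health detail` for a more
verbose view of current warnings/errors. Rough sketch only:

----
# Print the cluster status once
pve# ceph status

# Show details on current health warnings and errors
pve# ceph health detail
----
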
>  
> +[[pve_ceph_ts]]
> +Troubleshooting
> +~~~~~~~~~~~~~~~
> +
> +This section includes frequently used troubleshooting information.
> +More information can be found on the official Ceph website under
> +Troubleshooting
> +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
> +
> +[[pve_ceph_ts_logs]]
> +.Relevant Logs on Affected Node
> +
> +* xref:_disk_health_monitoring[Disk Health Monitoring]

For some reason, the "_disk_health_monitoring" anchor above breaks
building the docs for me -- "make update" exits with an error,
complaining that it can't find the anchor. The one-page docs
("pve-admin-guide.html") seem to build just fine, though, and the
anchor works there as well, so I'm not sure what exactly is going
wrong.

> +* __System -> System Log__ (or, for example,
> +  `journalctl --since "2 days ago"`)
> +* IPMI and RAID controller logs
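
Tiny optional addition: since the system log is mentioned above, an
example of narrowing it down to a single Ceph service might be handy,
roughly like this (the OSD ID is just a placeholder):

----
# System log of a single Ceph daemon, e.g. an OSD, for the last two days
pve# journalctl -u ceph-osd@<osd-id>.service --since "2 days ago"
----
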
> +
> +Ceph service crashes can be listed and viewed in detail by running
> +`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
> +new can be acknowledged by running, for example,
> +`ceph crash archive-all`.
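
Maybe a short example sequence would help here as well, e.g. (just a
sketch):

----
# List crashes that have not been acknowledged yet
pve# ceph crash ls-new

# Show the details of a specific crash
pve# ceph crash info <crash_id>

# Acknowledge a single crash, or all new crashes at once
pve# ceph crash archive <crash_id>
pve# ceph crash archive-all
----
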
> +
>  To get a more detailed view, every Ceph service has a log file under
>  `/var/log/ceph/`. If more detail is required, the log level can be
>  adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
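
Optionally, a one-liner showing how that can be done might make this
more concrete -- the option and value below are only meant as an
illustration:

----
# Raise the log level of all OSDs, then revert to the default again
pve# ceph config set osd debug_osd 10
pve# ceph config rm osd debug_osd
----
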
>  
> -You can find more information about troubleshooting
> -footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
> -a Ceph cluster on the official website.
> -
> +[[pve_ceph_ts_causes]]
> +.Common Causes of Ceph Problems
> +
> +* Network problems like congestion, a faulty switch, a shut down
> +interface or a blocking firewall. Check whether all {pve} nodes are
> +reliably reachable on the xref:_cluster_network[corosync] network and

Would personally prefer "xref:_cluster_network[corosync network]" above,
but no hard opinions there.

> +on the xref:pve_ceph_install_wizard[configured] Ceph public and
> +cluster network.

Would also prefer [configured Ceph public and cluster network] as a
whole here.
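
Also just an idea, no strong opinion: pointing readers at one or two
concrete commands for those reachability checks could be useful, along
these lines (IPs/names are placeholders):

----
# Check the state of the corosync links
pve# corosync-cfgtool -s

# Check which Ceph public/cluster networks are configured
pve# grep -e cluster_network -e public_network /etc/pve/ceph.conf

# Test whether another node is reachable on one of these networks
pve# ping <IP-of-other-node>
----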

> +
> +* Disk or connection parts which are:
> +** defective
> +** not firmly mounted
> +** lacking I/O performance under higher load (e.g. when using HDDs,
> +consumer hardware or xref:pve_ceph_recommendation_raid[inadvisable]
> +RAID controllers)

Same here; I would prefer to highlight [inadvisable RAID controllers] as
a whole.

> +
> +* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
> +a healthy Ceph cluster.
> +
> +[[pve_ceph_ts_problems]]
> +.Common Ceph Problems
> + ::
> +
> +OSDs `down`/crashed:::
> +A faulty OSD will be reported as `down` and mostly (auto) `out` 10
> +minutes later. Depending on the cause, it can also automatically
> +become `up` and `in` again. To try a manual activation via web
> +interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
> +on **Start**, **In** and **Reload**. When using the shell, run on the
> +affected node `ceph-volume lvm activate --all`.
> ++
> +To activate a failed OSD, it may be necessary to
> +xref:ha_manager_node_maintenance[safely] reboot the respective node

And again here: Would personally prefer [safely reboot] in the anchor
ref.

> +or, as a last resort, to
> +xref:pve_ceph_osd_replace[recreate or replace] the OSD.
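
One more small suggestion, feel free to ignore: a short example of how
to inspect the affected OSD before and after reactivating it might
round this off, e.g.:

----
# Show which OSDs are down and on which node they reside
pve# ceph osd tree down

# On the affected node, check the service state of a specific OSD
pve# systemctl status ceph-osd@<osd-id>.service

# Try to activate all inactive OSD volumes on the affected node
pve# ceph-volume lvm activate --all
----
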
>  
>  ifdef::manvolnum[]
>  include::pve-copyright.adoc[]

Note: The only thing that really stood out to me was the
"_disk_health_monitoring" anchor breaking the build on my system; the
other comments here are just tiny style suggestions. If you disagree
with them, no hard feelings at all! :P




