Date: Mon, 03 Feb 2025 17:19:26 +0100
From: "Max Carrara" <m.carrara@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH docs 3/6] ceph: troubleshooting: revise and add frequently needed information
Message-Id: <D7IY3YVAE5LR.FMAM13UIHG08@proxmox.com>
In-Reply-To: <20250203142801.3-3-a.zeidler@proxmox.com>
References: <20250203142801.3-1-a.zeidler@proxmox.com> <20250203142801.3-3-a.zeidler@proxmox.com>

On Mon Feb 3, 2025 at 3:27 PM CET, Alexander Zeidler wrote:
> Existing information is slightly modified and retained.
>
> Add information:
> * List which logs are usually helpful for troubleshooting
> * Explain how to acknowledge listed Ceph crashes and view details
> * List common causes of Ceph problems and link to recommendations for a
>   healthy cluster
> * Briefly describe the common problem "OSDs down/crashed"
>
> Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
> ---
>  pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/pveceph.adoc b/pveceph.adoc
> index 90bb975..4e1c1e2 100644
> --- a/pveceph.adoc
> +++ b/pveceph.adoc
> @@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
>  ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
>  ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
>  below will also give you an overview of the current events and actions to take.
> +To stop their execution, press CTRL-C.
>
>  ----
> -# single time output
> -pve# ceph -s
> -# continuously output status changes (press CTRL+C to stop)
> -pve# ceph -w
> +# Continuously watch the cluster status
> +pve# watch ceph --status
> +
> +# Print the cluster status once (not being updated)
> +# and continuously append lines of status events
> +pve# ceph --watch
>  ----
>
> +[[pve_ceph_ts]]
> +Troubleshooting
> +~~~~~~~~~~~~~~~
> +
> +This section includes frequently used troubleshooting information.
> +More information can be found on the official Ceph website under
> +Troubleshooting
> +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
> +
> +[[pve_ceph_ts_logs]]
> +.Relevant Logs on Affected Node
> +
> +* xref:_disk_health_monitoring[Disk Health Monitoring]

For some reason, the "_disk_health_monitoring" anchor above breaks
building the docs for me -- "make update" exits with an error,
complaining that it can't find the anchor.

The one-page docs ("pve-admin-guide.html") seems to build just fine,
though. The anchor works there too, so I'm not sure what's going wrong
there exactly.

> +* __System -> System Log__ (or, for example,
> +  `journalctl --since "2 days ago"`)
> +* IPMI and RAID controller logs
> +
> +Ceph service crashes can be listed and viewed in detail by running
> +`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
> +new can be acknowledged by running, for example,
> +`ceph crash archive-all`.
> +
>  To get a more detailed view, every Ceph service has a log file under
>  `/var/log/ceph/`. If more detail is required, the log level can be
>  adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
>
> -You can find more information about troubleshooting
> -footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
> -a Ceph cluster on the official website.
> -
> +[[pve_ceph_ts_causes]]
> +.Common Causes of Ceph Problems
> +
> +* Network problems like congestion, a faulty switch, a shut down
> +interface or a blocking firewall. Check whether all {pve} nodes are
> +reliably reachable on the xref:_cluster_network[corosync] network and

Would personally prefer "xref:_cluster_network[corosync network]"
above, but no hard opinions there.

> +on the xref:pve_ceph_install_wizard[configured] Ceph public and
> +cluster network.

Would also prefer [configured Ceph public and cluster network] as a
whole here.

> +
> +* Disk or connection parts which are:
> +** defective
> +** not firmly mounted
> +** lacking I/O performance under higher load (e.g. when using HDDs,
> +consumer hardware or xref:pve_ceph_recommendation_raid[inadvisable]
> +RAID controllers)

Same here; I would prefer to highlight [inadvisable RAID controllers]
as a whole.

> +
> +* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
> +a healthy Ceph cluster.
> +
> +[[pve_ceph_ts_problems]]
> +.Common Ceph Problems
> + ::
> +
> +OSDs `down`/crashed:::
> +A faulty OSD will be reported as `down` and mostly (auto) `out` 10
> +minutes later. Depending on the cause, it can also automatically
> +become `up` and `in` again. To try a manual activation via web
> +interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
> +on **Start**, **In** and **Reload**. When using the shell, run on the
> +affected node `ceph-volume lvm activate --all`.
> ++
> +To activate a failed OSD, it may be necessary to
> +xref:ha_manager_node_maintenance[safely] reboot the respective node

And again here: Would personally prefer [safely reboot] in the anchor
ref.
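
To illustrate what I mean, roughly (just rephrasing the hunk above,
nothing new):

    xref:ha_manager_node_maintenance[safely reboot] the respective node

instead of:

    xref:ha_manager_node_maintenance[safely] reboot the respective node

That way the whole phrase becomes the rendered link text, same as in
the other spots mentioned further up.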
> +or, as a last resort, to
> +xref:pve_ceph_osd_replace[recreate or replace] the OSD.
>
>  ifdef::manvolnum[]
>  include::pve-copyright.adoc[]

Note: The only thing that really stood out to me was the
"_disk_health_monitoring" anchor breaking the docs build on my system;
the other comments here are just tiny style suggestions. If you
disagree with them, no hard feelings at all! :P
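
P.S. Regarding the anchor: just a guess, but "_disk_health_monitoring"
looks like an auto-generated section ID, which might explain why it
only resolves in the one-page build. If that turns out to be the
cause, one option could be to give that section an explicit anchor and
reference it instead -- rough sketch only, the actual file and heading
level may differ:

    [[disk_health_monitoring]]
    Disk Health Monitoring
    ~~~~~~~~~~~~~~~~~~~~~~

and then use `xref:disk_health_monitoring[Disk Health Monitoring]` in
pveceph.adoc. Haven't verified this, though, so take it with a grain
of salt.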