From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 0D3D91FF15C
	for <inbox@lore.proxmox.com>; Wed,  5 Feb 2025 11:09:18 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id E359A733E;
	Wed,  5 Feb 2025 11:09:09 +0100 (CET)
From: Alexander Zeidler <a.zeidler@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Wed,  5 Feb 2025 11:08:47 +0100
Message-Id: <20250205100850.3-3-a.zeidler@proxmox.com>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250205100850.3-1-a.zeidler@proxmox.com>
References: <20250205100850.3-1-a.zeidler@proxmox.com>
MIME-Version: 1.0
Subject: [pve-devel] [PATCH docs v2 3/6] ceph: troubleshooting: revise and
 add frequently needed information
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel" <pve-devel-bounces@lists.proxmox.com>

Existing information is slightly modified and retained.

Add information:
* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for a
  healthy cluster
* Briefly describe the common problem "OSDs down/crashed"

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
---
v2:
* implement all comments from Max Carrara
** use longer link texts
** fix build errors by adding two missing anchors in patch:
   "ceph: add anchors for use in troubleshooting"

 pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/pveceph.adoc b/pveceph.adoc
index 90bb975..7401d2b 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
 below will also give you an overview of the current events and actions to take.
+Both commands below run continuously until stopped with CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (without further updates)
+# and continuously append new status events
+pve# ceph --watch
 ----
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section covers frequently needed troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on the Affected Node
+
+* xref:disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+  `journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
+Ceph service crashes can be listed with `ceph crash ls` and inspected
+in detail with `ceph crash info <crash_id>`. Crashes marked as new
+can be acknowledged by running, for example,
+`ceph crash archive-all`.
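+
+For example, a session could look like this:
+
+----
+# List all recorded crashes; new ones are flagged in the output
+pve# ceph crash ls
+
+# Show the details of a specific crash
+pve# ceph crash info <crash_id>
+
+# Acknowledge all crashes currently marked as new
+pve# ceph crash archive-all
+----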
+
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be
 adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
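+
+For example, to follow an OSD log and temporarily raise its log level
+(the OSD ID and the chosen level are illustrative):
+
+----
+# Follow the log of OSD 7 on the node hosting it
+pve# tail -f /var/log/ceph/ceph-osd.7.log
+
+# Raise the log level of OSD 7 at runtime
+pve# ceph tell osd.7 config set debug_osd 10
+
+# Reset it to the default value afterwards
+pve# ceph tell osd.7 config set debug_osd 1/5
+----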
 
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
-
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut-down
+interface or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the
+xref:pvecm_cluster_network[corosync cluster network] and on the
+xref:pve_ceph_install_wizard[Ceph public and cluster network], as in
+the example after this list.
+
+* Disks or connection parts which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or
+xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
+
+* A setup that does not fulfill the
+xref:pve_ceph_recommendation[recommendations] for a healthy Ceph
+cluster.
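+
+A brief reachability check from one node (the network address shown
+is illustrative):
+
+----
+# Look up the configured Ceph networks
+pve# grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
+
+# Check whether another node is reachable on the Ceph network
+pve# ping -c 3 10.10.10.2
+----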
+
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+ ::
+
+OSDs `down`/crashed:::
+A faulty OSD will be reported as `down` and, in most cases,
+automatically marked `out` 10 minutes later. Depending on the cause,
+it can also automatically become `up` and `in` again. To try a manual
+activation via the web interface, go to __Any node -> Ceph -> OSD__,
+select the OSD and click on **Start**, **In** and **Reload**. When
+using the shell, run `ceph-volume lvm activate --all` on the affected
+node (see the example below).
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely reboot] the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
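++
+A brief shell session for the steps above (the OSD ID is
+illustrative):
++
+----
+# Show only OSDs that are currently down
+pve# ceph osd tree down
+
+# On the affected node, check the state of a specific OSD service
+pve# systemctl status ceph-osd@7.service
+
+# Try to activate all inactive OSDs on the affected node
+pve# ceph-volume lvm activate --all
+----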
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
-- 
2.39.5


