From: Alexander Zeidler <a.zeidler@proxmox.com>
To: pve-devel@lists.proxmox.com
Date: Wed, 5 Feb 2025 11:08:47 +0100
Message-Id: <20250205100850.3-3-a.zeidler@proxmox.com>
In-Reply-To: <20250205100850.3-1-a.zeidler@proxmox.com>
References: <20250205100850.3-1-a.zeidler@proxmox.com>
Subject: [pve-devel] [PATCH docs v2 3/6] ceph: troubleshooting: revise and add frequently needed information

Existing information is slightly modified and retained.
Add information:
* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for
  a healthy cluster
* Briefly describe the common problem "OSDs down/crashed"

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
---
v2:
* implement all comments from Max Carrara
** using longer link texts
** fix build errors by adding two missing anchors in patch:
   "ceph: add anchors for use in troubleshooting"

 pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/pveceph.adoc b/pveceph.adoc
index 90bb975..7401d2b 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -1150,22 +1150,78 @@
 The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
 below will also give you an overview of the current events and actions to take.
+To stop their execution, press CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (the output is not updated)
+# and continuously append lines of status events
+pve# ceph --watch
 ----
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section includes frequently used troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on the Affected Node
+
+* xref:disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+  `journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
+Ceph service crashes can be listed and viewed in detail by running
+`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
+new can be acknowledged by running, for example,
+`ceph crash archive-all`.
+
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be adjusted
 footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
 
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
-
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut-down
+interface or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the
+xref:pvecm_cluster_network[corosync cluster network] and on the
+xref:pve_ceph_install_wizard[Ceph public and cluster network].
+
+* Disks or connection components which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or
+xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
+
+* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
+a healthy Ceph cluster.
+
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+ ::
+
+OSDs `down`/crashed:::
+A faulty OSD will be reported as `down` and is usually marked as `out`
+automatically 10 minutes later. Depending on the cause, it can also
+automatically become `up` and `in` again.
+To try a manual activation via the web interface, go to
+__Any node -> Ceph -> OSD__, select the OSD and click on **Start**,
+**In** and **Reload**. When using the shell, run
+`ceph-volume lvm activate --all` on the affected node.
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely reboot] the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
-- 
2.39.5
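For illustration, the commands referenced by the new troubleshooting section
could be combined into a session like the sketch below. The OSD id `7` and the
`<crash_id>` are placeholders, and `ceph osd tree`, `systemctl status` and
`journalctl -u` are standard tools added here for context; they are not part
of the patch text itself.

----
# Watch the overall cluster health while investigating
pve# watch ceph --status

# List recorded Ceph service crashes, inspect one in detail,
# then acknowledge all crashes still marked as new
pve# ceph crash ls
pve# ceph crash info <crash_id>
pve# ceph crash archive-all

# Identify OSDs reported as down and the node they belong to
pve# ceph osd tree

# On the affected node: check the OSD service and its log,
# then try to activate all local OSDs again
pve# systemctl status ceph-osd@7.service
pve# journalctl -u ceph-osd@7.service --since "2 days ago"
pve# ceph-volume lvm activate --all
----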