From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hannes Duerr <h.duerr@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH pve-docs v2 1/3] pvecm: restructure "remove a cluster node" section
Date: Wed, 4 Feb 2026 11:53:22 +0100
Message-ID: <20260204105324.61841-1-h.duerr@proxmox.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Proxmox VE development discussion

The old section did not have a clear structure or sequence to follow.
For example, the final point, `pvecm delnode`, was not included in the
list of steps required to remove the cluster node.
The new structure consists of prerequisites, steps to remove the
cluster node and how to rejoin the existing node. The steps are
explained using an example.

Signed-off-by: Hannes Duerr <h.duerr@proxmox.com>
---

Notes:
    v1 -> v2:
      * fixed typos (thanks @Maximilliano)
      * established "Cleanup steps after removing the cluster node" section

 pvecm.adoc | 179 ++++++++++++++++++++++++++---------------------------
 1 file changed, 88 insertions(+), 91 deletions(-)

diff --git a/pvecm.adoc b/pvecm.adoc
index d12dde7..ddd222c 100644
--- a/pvecm.adoc
+++ b/pvecm.adoc
@@ -318,22 +318,30 @@ Remove a Cluster Node
 CAUTION: Read the procedure carefully before proceeding, as it may not be
 what you want or need.

-Move all virtual machines from the node. Ensure that you have made copies of any
-local data or backups that you want to keep. In addition, make sure to remove
-any scheduled replication jobs to the node to be removed.
+The following steps explain how to remove a node from a cluster that
+was also part of a xref:chapter_pveceph[Ceph] cluster.
+If Ceph was not installed on your node, you can simply ignore the
+steps that mention it.
+
+.Prerequisites:
+
+* Move all virtual machines and containers away from the node.
+* Back up all local data on the node to be deleted.
+* Make sure the node to be deleted is not part of any replication job
+  anymore.
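++
+For example, one way to check this is to look at the cluster-wide
+replication configuration; any job that still references the node can
+be removed via the web interface or with `pvesr delete`:
++
+----
+# cat /etc/pve/replication.cfg
+----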
+
+CAUTION: If you fail to remove replication jobs from a node before
+removing the node itself, the replication job will become irremovable.
+Note that replication automatically switches direction when a
+replicated VM is migrated. Therefore, migrating a replicated VM from a
+node that is going to be deleted will set up replication jobs to that
+node automatically.
+
+* Ensure that the remaining Ceph cluster has sufficient storage space
+  and that the OSDs are running (i.e. `up` and `in`). The destruction
+  of any OSD, especially the last one on a node, will trigger a data
+  rebalance in Ceph.
-CAUTION: Failure to remove replication jobs to a node before removing said node
-will result in the replication job becoming irremovable. Especially note that
-replication automatically switches direction if a replicated VM is migrated, so
-by migrating a replicated VM from a node to be deleted, replication jobs will be
-set up to that node automatically.
-
-If the node to be removed has been configured for
-xref:chapter_pveceph[Ceph]:
-
-. Ensure that sufficient {pve} nodes with running OSDs (`up` and `in`)
-continue to exist.
-+
 NOTE: By default, Ceph pools have a `size/min_size` of `3/2` and a
 full node as `failure domain` at the object balancer
 xref:pve_ceph_device_classes[CRUSH]. So if less than `size` (`3`)
@@ -341,118 +349,107 @@ nodes with running OSDs are online, data redundancy will be degraded.
 If less than `min_size` are online, pool I/O will be blocked and
 affected guests may crash.

-. Ensure that sufficient xref:pve_ceph_monitors[monitors],
-xref:pve_ceph_manager[managers] and, if using CephFS,
-xref:pveceph_fs_mds[metadata servers] remain available.
+* Ensure that sufficient xref:pve_ceph_monitors[monitors],
+  xref:pve_ceph_manager[managers] and, if using CephFS,
+  xref:pveceph_fs_mds[metadata servers] remain available in the Ceph
+  cluster.

-. To maintain data redundancy, each destruction of an OSD, especially
-the last one on a node, will trigger a data rebalance. Therefore,
-ensure that the OSDs on the remaining nodes have sufficient free space
-left.
+.Remove the cluster node:

-. To remove Ceph from the node to be deleted, start by
-xref:pve_ceph_osd_destroy[destroying] its OSDs, one after the other.
+Before a node can be removed from a cluster, you must ensure that it
+is no longer part of the Ceph cluster and that no Ceph resources or
+services are residing on it.
+In the following, the cluster node `node4` will be removed from the
+cluster:

-. Once the xref:pve_ceph_mon_and_ts[CEPH status] is `HEALTH_OK` again,
-proceed by:
-
-[arabic]
-.. destroying its xref:pveceph_fs_mds[metadata server] via web
-interface at __Ceph -> CephFS__ or by running:
-+
 ----
-# pveceph mds destroy
+node4# pvecm nodes
+
+Membership information
+~~~~~~~~~~~~~~~~~~~~~~
+    Nodeid      Votes Name
+         1          1 node1
+         2          1 node2
+         3          1 node3
+         4          1 node4 (local)
 ----

-.. xref:pveceph_destroy_mon[destroying its monitor]
-.. xref:pveceph_destroy_mgr[destroying its manager]
+. Start by xref:pve_ceph_osd_destroy[destroying] the remaining OSDs on
+  the node to be deleted, one after another.
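++
+For example, assuming OSD `10` is one of the OSDs still located on
+`node4` (the OSD ID is only an example, check `ceph osd tree` for the
+actual IDs), it could be removed like this:
++
+----
+node4# ceph osd out 10
+
+# after the rebalance has finished (HEALTH_OK), stop and destroy the OSD:
+node4# systemctl stop ceph-osd@10.service
+node4# pveceph osd destroy 10
+----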
-. Finally, remove the now empty bucket ({pve} node to be removed) from
-the CRUSH hierarchy by running:
+. Wait until the xref:pve_ceph_mon_and_ts[CEPH status] reaches
+  `HEALTH_OK` again.
+
+. If it exists, destroy the remaining xref:pveceph_fs_mds[metadata server]
+  via the web interface at __Ceph -> CephFS__ or by running:
+
 ----
-# pveceph mds destroy
+node4# pveceph mds destroy node4
 ----

-In the following example, we will remove the node hp4 from the cluster.
+. xref:pveceph_destroy_mon[Destroy the remaining monitor.]

-Log in to a *different* cluster node (not hp4), and issue a `pvecm nodes`
-command to identify the node ID to remove:
+. xref:pveceph_destroy_mgr[Destroy the remaining manager.]

+. Finally, remove the now empty bucket ({pve} node to be removed) from
+  the CRUSH hierarchy.
++
 ----
- hp1# pvecm nodes
-
-Membership information
-~~~~~~~~~~~~~~~~~~~~~~
-    Nodeid      Votes Name
-         1          1 hp1 (local)
-         2          1 hp2
-         3          1 hp3
-         4          1 hp4
+node4# ceph osd crush remove node4
 ----
-
-At this point, you must power off hp4 and ensure that it will not power on
-again (in the network) with its current configuration.
-
+. Power off `node4` and make sure that it will not power on again
+  in this network with its current configuration.
++
 IMPORTANT: As mentioned above, it is critical to power off the node
 *before* removal, and make sure that it will *not* power on again
 (in the existing cluster network) with its current configuration.
 If you power on the node as it is, the cluster could end up broken,
 and it could be difficult to restore it to a functioning state.

-After powering off the node hp4, we can safely remove it from the cluster.
-
+. Log in to one of the remaining cluster nodes and remove the node
+  `node4` from the cluster.
++
 ----
- hp1# pvecm delnode hp4
-  Killing node 4
+node1# pvecm delnode node4
 ----
-
-NOTE: At this point, it is possible that you will receive an error message
-stating `Could not kill node (error = CS_ERR_NOT_EXIST)`. This does not
-signify an actual failure in the deletion of the node, but rather a failure in
-corosync trying to kill an offline node. Thus, it can be safely ignored.
-
-Use `pvecm nodes` or `pvecm status` to check the node list again. It should
-look something like:
-
+. Verify the node was successfully removed from the cluster:
++
 ----
-hp1# pvecm status
-
-...
-
-Votequorum information
-~~~~~~~~~~~~~~~~~~~~~~
-Expected votes:   3
-Highest expected: 3
-Total votes:      3
-Quorum:           2
-Flags:            Quorate
+node1# pvecm nodes

 Membership information
 ~~~~~~~~~~~~~~~~~~~~~~
     Nodeid      Votes Name
-0x00000001          1 192.168.15.90 (local)
-0x00000002          1 192.168.15.91
-0x00000003          1 192.168.15.92
+         1          1 node1 (local)
+         2          1 node2
+         3          1 node3
 ----
-If, for whatever reason, you want this server to join the same cluster again,
-you have to:
++
+NOTE: It is possible that you will receive an error message stating
+`Could not kill node (error = CS_ERR_NOT_EXIST)`. This does not
+signify an actual failure in the deletion of the node, but rather a
+failure in Corosync trying to kill an offline node. Thus, it can be
+safely ignored.
+
+.Cleanup steps after removing the cluster node:
+* The configuration files of the removed node will still reside in
+  '/etc/pve/nodes/node4' in the cluster filesystem. Recover any
+  configuration you still need and remove the directory afterwards.
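++
+For example, on one of the remaining nodes (the archive path is only
+an example):
++
+----
+# keep a copy of the old guest and node configuration, then remove it
+node1# tar czf /root/node4-pve-config.tar.gz -C /etc/pve/nodes node4
+node1# rm -r /etc/pve/nodes/node4
+----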
+* The SSH fingerprint of the removed node will still reside in the
+  'known_hosts' file on the other nodes. If you receive an SSH error
+  after rejoining a node with the same IP or hostname, run `pvecm
+  updatecerts` once on the re-added node to update its fingerprint
+  cluster-wide.

-* do a fresh install of {pve} on it,
+.Rejoin the same node again:

-* then join it, as explained in the previous section.
+If you want the same server to join the same cluster again, you have to:

-The configuration files for the removed node will still reside in
-'/etc/pve/nodes/hp4'. Recover any configuration you still need and remove the
-directory afterwards.
+* Reinstall {pve} on the server,

-NOTE: After removal of the node, its SSH fingerprint will still reside in the
-'known_hosts' of the other nodes. If you receive an SSH error after rejoining
-a node with the same IP or hostname, run `pvecm updatecerts` once on the
-re-added node to update its fingerprint cluster wide.
+* then xref:pvecm_join_node_to_cluster[rejoin the node to the cluster].

 [[pvecm_separate_node_without_reinstall]]
 Separate a Node Without Reinstalling
-- 
2.47.3