From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from firstgate.proxmox.com (firstgate.proxmox.com [IPv6:2a01:7e0:0:424::9])
	by lore.proxmox.com (Postfix) with ESMTPS id 3C2C21FF13B
	for ; Wed, 25 Mar 2026 04:51:02 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 90AAB2A27;
	Wed, 25 Mar 2026 04:51:19 +0100 (CET)
From: Kefu Chai 
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager 1/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
Date: Wed, 25 Mar 2026 11:51:04 +0800
Message-ID: <20260325035104.2264118-2-k.chai@proxmox.com>
X-Mailer: git-send-email 2.47.3
In-Reply-To: <20260325035104.2264118-1-k.chai@proxmox.com>
References: <20260325035104.2264118-1-k.chai@proxmox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Message-ID-Hash: BO35MYKL73AM72TQJWJSC2WZQEA5UUHZ
X-MailFrom: k.chai@proxmox.com
X-Mailman-Version: 3.3.10
Precedence: list
List-Id: Proxmox VE development discussion
List-Help: 
List-Owner: 
List-Post: 
List-Subscribe: 
List-Unsubscribe: 

Without this feature, every VM disk read goes to the primary OSD regardless
of where it sits in the cluster. In a hyperconverged setup this means a
network round-trip even when a replica of the requested object lives on the
same hypervisor node as the VM. Under an IOPS-heavy workload such as a
database, the cumulative cost is significant:

  same host (loopback):   ~microseconds — no NIC, no switch
  same rack / switch:     ~0.1 ms — one hop
  cross-rack / cross-AZ:  0.5–1+ ms — uplink traversal + distance

With rbd_read_from_replica_policy=localize, librbd scores each replica by
its CRUSH distance to the client and issues the read to the closest one.
The savings scale with topology depth:

- Hyperconverged (single DC): a replica co-located with the VM is served
  entirely in the local kernel — zero network overhead. Even when no replica
  is local, the localizer avoids unnecessary cross-switch traffic by
  preferring the same rack or host over a more distant replica.

- Multi-layer topology (node → rack → AZ → region): reads stay within the
  nearest boundary. A VM in zone1 reads from a zone1 OSD rather than one in
  zone2, reducing per-read latency by an order of magnitude and eliminating
  inter-AZ bandwidth costs.

For this to work the librbd client must declare its own position in the
CRUSH hierarchy.
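For illustration, such a position is a chain of CRUSH bucket key=value pairs. A client could in principle declare it statically in ceph.conf (the bucket names ceph-3az, zone1, and pve01 below are made-up examples); this patch deliberately avoids the static form in favour of a hook, for the staleness reasons explained next:

```ini
; Hypothetical static declaration — NOT what this patch writes.
; Bucket names are illustrative only.
[client]
crush_location = root=ceph-3az az=zone1 host=pve01
rbd_read_from_replica_policy = localize
```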
PVE does not manage the CRUSH topology itself — host buckets are created
automatically by Ceph when OSDs first boot, and users may later run
'osd crush move' to add rack or AZ levels that PVE knows nothing about.
Writing a static crush_location at init time would encode a snapshot of the
topology that silently becomes stale after any CRUSH reorganisation,
potentially causing the localizer to make worse decisions than the default.

A hook script avoids this by querying the live CRUSH map on each VM start:

    ceph osd crush find "$(hostname -s)"

This returns the full location chain (root, az, rack, host, ...) as it
currently stands, so the client location is always consistent with the
actual OSD placement, whatever topology the operator has built.

The feature is shipped as an opt-in because it can degrade performance in
specific scenarios:

- Hook fallback with a non-default CRUSH root: when 'ceph osd crush find'
  fails at VM start (e.g. monitors temporarily unreachable during cluster
  recovery or a rolling upgrade), the hook returns the static fallback
  "host= root=default". In clusters that use a custom root bucket (e.g.
  root=ceph-3az for AZ topologies) this location does not exist in the
  CRUSH map. The localizer then treats all replicas as equidistant and may
  select one arbitrarily, including a cross-AZ replica, rather than falling
  back to the primary. This is strictly worse than the default primary-reads
  policy.

- Equidistant replicas: in a uniform single-DC cluster where all replicas
  sit at the same CRUSH depth, the localizer may prefer a secondary OSD over
  the primary. The primary often holds recently written objects hot in its
  page cache; a secondary does not. Enabling the policy in this topology
  provides no latency improvement but can increase read latency for
  write-then-read workloads.
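The transformation the hook performs can be sketched in shell: 'ceph osd crush find' returns JSON containing a "location" object, which must be flattened into the "key=value ..." form Ceph expects. The sample JSON below is an assumed shape (the real command returns more fields, and the real hook parses it with JSON::PP rather than this text munging):

```shell
# Sample `ceph osd crush find <host>` output — assumed shape; the hook
# only reads the "location" object.
json='{"location":{"az":"zone1","host":"pve01","root":"ceph-3az"}}'

# Flatten {"az":"zone1",...} into sorted "az=zone1 host=pve01 root=ceph-3az":
# strip braces/quotes, drop the "location:" prefix, split entries onto
# lines, turn the first ':' of each into '=', sort, and re-join with spaces.
loc=$(printf '%s' "$json" \
    | tr -d '{}"' \
    | sed 's/^location://' \
    | tr ',' '\n' \
    | sed 's/:/=/' \
    | sort \
    | xargs)

echo "$loc"   # → az=zone1 host=pve01 root=ceph-3az
```

This only works for a flat, one-level JSON object, which is why the real hook uses a proper JSON parser.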
Operators who understand their topology and know that locality affinity is
beneficial (typically: hyperconverged nodes with co-located VMs and OSDs, or
multi-AZ clusters with AZ-aware CRUSH rules) should opt in explicitly.

When enabled, pveceph init writes two entries into the [client] section of
ceph.conf:

    crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
    rbd_read_from_replica_policy = localize

These go into [client] rather than [global] because crush_location_hook is
also read by OSD daemons (for initial CRUSH placement on first boot).
Writing it to [global] would cause every OSD restart to call the hook,
adding an unnecessary mon round-trip and a soft circular dependency during
cluster recovery. The two settings are activated together because each is
meaningless without the other.

Signed-off-by: Kefu Chai 
---
 PVE/API2/Ceph.pm                       | 17 ++++++++++
 bin/Makefile                           |  3 +-
 bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
 www/manager6/ceph/CephInstallWizard.js |  8 ++++-
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 bin/ceph-crush-location

diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index e9fdcd37..231b751c 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -181,6 +181,17 @@ __PACKAGE__->register_method({
                 optional => 1,
                 default => 0,
             },
+            'localize-reads' => {
+                description => "Enable locality-aware replica reads.\n\n"
+                    . "Writes crush_location_hook and rbd_read_from_replica_policy=localize "
+                    . "into the [client] section of ceph.conf so that VM disk reads prefer "
+                    . "the replica nearest to the hypervisor node (same host, rack, or AZ). "
+                    . "The hook queries the live CRUSH map on each VM start so it stays "
+                    . "correct even after CRUSH topology changes (e.g. osd crush move).",
+                type => 'boolean',
+                optional => 1,
+                default => 0,
+            },
         },
     },
     returns => { type => 'null' },
@@ -245,6 +256,12 @@ __PACKAGE__->register_method({
             $cfg->{global}->{'cluster_network'} = $param->{'cluster-network'};
         }
 
+        if ($param->{'localize-reads'}) {
+            $cfg->{client}->{'crush_location_hook'} =
+                '/usr/share/pve-manager/helpers/ceph-crush-location';
+            $cfg->{client}->{'rbd_read_from_replica_policy'} = 'localize';
+        }
+
         cfs_write_file('ceph.conf', $cfg);
 
         if ($auth eq 'cephx') {
diff --git a/bin/Makefile b/bin/Makefile
index 5d225d17..1560a109 100644
--- a/bin/Makefile
+++ b/bin/Makefile
@@ -32,7 +32,8 @@ HELPERS = \
 	pve-startall-delay \
 	pve-init-ceph-crash \
 	pve-firewall-commit \
-	pve-sdn-commit
+	pve-sdn-commit \
+	ceph-crush-location
 
 MIGRATIONS = \
 	pve-lvm-disable-autoactivation \
diff --git a/bin/ceph-crush-location b/bin/ceph-crush-location
new file mode 100644
index 00000000..ef53d183
--- /dev/null
+++ b/bin/ceph-crush-location
@@ -0,0 +1,43 @@
+#!/usr/bin/perl
+#
+# ceph-crush-location - CRUSH location hook for PVE Ceph clusters
+#
+# Invoked by Ceph client libraries (librbd/QEMU) when crush_location_hook
+# is configured. Must print this node's position in the CRUSH hierarchy to
+# stdout in "key=value ..." form, e.g.:
+#
+#   root=default host=pve01
+#   root=ceph-3az az=zone1 host=pve01
+#
+# The output is used by rbd_read_from_replica_policy=localize to select the
+# nearest replica for each read, reducing cross-host or cross-zone traffic.
+#
+# Arguments passed by the calling library (--cluster, --id, --type) are
+# accepted but ignored: the relevant location is that of the physical node,
+# not of any specific daemon.
+
+use strict;
+use warnings;
+
+use JSON::PP;
+
+chomp(my $hostname = `hostname -s`);
+
+my $location = eval {
+    my $json = `ceph osd crush find "$hostname" 2>/dev/null`;
+    return undef if !$json;
+    my $loc = JSON::PP->new->decode($json)->{location} // {};
+    join(' ', map { "$_=$loc->{$_}" } sort keys %$loc);
+};
+
+if ($location) {
+    print "$location\n";
+} else {
+    # Fallback for nodes with no OSDs in the CRUSH map yet, or when the
+    # cluster is temporarily unreachable at VM start time.
+    # Host-level affinity still allows the localizer to prefer any replica
+    # that happens to land on this node.
+    print "host=$hostname root=default\n";
+}
+
+exit 0;
diff --git a/www/manager6/ceph/CephInstallWizard.js b/www/manager6/ceph/CephInstallWizard.js
index 5a9dc6b0..04163a19 100644
--- a/www/manager6/ceph/CephInstallWizard.js
+++ b/www/manager6/ceph/CephInstallWizard.js
@@ -603,9 +603,15 @@ Ext.define('PVE.ceph.CephInstallWizard', {
             },
             emptyText: '2',
         },
+        {
+            xtype: 'proxmoxcheckbox',
+            name: 'localize-reads',
+            fieldLabel: gettext('Localize reads'),
+            boxLabel: gettext('Prefer nearest replica for VM disk reads'),
+        },
     ],
     onGetValues: function (values) {
-        ['cluster-network', 'size', 'min_size'].forEach(function (field) {
+        ['cluster-network', 'size', 'min_size', 'localize-reads'].forEach(function (field) {
             if (!values[field]) {
                 delete values[field];
             }
-- 
2.47.3