From: Kefu Chai <k.chai@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager 1/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
Date: Wed, 25 Mar 2026 11:51:04 +0800
Message-ID: <20260325035104.2264118-2-k.chai@proxmox.com>
In-Reply-To: <20260325035104.2264118-1-k.chai@proxmox.com>
Without this feature, every VM disk read goes to the primary OSD
regardless of where it sits in the cluster. In a hyperconverged setup
this means a network round-trip even when a replica of the requested
object lives on the same hypervisor node as the VM. Under an IOPS-heavy
workload such as a database, the cumulative cost is significant:
- same host (loopback): ~microseconds — no NIC, no switch
- same rack / switch: ~0.1 ms — one hop
- cross-rack / cross-AZ: 0.5–1+ ms — uplink traversal + distance
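To make the cumulative cost concrete, a back-of-the-envelope model (queue-depth-1
reads; the service time is an assumed figure for illustration, not a measurement
from this patch):

```python
# Illustrative only: with synchronous (queue-depth-1) reads, the achievable
# IOPS ceiling is simply 1 / (per-read latency), so every network hop added
# to each read lowers the ceiling directly.

def max_iops(service_ms: float, network_ms: float) -> float:
    """Upper bound on synchronous read IOPS for a given network penalty."""
    return 1000.0 / (service_ms + network_ms)

SERVICE_MS = 0.2  # assumed OSD-side service time

local = max_iops(SERVICE_MS, 0.0)     # replica on the same host (loopback)
rack = max_iops(SERVICE_MS, 0.1)      # one switch hop
cross_az = max_iops(SERVICE_MS, 0.7)  # inter-AZ round-trip

print(f"local: {local:.0f} IOPS, same rack: {rack:.0f}, cross-AZ: {cross_az:.0f}")
```

With these assumed numbers a cross-AZ read path costs roughly 4-5x the IOPS
ceiling of a host-local one, which is the gap the localize policy targets.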
With rbd_read_from_replica_policy=localize, librbd scores each replica
by its CRUSH distance to the client and issues the read to the closest
one. The savings scale with topology depth:
Hyperconverged (single DC): a replica co-located with the VM is served
entirely in the local kernel — zero network overhead. Even when no
replica is local, the localizer avoids unnecessary cross-switch traffic
by preferring the same rack or host over a more distant replica.
Multi-layer topology (node → rack → AZ → region): reads stay within
the nearest boundary. A VM in zone1 reads from a zone1 OSD rather than
one in zone2, reducing per-read latency by an order of magnitude and
eliminating inter-AZ bandwidth costs.
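A simplified model of the selection (illustrative only — librbd's actual scoring
differs in detail, and the level names here are assumptions): count how many
consecutive CRUSH levels, outermost first, the client shares with each replica,
and read from the best match.

```python
# Simplified, illustrative model of locality scoring. The idea: more shared
# CRUSH levels (checked from the outermost level inward) means a closer replica.

CRUSH_LEVELS = ["root", "region", "az", "rack", "host"]  # assumed hierarchy

def shared_depth(client: dict, replica: dict) -> int:
    """Number of consecutive CRUSH levels, outermost first, that match."""
    depth = 0
    for level in [l for l in CRUSH_LEVELS if l in client]:
        if client[level] == replica.get(level):
            depth += 1
        else:
            break
    return depth

def nearest_replica(client: dict, replicas: list[dict]) -> dict:
    """Pick the replica sharing the deepest CRUSH prefix with the client."""
    return max(replicas, key=lambda r: shared_depth(client, r))

client = {"root": "ceph-3az", "az": "zone1", "host": "pve01"}
replicas = [
    {"root": "ceph-3az", "az": "zone2", "host": "pve04"},  # primary, cross-AZ
    {"root": "ceph-3az", "az": "zone1", "host": "pve02"},  # same AZ
]
print(nearest_replica(client, replicas))  # the zone1 replica (pve02) wins
```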
For this to work the librbd client must declare its own position in the
CRUSH hierarchy. PVE does not manage the CRUSH topology itself — host
buckets are created automatically by Ceph when OSDs first boot, and
users may later run 'osd crush move' to add rack or AZ levels that PVE
knows nothing about. Writing a static crush_location at init time would
encode a snapshot of the topology that silently becomes stale after any
CRUSH reorganisation, potentially causing the localizer to make worse
decisions than the default.
A hook script avoids this by querying the live CRUSH map on each VM
start:
  ceph osd crush find "$(hostname -s)"
This returns the full location chain (root, az, rack, host, ...) as it
currently stands, so the client location is always consistent with the
actual OSD placement, whatever topology the operator has built.
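The transformation the hook performs is small: flatten the "location" map from
the command's JSON output into the "key=value ..." line Ceph expects on stdout.
A Python sketch of the same logic (sample JSON shape assumed):

```python
import json

# Assumed sample of the JSON returned by 'ceph osd crush find <host>';
# the hook only relies on the "location" map inside it.
raw = '{"location": {"root": "ceph-3az", "az": "zone1", "host": "pve01"}}'

def to_crush_location(raw_json: str) -> str:
    """Flatten the 'location' map into the 'key=value ...' line that a
    crush_location_hook must print on stdout."""
    loc = json.loads(raw_json).get("location", {})
    return " ".join(f"{k}={v}" for k, v in sorted(loc.items()))

print(to_crush_location(raw))  # az=zone1 host=pve01 root=ceph-3az
```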
The feature is shipped as an opt-in because it can degrade performance
in specific scenarios:
- Hook fallback with a non-default CRUSH root: when ceph osd crush find
fails at VM start (e.g. monitors temporarily unreachable during cluster
recovery or a rolling upgrade), the hook returns the static fallback
"host=<hostname> root=default". In clusters that use a custom root
bucket (e.g. root=ceph-3az for AZ topologies) this location does not
exist in the CRUSH map. The localizer then treats all replicas as
equidistant and may select one arbitrarily, including a cross-AZ
replica, rather than falling back to the primary. This is strictly
worse than the default primary-reads policy.
- Equidistant replicas: in a uniform single-DC cluster where all
replicas sit at the same CRUSH depth, the localizer may prefer a
secondary OSD over the primary. The primary often holds recently
written objects hot in its page cache; a secondary does not. Enabling
the policy in this topology provides no latency improvement but can
increase read latency for write-then-read workloads.
Operators who understand their topology and know that locality affinity
is beneficial (typically: hyperconverged nodes with co-located VMs and
OSDs, or multi-AZ clusters with AZ-aware CRUSH rules) should opt in
explicitly.
When enabled, pveceph init writes two entries into the [client] section
of ceph.conf:
  crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
  rbd_read_from_replica_policy = localize
These go into [client] rather than [global] because crush_location_hook
is also read by OSD daemons (for initial CRUSH placement on first boot).
Writing it to [global] would cause every OSD restart to call the hook,
adding an unnecessary mon round-trip and a soft circular dependency
during cluster recovery.
The two settings are activated together because each is meaningless
without the other.
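Put together, the resulting ceph.conf fragment is:

```ini
[client]
crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
rbd_read_from_replica_policy = localize
```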
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
PVE/API2/Ceph.pm | 17 ++++++++++
bin/Makefile | 3 +-
bin/ceph-crush-location | 43 ++++++++++++++++++++++++++
www/manager6/ceph/CephInstallWizard.js | 8 ++++-
4 files changed, 69 insertions(+), 2 deletions(-)
create mode 100644 bin/ceph-crush-location
diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index e9fdcd37..231b751c 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -181,6 +181,17 @@ __PACKAGE__->register_method({
optional => 1,
default => 0,
},
+ 'localize-reads' => {
+ description => "Enable locality-aware replica reads.\n\n"
+ . "Writes crush_location_hook and rbd_read_from_replica_policy=localize "
+ . "into the [client] section of ceph.conf so that VM disk reads prefer "
+ . "the replica nearest to the hypervisor node (same host, rack, or AZ). "
+ . "The hook queries the live CRUSH map on each VM start so it stays "
+ . "correct even after CRUSH topology changes (e.g. osd crush move).",
+ type => 'boolean',
+ optional => 1,
+ default => 0,
+ },
},
},
returns => { type => 'null' },
@@ -245,6 +256,12 @@ __PACKAGE__->register_method({
$cfg->{global}->{'cluster_network'} = $param->{'cluster-network'};
}
+ if ($param->{'localize-reads'}) {
+ $cfg->{client}->{'crush_location_hook'} =
+ '/usr/share/pve-manager/helpers/ceph-crush-location';
+ $cfg->{client}->{'rbd_read_from_replica_policy'} = 'localize';
+ }
+
cfs_write_file('ceph.conf', $cfg);
if ($auth eq 'cephx') {
diff --git a/bin/Makefile b/bin/Makefile
index 5d225d17..1560a109 100644
--- a/bin/Makefile
+++ b/bin/Makefile
@@ -32,7 +32,8 @@ HELPERS = \
pve-startall-delay \
pve-init-ceph-crash \
pve-firewall-commit \
- pve-sdn-commit
+ pve-sdn-commit \
+ ceph-crush-location
MIGRATIONS = \
pve-lvm-disable-autoactivation \
diff --git a/bin/ceph-crush-location b/bin/ceph-crush-location
new file mode 100644
index 00000000..ef53d183
--- /dev/null
+++ b/bin/ceph-crush-location
@@ -0,0 +1,43 @@
+#!/usr/bin/perl
+#
+# ceph-crush-location - CRUSH location hook for PVE Ceph clusters
+#
+# Invoked by Ceph client libraries (librbd/QEMU) when crush_location_hook
+# is configured. Must print this node's position in the CRUSH hierarchy to
+# stdout in "key=value ..." form, e.g.:
+#
+# root=default host=pve01
+# root=ceph-3az az=zone1 host=pve01
+#
+# The output is used by rbd_read_from_replica_policy=localize to select the
+# nearest replica for each read, reducing cross-host or cross-zone traffic.
+#
+# Arguments passed by the calling library (--cluster, --id, --type) are
+# accepted but ignored: the relevant location is that of the physical node,
+# not of any specific daemon.
+
+use strict;
+use warnings;
+
+use JSON::PP;
+
+chomp(my $hostname = `hostname -s`);
+
+my $location = eval {
+ my $json = `ceph osd crush find "$hostname" 2>/dev/null`;
+ return undef if !$json;
+ my $loc = JSON::PP->new->decode($json)->{location} // {};
+ join(' ', map { "$_=$loc->{$_}" } sort keys %$loc);
+};
+
+if ($location) {
+ print "$location\n";
+} else {
+ # Fallback for nodes with no OSDs in the CRUSH map yet, or when the
+ # cluster is temporarily unreachable at VM start time.
+ # Host-level affinity still allows the localizer to prefer any replica
+ # that happens to land on this node.
+ print "host=$hostname root=default\n";
+}
+
+exit 0;
diff --git a/www/manager6/ceph/CephInstallWizard.js b/www/manager6/ceph/CephInstallWizard.js
index 5a9dc6b0..04163a19 100644
--- a/www/manager6/ceph/CephInstallWizard.js
+++ b/www/manager6/ceph/CephInstallWizard.js
@@ -603,9 +603,15 @@ Ext.define('PVE.ceph.CephInstallWizard', {
},
emptyText: '2',
},
+ {
+ xtype: 'proxmoxcheckbox',
+ name: 'localize-reads',
+ fieldLabel: gettext('Localize reads'),
+ boxLabel: gettext('Prefer nearest replica for VM disk reads'),
+ },
],
onGetValues: function (values) {
- ['cluster-network', 'size', 'min_size'].forEach(function (field) {
+ ['cluster-network', 'size', 'min_size', 'localize-reads'].forEach(function (field) {
if (!values[field]) {
delete values[field];
}
--
2.47.3
Thread overview: 3+ messages
2026-03-25 3:51 [PATCH manager 0/1] " Kefu Chai
2026-03-25 3:51 ` Kefu Chai [this message]
2026-03-26 3:44 ` Kefu Chai