From: Kefu Chai <k.chai@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager 1/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
Date: Wed, 25 Mar 2026 11:51:04 +0800	[thread overview]
Message-ID: <20260325035104.2264118-2-k.chai@proxmox.com> (raw)
In-Reply-To: <20260325035104.2264118-1-k.chai@proxmox.com>

Without this feature, every VM disk read goes to the primary OSD,
regardless of where that OSD sits in the cluster relative to the VM.
In a hyperconverged setup this means a network round-trip even when a
replica of the requested object lives on the same hypervisor node as
the VM. Under an IOPS-heavy workload such as a database, the cumulative
cost is significant:

  same host (loopback):   ~microseconds  — no NIC, no switch
  same rack / switch:     ~0.1 ms        — one hop
  cross-rack / cross-AZ:  0.5–1+ ms      — uplink traversal + distance
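
The cumulative effect is easy to quantify. The following is illustrative
arithmetic only, not a benchmark; the IOPS rate and per-read saving are
assumed values:

```python
# Illustrative: at a database-like rate of 10,000 read IOPS, shaving
# ~0.1 ms of network path off each read removes about one full second
# of cumulative latency per wall-clock second (amortised across the
# queue depth of the workload).
iops = 10_000
saved_per_read_ms = 0.1
cumulative_saved_ms = iops * saved_per_read_ms
print(cumulative_saved_ms)  # 1000.0
```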

With rbd_read_from_replica_policy=localize, librbd scores each replica
by its CRUSH distance to the client and issues the read to the closest
one. The savings scale with topology depth:

  Hyperconverged (single DC): a replica co-located with the VM is served
  entirely in the local kernel — zero network overhead. Even when no
  replica is local, the localizer avoids unnecessary cross-switch traffic
  by preferring a replica in the same rack or on the same host over a
  more distant one.

  Multi-layer topology (node → rack → AZ → region): reads stay within
  the nearest boundary. A VM in zone1 reads from a zone1 OSD rather than
  one in zone2, reducing per-read latency by an order of magnitude and
  eliminating inter-AZ bandwidth costs.
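
The selection logic can be pictured as scoring each replica by how much
of its CRUSH location chain it shares with the client, counted from the
root down. This is a Python sketch of the idea only, not librbd's actual
implementation; the OSD names and locations are hypothetical:

```python
# Score each replica by shared CRUSH ancestry with the client, counted
# from the root down, and read from the best-scoring replica.
def shared_depth(client_loc, replica_loc, levels=("root", "az", "rack", "host")):
    depth = 0
    for level in levels:
        # Stop at the first level where the chains diverge (or the
        # client chain has no entry for that level).
        if level in client_loc and client_loc[level] == replica_loc.get(level):
            depth += 1
        else:
            break
    return depth

client = {"root": "ceph-3az", "az": "zone1", "host": "pve01"}
replicas = {
    "osd.3": {"root": "ceph-3az", "az": "zone2", "host": "pve07"},   # primary, cross-AZ
    "osd.11": {"root": "ceph-3az", "az": "zone1", "host": "pve02"},  # same AZ as the VM
}
best = max(replicas, key=lambda osd: shared_depth(client, replicas[osd]))
print(best)  # osd.11 (the same-AZ replica wins over the cross-AZ primary)
```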

For this to work the librbd client must declare its own position in the
CRUSH hierarchy. PVE does not manage the CRUSH topology itself — host
buckets are created automatically by Ceph when OSDs first boot, and
users may later run 'osd crush move' to add rack or AZ levels that PVE
knows nothing about. Writing a static crush_location at init time would
encode a snapshot of the topology that silently becomes stale after any
CRUSH reorganisation, potentially causing the localizer to make worse
decisions than the default.

A hook script avoids this by querying the live CRUSH map on each VM
start:

  ceph osd crush find "$(hostname -s)"

This returns the full location chain (root, az, rack, host, ...) as it
currently stands, so the client location is always consistent with the
actual OSD placement, whatever topology the operator has built.
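
The hook's handling of that output can be sketched in Python (the JSON
shown here is hypothetical; the hook only relies on a top-level
"location" object and emits its entries as sorted "key=value" pairs):

```python
import json

# Hypothetical output of `ceph osd crush find <host>`; the exact JSON
# shape depends on the Ceph release, but the hook only reads the
# top-level "location" mapping.
raw = '{"location": {"root": "ceph-3az", "az": "zone1", "host": "pve01"}}'
loc = json.loads(raw).get("location") or {}
# Emit the "key=value ..." form expected for a crush_location, with
# sorted keys for deterministic output (as the Perl hook does).
line = " ".join(f"{k}={v}" for k, v in sorted(loc.items()))
print(line)  # az=zone1 host=pve01 root=ceph-3az
```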

The feature is shipped as an opt-in because it can degrade performance
in specific scenarios:

- Hook fallback with a non-default CRUSH root: when ceph osd crush find
  fails at VM start (e.g. monitors temporarily unreachable during cluster
  recovery or a rolling upgrade), the hook returns the static fallback
  "host=<hostname> root=default". In clusters that use a custom root
  bucket (e.g. root=ceph-3az for AZ topologies) this location does not
  exist in the CRUSH map. The localizer then treats all replicas as
  equidistant and may select one arbitrarily, including a cross-AZ
  replica, rather than falling back to the primary. This is strictly
  worse than the default primary-reads policy.

- Equidistant replicas: in a uniform single-DC cluster where all
  replicas sit at the same CRUSH depth, the localizer may prefer a
  secondary OSD over the primary. The primary often holds recently
  written objects hot in its page cache; a secondary does not. Enabling
  the policy in this topology provides no latency improvement but can
  increase read latency for write-then-read workloads.

Operators who understand their topology and know that locality affinity
is beneficial (typically: hyperconverged nodes with co-located VMs and
OSDs, or multi-AZ clusters with AZ-aware CRUSH rules) should opt in
explicitly.

When enabled, pveceph init writes two entries into the [client] section
of ceph.conf:

  crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
  rbd_read_from_replica_policy = localize

These go into [client] rather than [global] because crush_location_hook
is also read by OSD daemons (for initial CRUSH placement on first boot).
Writing it to [global] would cause every OSD restart to call the hook,
adding an unnecessary mon round-trip and a soft circular dependency
during cluster recovery.

The two settings are activated together because each is meaningless
without the other.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 PVE/API2/Ceph.pm                       | 17 ++++++++++
 bin/Makefile                           |  3 +-
 bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
 www/manager6/ceph/CephInstallWizard.js |  8 ++++-
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 bin/ceph-crush-location

diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index e9fdcd37..231b751c 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -181,6 +181,17 @@ __PACKAGE__->register_method({
                 optional => 1,
                 default => 0,
             },
+            'localize-reads' => {
+                description => "Enable locality-aware replica reads.\n\n"
+                    . "Writes crush_location_hook and rbd_read_from_replica_policy=localize "
+                    . "into the [client] section of ceph.conf so that VM disk reads prefer "
+                    . "the replica nearest to the hypervisor node (same host, rack, or AZ). "
+                    . "The hook queries the live CRUSH map on each VM start so it stays "
+                    . "correct even after CRUSH topology changes (e.g. osd crush move).",
+                type => 'boolean',
+                optional => 1,
+                default => 0,
+            },
         },
     },
     returns => { type => 'null' },
@@ -245,6 +256,12 @@ __PACKAGE__->register_method({
                     $cfg->{global}->{'cluster_network'} = $param->{'cluster-network'};
                 }
 
+                if ($param->{'localize-reads'}) {
+                    $cfg->{client}->{'crush_location_hook'} =
+                        '/usr/share/pve-manager/helpers/ceph-crush-location';
+                    $cfg->{client}->{'rbd_read_from_replica_policy'} = 'localize';
+                }
+
                 cfs_write_file('ceph.conf', $cfg);
 
                 if ($auth eq 'cephx') {
diff --git a/bin/Makefile b/bin/Makefile
index 5d225d17..1560a109 100644
--- a/bin/Makefile
+++ b/bin/Makefile
@@ -32,7 +32,8 @@ HELPERS =			\
 	pve-startall-delay	\
 	pve-init-ceph-crash	\
 	pve-firewall-commit	\
-	pve-sdn-commit
+	pve-sdn-commit		\
+	ceph-crush-location
 
 MIGRATIONS =			\
 	pve-lvm-disable-autoactivation		\
diff --git a/bin/ceph-crush-location b/bin/ceph-crush-location
new file mode 100644
index 00000000..ef53d183
--- /dev/null
+++ b/bin/ceph-crush-location
@@ -0,0 +1,43 @@
+#!/usr/bin/perl
+#
+# ceph-crush-location - CRUSH location hook for PVE Ceph clusters
+#
+# Invoked by Ceph client libraries (librbd/QEMU) when crush_location_hook
+# is configured. Must print this node's position in the CRUSH hierarchy to
+# stdout in "key=value ..." form, e.g.:
+#
+#   root=default host=pve01
+#   root=ceph-3az az=zone1 host=pve01
+#
+# The output is used by rbd_read_from_replica_policy=localize to select the
+# nearest replica for each read, reducing cross-host or cross-zone traffic.
+#
+# Arguments passed by the calling library (--cluster, --id, --type) are
+# accepted but ignored: the relevant location is that of the physical node,
+# not of any specific daemon.
+
+use strict;
+use warnings;
+
+use JSON::PP;
+
+chomp(my $hostname = `hostname -s`);
+
+my $location = eval {
+    my $json = `ceph osd crush find "$hostname" 2>/dev/null`;
+    return undef if !$json;
+    my $loc = JSON::PP->new->decode($json)->{location} // {};
+    join(' ', map { "$_=$loc->{$_}" } sort keys %$loc);
+};
+
+if ($location) {
+    print "$location\n";
+} else {
+    # Fallback for nodes with no OSDs in the CRUSH map yet, or when the
+    # cluster is temporarily unreachable at VM start time.
+    # Host-level affinity still allows the localizer to prefer any replica
+    # that happens to land on this node.
+    print "host=$hostname root=default\n";
+}
+
+exit 0;
diff --git a/www/manager6/ceph/CephInstallWizard.js b/www/manager6/ceph/CephInstallWizard.js
index 5a9dc6b0..04163a19 100644
--- a/www/manager6/ceph/CephInstallWizard.js
+++ b/www/manager6/ceph/CephInstallWizard.js
@@ -603,9 +603,15 @@ Ext.define('PVE.ceph.CephInstallWizard', {
                     },
                     emptyText: '2',
                 },
+                {
+                    xtype: 'proxmoxcheckbox',
+                    name: 'localize-reads',
+                    fieldLabel: gettext('Localize reads'),
+                    boxLabel: gettext('Prefer nearest replica for VM disk reads'),
+                },
             ],
             onGetValues: function (values) {
-                ['cluster-network', 'size', 'min_size'].forEach(function (field) {
+                ['cluster-network', 'size', 'min_size', 'localize-reads'].forEach(function (field) {
                     if (!values[field]) {
                         delete values[field];
                     }
-- 
2.47.3




