public inbox for pve-devel@lists.proxmox.com
* [PATCH manager 0/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
@ 2026-03-25  3:51 Kefu Chai
  2026-03-25  3:51 ` [PATCH manager 1/1] " Kefu Chai
  2026-03-26  3:44 ` [PATCH manager 0/1] " Kefu Chai
  0 siblings, 2 replies; 3+ messages in thread
From: Kefu Chai @ 2026-03-25  3:51 UTC (permalink / raw)
  To: pve-devel

This patch was prompted by a forum thread [1] in which a user reported
persistent high IO wait on PostgreSQL VMs running on a three-AZ Ceph
cluster. The discussion surfaced a general optimization opportunity:
librbd, by default, always reads from the primary OSD regardless of
its location. In a multi-AZ deployment, that can mean every read pays
a cross-AZ round-trip even when a same-AZ replica is available.

rbd_read_from_replica_policy = localize addresses this by directing
librbd to prefer the nearest replica, but it requires the client to
declare its own position in the CRUSH hierarchy. This patch ships a
hook script that supplies that position by querying the live CRUSH map
(ceph osd crush find), and wires it up as an opt-in in pveceph init.

The benefit scales with topology: in a multi-AZ cluster it keeps reads
within the same AZ; in a hyperconverged setup, reads to a co-located
OSD never leave the host at all. The feature is opt-in because it can
degrade performance when replicas are equidistant or when the hook
falls back to an incorrect CRUSH root — see the commit message for
details.

[1] https://forum.proxmox.com/threads/ceph-vm-with-high-io-wait.181751/

Kefu Chai (1):
  ceph: add opt-in locality-aware replica reads (crush_location_hook)

 PVE/API2/Ceph.pm                       | 17 ++++++++++
 bin/Makefile                           |  3 +-
 bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
 www/manager6/ceph/CephInstallWizard.js |  8 ++++-
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 bin/ceph-crush-location

-- 
2.47.3


* [PATCH manager 1/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
  2026-03-25  3:51 [PATCH manager 0/1] ceph: add opt-in locality-aware replica reads (crush_location_hook) Kefu Chai
@ 2026-03-25  3:51 ` Kefu Chai
  2026-03-26  3:44 ` [PATCH manager 0/1] " Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-03-25  3:51 UTC (permalink / raw)
  To: pve-devel

Without this feature, every VM disk read goes to the primary OSD
regardless of where it sits in the cluster. In a hyperconverged setup
this means a network round-trip even when a replica of the requested
object lives on the same hypervisor node as the VM. Under an IOPS-heavy
workload such as a database, the cumulative cost is significant:

  same host (loopback):   ~microseconds  — no NIC, no switch
  same rack / switch:     ~0.1 ms        — one hop
  cross-rack / cross-AZ:  0.5–1+ ms      — uplink traversal + distance

With rbd_read_from_replica_policy=localize, librbd scores each replica
by its CRUSH distance to the client and issues the read to the closest
one. The savings scale with topology depth:

  Hyperconverged (single DC): a replica co-located with the VM is served
  entirely in the local kernel — zero network overhead. Even when no
  replica is local, the localizer avoids unnecessary cross-switch traffic
  by preferring the same rack or host over a more distant replica.

  Multi-layer topology (node → rack → AZ → region): reads stay within
  the nearest boundary. A VM in zone1 reads from a zone1 OSD rather than
  one in zone2, reducing per-read latency by an order of magnitude and
  eliminating inter-AZ bandwidth costs.
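The selection logic described above can be sketched in a few lines. This is a toy illustration only, assuming CRUSH "distance" can be approximated by counting how many location levels a replica shares with the client; librbd's real scoring works on the CRUSH map and its bucket types, but the sketch shows why a same-AZ replica beats a cross-AZ primary:

```python
LEVELS = ("root", "region", "az", "rack", "host")  # broadest bucket type first

def shared_levels(client_loc, osd_loc):
    """Count consecutive matching levels, skipping levels the client lacks."""
    n = 0
    for level in LEVELS:
        if level not in client_loc:
            continue  # topologies may omit levels (e.g. no rack buckets)
        if client_loc[level] == osd_loc.get(level):
            n += 1
        else:
            break  # a mismatch at a broad level ends the shared prefix
    return n

def pick_replica(client_loc, replicas):
    """Prefer the replica sharing the longest location prefix with the client."""
    return max(replicas, key=lambda r: shared_levels(client_loc, r["loc"]))

client = {"root": "ceph-3az", "az": "zone1", "host": "pve01"}
replicas = [
    {"osd": 3, "loc": {"root": "ceph-3az", "az": "zone2", "host": "pve04"}},  # primary
    {"osd": 7, "loc": {"root": "ceph-3az", "az": "zone1", "host": "pve02"}},
]
print(pick_replica(client, replicas)["osd"])  # → 7, the same-AZ replica
```

With the default primary-reads policy, osd.3 in zone2 would serve every read; the localizer instead picks osd.7, keeping the round-trip inside zone1.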

For this to work the librbd client must declare its own position in the
CRUSH hierarchy. PVE does not manage the CRUSH topology itself — host
buckets are created automatically by Ceph when OSDs first boot, and
users may later run 'osd crush move' to add rack or AZ levels that PVE
knows nothing about. Writing a static crush_location at init time would
encode a snapshot of the topology that silently becomes stale after any
CRUSH reorganisation, potentially causing the localizer to make worse
decisions than the default.

A hook script avoids this by querying the live CRUSH map on each VM
start:

  ceph osd crush find "$(hostname -s)"

This returns the full location chain (root, az, rack, host, ...) as it
currently stands, so the client location is always consistent with the
actual OSD placement, whatever topology the operator has built.
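What the hook does with that output can be shown with a small sketch. The exact JSON shape below is an assumption for illustration; the hook only relies on a top-level "location" object, which it flattens into the sorted "key=value ..." form Ceph expects:

```python
import json

# Hypothetical output of `ceph osd crush find pve01`; only the "location"
# object matters to the hook, the surrounding fields are illustrative.
sample = """
{
  "node": "pve01",
  "location": {"root": "ceph-3az", "az": "zone1", "host": "pve01"}
}
"""

loc = json.loads(sample).get("location", {})
# Same flattening the Perl hook performs: sorted key=value pairs, one line.
crush_location = " ".join(f"{k}={loc[k]}" for k in sorted(loc))
print(crush_location)  # → az=zone1 host=pve01 root=ceph-3az
```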

The feature is shipped as an opt-in because it can degrade performance
in specific scenarios:

- Hook fallback with a non-default CRUSH root: when ceph osd crush find
  fails at VM start (e.g. monitors temporarily unreachable during cluster
  recovery or a rolling upgrade), the hook returns the static fallback
  "host=<hostname> root=default". In clusters that use a custom root
  bucket (e.g. root=ceph-3az for AZ topologies) this location does not
  exist in the CRUSH map. The localizer then treats all replicas as
  equidistant and may select one arbitrarily, including a cross-AZ
  replica, rather than falling back to the primary. This is strictly
  worse than the default primary-reads policy.

- Equidistant replicas: in a uniform single-DC cluster where all
  replicas sit at the same CRUSH depth, the localizer may prefer a
  secondary OSD over the primary. The primary often holds recently
  written objects hot in its page cache; a secondary does not. Enabling
  the policy in this topology provides no latency improvement but can
  increase read latency for write-then-read workloads.

Operators who understand their topology and know that locality affinity
is beneficial (typically: hyperconverged nodes with co-located VMs and
OSDs, or multi-AZ clusters with AZ-aware CRUSH rules) should opt in
explicitly.

When enabled, pveceph init writes two entries into the [client] section
of ceph.conf:

  crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
  rbd_read_from_replica_policy = localize

These go into [client] rather than [global] because crush_location_hook
is also read by OSD daemons (for initial CRUSH placement on first boot).
Writing it to [global] would cause every OSD restart to call the hook,
adding an unnecessary mon round-trip and a soft circular dependency
during cluster recovery.

The two settings are written together because neither is useful alone: the
localize policy needs a client location to compare replicas against, and the
hook's output is only consumed while a replica-read policy is active.
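For reference, the resulting ceph.conf fragment (both values taken verbatim
from the entries above) looks like:

```ini
[client]
crush_location_hook = /usr/share/pve-manager/helpers/ceph-crush-location
rbd_read_from_replica_policy = localize
```

On the command line the option would presumably be passed as
`pveceph init --localize-reads 1`; the parameter name comes from the API
schema in this patch, but the exact CLI invocation is an assumption here.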

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
 PVE/API2/Ceph.pm                       | 17 ++++++++++
 bin/Makefile                           |  3 +-
 bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
 www/manager6/ceph/CephInstallWizard.js |  8 ++++-
 4 files changed, 69 insertions(+), 2 deletions(-)
 create mode 100644 bin/ceph-crush-location

diff --git a/PVE/API2/Ceph.pm b/PVE/API2/Ceph.pm
index e9fdcd37..231b751c 100644
--- a/PVE/API2/Ceph.pm
+++ b/PVE/API2/Ceph.pm
@@ -181,6 +181,17 @@ __PACKAGE__->register_method({
                 optional => 1,
                 default => 0,
             },
+            'localize-reads' => {
+                description => "Enable locality-aware replica reads.\n\n"
+                    . "Writes crush_location_hook and rbd_read_from_replica_policy=localize "
+                    . "into the [client] section of ceph.conf so that VM disk reads prefer "
+                    . "the replica nearest to the hypervisor node (same host, rack, or AZ). "
+                    . "The hook queries the live CRUSH map on each VM start so it stays "
+                    . "correct even after CRUSH topology changes (e.g. osd crush move).",
+                type => 'boolean',
+                optional => 1,
+                default => 0,
+            },
         },
     },
     returns => { type => 'null' },
@@ -245,6 +256,12 @@ __PACKAGE__->register_method({
                     $cfg->{global}->{'cluster_network'} = $param->{'cluster-network'};
                 }
 
+                if ($param->{'localize-reads'}) {
+                    $cfg->{client}->{'crush_location_hook'} =
+                        '/usr/share/pve-manager/helpers/ceph-crush-location';
+                    $cfg->{client}->{'rbd_read_from_replica_policy'} = 'localize';
+                }
+
                 cfs_write_file('ceph.conf', $cfg);
 
                 if ($auth eq 'cephx') {
diff --git a/bin/Makefile b/bin/Makefile
index 5d225d17..1560a109 100644
--- a/bin/Makefile
+++ b/bin/Makefile
@@ -32,7 +32,8 @@ HELPERS =			\
 	pve-startall-delay	\
 	pve-init-ceph-crash	\
 	pve-firewall-commit	\
-	pve-sdn-commit
+	pve-sdn-commit		\
+	ceph-crush-location
 
 MIGRATIONS =			\
 	pve-lvm-disable-autoactivation		\
diff --git a/bin/ceph-crush-location b/bin/ceph-crush-location
new file mode 100644
index 00000000..ef53d183
--- /dev/null
+++ b/bin/ceph-crush-location
@@ -0,0 +1,43 @@
+#!/usr/bin/perl
+#
+# ceph-crush-location - CRUSH location hook for PVE Ceph clusters
+#
+# Invoked by Ceph client libraries (librbd/QEMU) when crush_location_hook
+# is configured. Must print this node's position in the CRUSH hierarchy to
+# stdout in "key=value ..." form, e.g.:
+#
+#   root=default host=pve01
+#   root=ceph-3az az=zone1 host=pve01
+#
+# The output is used by rbd_read_from_replica_policy=localize to select the
+# nearest replica for each read, reducing cross-host or cross-zone traffic.
+#
+# Arguments passed by the calling library (--cluster, --id, --type) are
+# accepted but ignored: the relevant location is that of the physical node,
+# not of any specific daemon.
+
+use strict;
+use warnings;
+
+use JSON::PP;
+
+chomp(my $hostname = `hostname -s`);
+
+my $location = eval {
+    my $json = `ceph osd crush find "$hostname" 2>/dev/null`;
+    return undef if !$json;
+    my $loc = JSON::PP->new->decode($json)->{location} // {};
+    join(' ', map { "$_=$loc->{$_}" } sort keys %$loc);
+};
+
+if ($location) {
+    print "$location\n";
+} else {
+    # Fallback for nodes with no OSDs in the CRUSH map yet, or when the
+    # cluster is temporarily unreachable at VM start time.
+    # The host-level entry still allows the localizer to prefer any replica
+    # that happens to land on this node.
+    print "host=$hostname root=default\n";
+}
+
+exit 0;
diff --git a/www/manager6/ceph/CephInstallWizard.js b/www/manager6/ceph/CephInstallWizard.js
index 5a9dc6b0..04163a19 100644
--- a/www/manager6/ceph/CephInstallWizard.js
+++ b/www/manager6/ceph/CephInstallWizard.js
@@ -603,9 +603,15 @@ Ext.define('PVE.ceph.CephInstallWizard', {
                     },
                     emptyText: '2',
                 },
+                {
+                    xtype: 'proxmoxcheckbox',
+                    name: 'localize-reads',
+                    fieldLabel: gettext('Localize reads'),
+                    boxLabel: gettext('Prefer nearest replica for VM disk reads'),
+                },
             ],
             onGetValues: function (values) {
-                ['cluster-network', 'size', 'min_size'].forEach(function (field) {
+                ['cluster-network', 'size', 'min_size', 'localize-reads'].forEach(function (field) {
                     if (!values[field]) {
                         delete values[field];
                     }
-- 
2.47.3


* Re: [PATCH manager 0/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
  2026-03-25  3:51 [PATCH manager 0/1] ceph: add opt-in locality-aware replica reads (crush_location_hook) Kefu Chai
  2026-03-25  3:51 ` [PATCH manager 1/1] " Kefu Chai
@ 2026-03-26  3:44 ` Kefu Chai
  1 sibling, 0 replies; 3+ messages in thread
From: Kefu Chai @ 2026-03-26  3:44 UTC (permalink / raw)
  To: Kefu Chai, pve-devel

Hi,

Putting a hold on this series for now.

Friedrich kindly pointed out that Tentacle v20.2.0 ships with a regression [1]
that affects rbd_read_from_replica_policy=localize. The issue is that commit
4b01c004b5d [2] ("PrimaryLogPG: don't accept ops with mixed balance_reads and
rwordered flags") causes the OSD to reject write ops that carry the LOCALIZE_READS
flag, returning -EINVAL. Since librbd sets this flag connection-wide when the
localize policy is active, this can lead to silent write failures.

The fix (a revert, PR #66611 [3]) has been merged to the tentacle branch and
should ship with v20.2.1, which is currently in QE validation [4]. Squid is not
affected — the problematic commit was only cherry-picked into tentacle.

I'll resend once v20.2.1 is released and picked up by our Tentacle packages.
The patch itself is opt-in, so there's no urgency.

Thanks,
Kefu

[1] https://tracker.ceph.com/issues/73997
[2] https://github.com/ceph/ceph/commit/4b01c004b5dc342cbdfb7cb26b47f6afe6245599
[3] https://github.com/ceph/ceph/pull/66611
[4] https://tracker.ceph.com/issues/74838

On Wed Mar 25, 2026 at 11:51 AM CST, Kefu Chai wrote:
> This patch was prompted by a forum thread [1] in which a user reported
> persistent high IO wait on PostgreSQL VMs running on a three-AZ Ceph
> cluster. The discussion surfaced a general optimization opportunity:
> librbd, by default, always reads from the primary OSD regardless of
> its location. In a multi-AZ deployment, that can mean every read pays
> a cross-AZ round-trip even when a same-AZ replica is available.
>
> rbd_read_from_replica_policy = localize addresses this by directing
> librbd to prefer the nearest replica, but it requires the client to
> declare its own position in the CRUSH hierarchy. This patch ships a
> hook script that supplies that position by querying the live CRUSH map
> (ceph osd crush find), and wires it up as an opt-in in pveceph init.
>
> The benefit scales with topology: in a multi-AZ cluster it keeps reads
> within the same AZ; in a hyperconverged setup, reads to a co-located
> OSD never leave the host at all. The feature is opt-in because it can
> degrade performance when replicas are equidistant or when the hook
> falls back to an incorrect CRUSH root — see the commit message for
> details.
>
> [1] https://forum.proxmox.com/threads/ceph-vm-with-high-io-wait.181751/
>   
>
> Kefu Chai (1):
>   ceph: add opt-in locality-aware replica reads (crush_location_hook)
>
>  PVE/API2/Ceph.pm                       | 17 ++++++++++
>  bin/Makefile                           |  3 +-
>  bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
>  www/manager6/ceph/CephInstallWizard.js |  8 ++++-
>  4 files changed, 69 insertions(+), 2 deletions(-)
>  create mode 100644 bin/ceph-crush-location
