From: "Kefu Chai" <k.chai@proxmox.com>
To: "Kefu Chai" <k.chai@proxmox.com>, <pve-devel@lists.proxmox.com>
Subject: Re: [PATCH manager 0/1] ceph: add opt-in locality-aware replica reads (crush_location_hook)
Date: Thu, 26 Mar 2026 11:44:40 +0800
Message-ID: <DHCEKP54BYYB.3OV8EU87R0I22@proxmox.com>
In-Reply-To: <20260325035104.2264118-1-k.chai@proxmox.com>
Hi,
Putting a hold on this series for now.
Friedrich kindly pointed out that Tentacle v20.2.0 ships with a regression [1]
that affects rbd_read_from_replica_policy=localize. The issue is that commit
4b01c004b5d [2] ("PrimaryLogPG: don't accept ops with mixed balance_reads and
rwordered flags") causes the OSD to reject write ops that carry the LOCALIZE_READS
flag, returning -EINVAL. Since librbd sets this flag connection-wide when the
localize policy is active, this can lead to silent write failures.
The fix (a revert, PR #66611 [3]) has been merged to the tentacle branch and
should ship with v20.2.1, which is currently in QE validation [4]. Squid is not
affected, because the problematic commit was only cherry-picked into tentacle.
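For illustration only (this is not part of the series): a minimal Python
sketch of how a caller could check whether a given Ceph release is the one
hit by the regression before enabling the localize policy. The function
names and the AFFECTED_RELEASES set are hypothetical, and the sketch assumes
the fix ships in v20.2.1 as expected, so only v20.2.0 is listed.

```python
import re

# Assumption: per the tracker issue, only Tentacle v20.2.0 carries the
# regression; Squid is unaffected and v20.2.1 is expected to have the revert.
AFFECTED_RELEASES = {(20, 2, 0)}


def parse_ceph_version(text: str) -> tuple[int, int, int]:
    """Extract the first x.y.z triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def localize_reads_affected(version_text: str) -> bool:
    """Return True if this release is known to reject writes with LOCALIZE_READS."""
    return parse_ceph_version(version_text) in AFFECTED_RELEASES
```

A caller would pass the version string reported by the cluster and skip
setting rbd_read_from_replica_policy=localize when the function returns True.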
I'll resend once v20.2.1 is released and picked up by our Tentacle packages.
The patch itself is opt-in, so there's no urgency.
Thanks,
Kefu
[1] https://tracker.ceph.com/issues/73997
[2] https://github.com/ceph/ceph/commit/4b01c004b5dc342cbdfb7cb26b47f6afe6245599
[3] https://github.com/ceph/ceph/pull/66611
[4] https://tracker.ceph.com/issues/74838
On Wed Mar 25, 2026 at 11:51 AM CST, Kefu Chai wrote:
> This patch was prompted by a forum thread [1] in which a user reported
> persistent high IO wait on PostgreSQL VMs running on a three-AZ Ceph
> cluster. The discussion surfaced a general optimization opportunity:
> librbd, by default, always reads from the primary OSD regardless of
> its location. In a multi-AZ deployment, that can mean every read pays
> a cross-AZ round-trip even when a same-AZ replica is available.
>
> rbd_read_from_replica_policy = localize addresses this by directing
> librbd to prefer the nearest replica, but it requires the client to
> declare its own position in the CRUSH hierarchy. This patch ships a
> hook script that supplies that position by querying the live CRUSH map
> (ceph osd crush find), and wires it up as an opt-in in pveceph init.
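As a rough illustration of what such a hook has to produce (this is not the
script from the patch): a CRUSH location hook prints a single line of
key=value pairs on stdout. The Python sketch below only covers that
formatting step. The function name, the BUCKET_ORDER list, and the example
bucket names are assumptions, and querying the cluster is left out.

```python
# Assumption: bucket types ordered from the top of the default CRUSH
# hierarchy down to the host; the ordering is cosmetic here.
BUCKET_ORDER = [
    "root", "region", "zone", "datacenter", "room",
    "pod", "pdu", "row", "rack", "chassis", "host",
]


def format_crush_location(location: dict[str, str]) -> str:
    """Render a {bucket_type: bucket_name} mapping as one hook output line."""
    known = [key for key in BUCKET_ORDER if key in location]
    # Custom bucket types not in the default list go last, alphabetically.
    extra = sorted(key for key in location if key not in BUCKET_ORDER)
    return " ".join(f"{key}={location[key]}" for key in known + extra)


if __name__ == "__main__":
    # Hypothetical position of a client node in a three-AZ cluster.
    print(format_crush_location({"host": "pve1", "zone": "az1", "root": "default"}))
```

Which bucket types appear in the mapping depends on the cluster's CRUSH map;
the hook only reports the position, it does not validate it.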
>
> The benefit scales with topology: in a multi-AZ cluster it keeps reads
> within the same AZ; in a hyperconverged setup, reads to a co-located
> OSD never leave the host at all. The feature is opt-in because it can
> degrade performance when replicas are equidistant or when the hook
> falls back to an incorrect CRUSH root — see the commit message for
> details.
>
> [1] https://forum.proxmox.com/threads/ceph-vm-with-high-io-wait.181751/
>
>
> Kefu Chai (1):
>   ceph: add opt-in locality-aware replica reads (crush_location_hook)
>
>  PVE/API2/Ceph.pm                       | 17 ++++++++++
>  bin/Makefile                           |  3 +-
>  bin/ceph-crush-location                | 43 ++++++++++++++++++++++++++
>  www/manager6/ceph/CephInstallWizard.js |  8 ++++-
>  4 files changed, 69 insertions(+), 2 deletions(-)
>  create mode 100644 bin/ceph-crush-location