From: Kefu Chai <k.chai@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager] ceph: mds: reimplement hotstandby via ceph fs set allow_standby_replay
Date: Sun, 29 Mar 2026 12:46:18 +0800
Message-ID: <20260329044618.2033129-1-k.chai@proxmox.com>
PVE was writing two per-MDS config options into ceph.conf on every MDS
creation:

    [mds.<id>]
        mds_standby_for_name = pve
        mds_standby_replay = true    (only when hotstandby=1)

Neither option exists in Ceph Squid or Tentacle. A grep of the upstream
Ceph source confirms that both 'mds_standby_for_name' and
'mds_standby_replay' are absent from src/common/options/mds.yaml.in,
the authoritative registry of valid MDS config keys. In MDSMap.cc, the
standby_for_name field is hardcoded to an empty string during encoding
and is only decoded for backward compatibility with old state. Ceph
ignores these unknown options, at most logging a warning.

There are two distinct problems:

1. mds_standby_for_name = 'pve' was written unconditionally, regardless
   of which filesystems actually exist. The goal of this option was to
   make the standby MDS preferentially serve a specific filesystem.
   However, the hardcoded value 'pve' never matched the PVE default
   filesystem name 'cephfs' (see FS.pm: $fs_name = $param->{name} //
   'cephfs'), so the option always pointed at a nonexistent
   filesystem. Even with the correct name, the option is a no-op in
   modern Ceph. Removing it has no effect on cluster behaviour.

2. mds_standby_replay = 'true' (the hotstandby feature) was the old
   per-daemon mechanism for standby replay. The goal, keeping a
   standby MDS replaying the active MDS journal for faster failover,
   is still valid and the feature still exists in Ceph Squid/Tentacle,
   but the mechanism changed: the setting is now per-filesystem via
   'ceph fs set <fsname> allow_standby_replay true', stored in the MDS
   map rather than as a daemon config key (FSCommands.cc:SetHandler).
   Because the old config key is ignored, hotstandby has had no
   actual effect on Ceph Squid/Tentacle clusters.

Fix:

- Remove the unconditional mds_standby_for_name write entirely.
- Add a 'filesystem' parameter to the createmds API (optional,
  defaults to 'cephfs'). When 'hotstandby' is set, call the mon
  command 'ceph fs set <filesystem> allow_standby_replay true' after
  the MDS daemon is created, replacing the dead ceph.conf write with
  the correct modern mechanism.
- If the mon command fails (e.g., the filesystem does not yet exist),
  the error is printed as a warning and MDS creation is not rolled
  back. The MDS is running and can serve as a standby regardless; the
  operator can enable standby replay separately once the filesystem
  is created.
- Update the hotstandby parameter description to reflect the new
  behaviour and document the companion 'filesystem' parameter.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
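The non-fatal handling described in the commit message can be sketched
in isolation as follows. This is only an illustration under stated
assumptions: MockRados is a hypothetical stand-in for the real RADOS
handle so the snippet runs without a cluster, and the helper name
enable_standby_replay does not appear in the patch itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the RADOS handle (assumption, not PVE code).
# mon_command() dies for any filesystem that was not registered in new().
package MockRados;

sub new {
    my ($class, %known_fs) = @_;
    return bless { fs => {%known_fs} }, $class;
}

sub mon_command {
    my ($self, $cmd) = @_;
    die "Filesystem not found: '$cmd->{fs_name}'\n"
        if !exists $self->{fs}->{ $cmd->{fs_name} };
    # Record the flag so the effect of the call can be inspected.
    $self->{fs}->{ $cmd->{fs_name} }->{ $cmd->{var} } = $cmd->{val};
    return;
}

package main;

# Illustrative helper: returns 1 on success and 0 if the mon command
# failed. It only warns on failure and never dies, mirroring the
# "do not roll back MDS creation" behaviour.
sub enable_standby_replay {
    my ($rados, $fs_name) = @_;
    $fs_name //= 'cephfs';

    eval {
        $rados->mon_command({
            prefix => 'fs set',
            fs_name => $fs_name,
            var => 'allow_standby_replay',
            val => 'true',
            format => 'plain',
        });
    };
    if (my $err = $@) {
        chomp $err;
        warn "Could not enable standby replay for '$fs_name': $err\n";
        return 0;
    }
    return 1;
}

my $rados = MockRados->new(cephfs => {});
my $ok = enable_standby_replay($rados, 'cephfs');
my $missing = enable_standby_replay($rados, 'nonexistent');
print "cephfs: $ok, nonexistent: $missing\n";
```

With the mock above, the second call emits a warning on stderr and
returns 0 instead of aborting, which is the property the patch relies on
when the filesystem has not been created yet.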
PVE/API2/Ceph/MDS.pm | 39 ++++++++++++++++++++++++++++++++-------
1 file changed, 32 insertions(+), 7 deletions(-)
diff --git a/PVE/API2/Ceph/MDS.pm b/PVE/API2/Ceph/MDS.pm
index f7188cee..15f13b17 100644
--- a/PVE/API2/Ceph/MDS.pm
+++ b/PVE/API2/Ceph/MDS.pm
@@ -115,8 +115,18 @@ __PACKAGE__->register_method({
optional => 1,
default => '0',
description =>
- "Determines whether a ceph-mds daemon should poll and replay the log of an active MDS. "
- . "Faster switch on MDS failure, but needs more idle resources.",
+ "Use together with 'filesystem' to enable standby-replay "
+ . "for the given CephFS. Keeps a standby MDS replaying the "
+ . "active MDS journal for faster failover.",
+ },
+ filesystem => {
+ type => 'string',
+ optional => 1,
+ default => 'cephfs',
+ pattern => qr|^[^:/\s]+$|,
+ description =>
+ "The name of the CephFS filesystem to enable standby replay "
+ . "for when 'hotstandby' is set. Defaults to 'cephfs'.",
},
},
},
@@ -158,11 +168,6 @@ __PACKAGE__->register_method({
}
$cfg->{$section}->{host} = $nodename;
- $cfg->{$section}->{'mds_standby_for_name'} = 'pve';
-
- if ($param->{hotstandby}) {
- $cfg->{$section}->{'mds_standby_replay'} = 'true';
- }
cfs_write_file('ceph.conf', $cfg);
@@ -178,6 +183,26 @@ __PACKAGE__->register_method({
die "$err\n";
}
+
+ if ($param->{hotstandby}) {
+ my $fs_name = $param->{filesystem} // 'cephfs';
+ print "Enabling standby replay for filesystem '$fs_name'...\n";
+ eval {
+ $rados->mon_command({
+ prefix => 'fs set',
+ fs_name => $fs_name,
+ var => 'allow_standby_replay',
+ val => 'true',
+ format => 'plain',
+ });
+ };
+ if (my $err = $@) {
+ chomp $err;
+ warn "Could not enable standby replay for '$fs_name': $err\n"
+ . "Run 'ceph fs set $fs_name allow_standby_replay true'"
+ . " manually once the filesystem exists.\n";
+ }
+ }
};
return $rpcenv->fork_worker('cephcreatemds', "mds.$mds_id", $authuser, $worker);
--
2.47.3