From: Kefu Chai <k.chai@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager] ceph: mds: reimplement hotstandby via ceph fs set allow_standby_replay
Date: Sun, 29 Mar 2026 12:46:18 +0800
Message-ID: <20260329044618.2033129-1-k.chai@proxmox.com>

PVE was writing two per-MDS config options into ceph.conf on every MDS
creation:

  [mds.<id>]
  mds_standby_for_name = pve
  mds_standby_replay = true   (when hotstandby=1)

Neither option exists in Ceph Squid or Tentacle. A grep of the upstream
Ceph source confirms that both 'mds_standby_for_name' and
'mds_standby_replay' are absent from src/common/options/mds.yaml.in,
the authoritative registry of valid MDS config keys. In MDSMap.cc, the
standby_for_name field is hardcoded to an empty string during encoding
and is only decoded for backward compatibility with old state. Ceph
ignores these unknown options, at most logging a warning.

There are two distinct problems:

1. mds_standby_for_name = 'pve' was written unconditionally regardless
   of what filesystems actually exist. The goal of this option was to
   make the standby MDS preferentially serve a specific filesystem.
   However, the hardcoded value 'pve' never matched the PVE default
   filesystem name 'cephfs' (see FS.pm: $fs_name = $param->{name} //
   'cephfs'), so the option was always pointing at a nonexistent
   filesystem. Even with the correct name, the option is a no-op in
   modern Ceph. Removing it has no effect on cluster behaviour.

2. mds_standby_replay = 'true' (the hotstandby feature) was the old
   per-daemon mechanism for standby replay. The goal (keeping a
   standby MDS replaying the active MDS journal for faster failover)
   is still valid and the feature still exists in Ceph Squid/Tentacle,
   but the mechanism changed: the setting is now per-filesystem via
   'ceph fs set <fsname> allow_standby_replay true', stored in the MDS
   map rather than as a daemon config key (FSCommands.cc:SetHandler).
   Because the old config key is silently ignored, hotstandby has had
   no actual effect on Ceph Squid/Tentacle clusters.
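
To check whether standby replay is in effect for a filesystem, the flag
can be read back from the MDS map. A minimal sketch, assuming that
'ceph fs get <fsname>' prints a 'flags' line that lists
allow_standby_replay by name when it is set; the helper name
has_standby_replay is hypothetical and not part of this patch:

```shell
# Hypothetical helper: reads `ceph fs get <fsname>` output on stdin and
# succeeds only if the flags line lists allow_standby_replay.
has_standby_replay() {
    grep '^flags' | grep -qw 'allow_standby_replay'
}

# Usage on a cluster node (not run here):
#   ceph fs get cephfs | has_standby_replay && echo "standby replay enabled"
```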

Fix:
- Remove the unconditional mds_standby_for_name write entirely.
- Add a 'filesystem' parameter to the createmds API (optional,
  defaults to 'cephfs'). When 'hotstandby' is set, call the mon
  command 'ceph fs set <filesystem> allow_standby_replay true' after
  the MDS daemon is created, replacing the dead ceph.conf write with
  the correct modern mechanism.
- If the mon command fails (e.g., the filesystem does not yet exist),
  the error is printed as a warning and MDS creation is not rolled
  back. The MDS is running and can serve as a standby regardless; the
  operator can enable standby replay separately once the filesystem
  is created.
- Update the hotstandby parameter description to reflect the new
  behaviour and document the companion 'filesystem' parameter.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---
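If the warning path is hit because the filesystem does not exist yet,
standby replay can be enabled by hand afterwards. A minimal sketch of
that manual step, assuming a hypothetical wrapper enable_standby_replay
and a CEPH override variable (both illustrative only, not part of this
patch). It mirrors the default and the name restriction of the new
'filesystem' parameter:

```shell
# Hypothetical wrapper around the manual fallback command.
# CEPH may be overridden (e.g. CEPH=echo) for a dry run.
enable_standby_replay() {
    fs="${1:-cephfs}"   # same default as the new 'filesystem' parameter
    case "$fs" in
        *[[:space:]:/]*)
            # same restriction as the schema pattern: no ':', '/' or whitespace
            echo "invalid filesystem name: $fs" >&2
            return 1
            ;;
    esac
    ${CEPH:-ceph} fs set "$fs" allow_standby_replay true
}
```

With CEPH=echo the wrapper only prints the arguments it would pass, so
the command can be reviewed before running it against a cluster.
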
 PVE/API2/Ceph/MDS.pm | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/PVE/API2/Ceph/MDS.pm b/PVE/API2/Ceph/MDS.pm
index f7188cee..15f13b17 100644
--- a/PVE/API2/Ceph/MDS.pm
+++ b/PVE/API2/Ceph/MDS.pm
@@ -115,8 +115,18 @@ __PACKAGE__->register_method({
                 optional => 1,
                 default => '0',
                 description =>
-                    "Determines whether a ceph-mds daemon should poll and replay the log of an active MDS. "
-                    . "Faster switch on MDS failure, but needs more idle resources.",
+                    "Use together with 'filesystem' to enable standby-replay "
+                    . "for the given CephFS. Keeps a standby MDS replaying the "
+                    . "active MDS journal for faster failover.",
+            },
+            filesystem => {
+                type => 'string',
+                optional => 1,
+                default => 'cephfs',
+                pattern => qr|^[^:/\s]+$|,
+                description =>
+                    "The name of the CephFS filesystem to enable standby replay "
+                    . "for when 'hotstandby' is set. Defaults to 'cephfs'.",
             },
         },
     },
@@ -158,11 +168,6 @@ __PACKAGE__->register_method({
             }
 
             $cfg->{$section}->{host} = $nodename;
-            $cfg->{$section}->{'mds_standby_for_name'} = 'pve';
-
-            if ($param->{hotstandby}) {
-                $cfg->{$section}->{'mds_standby_replay'} = 'true';
-            }
 
             cfs_write_file('ceph.conf', $cfg);
 
@@ -178,6 +183,26 @@ __PACKAGE__->register_method({
 
                 die "$err\n";
             }
+
+            if ($param->{hotstandby}) {
+                my $fs_name = $param->{filesystem} // 'cephfs';
+                print "Enabling standby replay for filesystem '$fs_name'...\n";
+                eval {
+                    $rados->mon_command({
+                        prefix => 'fs set',
+                        fs_name => $fs_name,
+                        var => 'allow_standby_replay',
+                        val => 'true',
+                        format => 'plain',
+                    });
+                };
+                if (my $err = $@) {
+                    chomp $err;
+                    warn "Could not enable standby replay for '$fs_name': $err\n"
+                        . "Run 'ceph fs set $fs_name allow_standby_replay true'"
+                        . " manually once the filesystem exists.\n";
+                }
+            }
         };
 
         return $rpcenv->fork_worker('cephcreatemds', "mds.$mds_id", $authuser, $worker);
-- 
2.47.3