From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kefu Chai
To: pve-devel@lists.proxmox.com
Subject: [PATCH manager] ceph: mds: reimplement hotstandby via ceph fs set allow_standby_replay
Date: Sun, 29 Mar 2026 12:46:18 +0800
Message-ID: <20260329044618.2033129-1-k.chai@proxmox.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
List-Id: Proxmox VE development discussion

PVE was writing two per-MDS config options into ceph.conf on every MDS
creation:

    [mds.<id>]
    mds_standby_for_name = pve
    mds_standby_replay = true    (when hotstandby=1)

Neither option exists in Ceph Squid or Tentacle. A grep of the upstream
Ceph source confirms that both 'mds_standby_for_name' and
'mds_standby_replay' are absent from src/common/options/mds.yaml.in, the
authoritative registry of valid MDS config keys. In MDSMap.cc, the
standby_for_name field is hardcoded to an empty string during encoding
and is only decoded for backward compatibility with old state. Ceph
ignores these unknown options (potentially logging warnings).

There are two distinct problems:

1. mds_standby_for_name = 'pve' was written unconditionally regardless
   of what filesystems actually exist. The goal of this option was to
   make the standby MDS preferentially serve a specific filesystem.
   However, the hardcoded value 'pve' never matched the PVE default
   filesystem name 'cephfs' (see FS.pm:
   $fs_name = $param->{name} // 'cephfs'), so the option was always
   pointing at a nonexistent filesystem. Even with the correct name, the
   option is a no-op in modern Ceph. Removing it has no effect on
   cluster behaviour.

2. mds_standby_replay = 'true' (the hotstandby feature) was the old
   per-daemon mechanism for standby replay.
   The goal, keeping a standby MDS replaying the active MDS journal for
   faster failover, is still valid and the feature still exists in Ceph
   Squid/Tentacle, but the mechanism changed: the setting is now
   per-filesystem via 'ceph fs set <fs_name> allow_standby_replay true',
   stored in the MDS map rather than as a daemon config key
   (FSCommands.cc:SetHandler). Because the old config key is silently
   ignored, hotstandby has had no actual effect on Ceph Squid/Tentacle
   clusters.

Fix:

- Remove the unconditional mds_standby_for_name write entirely.

- Add a 'filesystem' parameter to the createmds API (optional, defaults
  to 'cephfs'). When 'hotstandby' is set, call the mon command
  'ceph fs set <fs_name> allow_standby_replay true' after the MDS daemon
  is created, replacing the dead ceph.conf write with the correct modern
  mechanism.

- If the mon command fails (e.g., the filesystem does not yet exist),
  the error is printed as a warning and MDS creation is not rolled
  back. The MDS is running and can serve as a standby regardless; the
  operator can enable standby replay separately once the filesystem is
  created.

- Update the hotstandby parameter description to reflect the new
  behaviour and document the companion 'filesystem' parameter.

Signed-off-by: Kefu Chai
---
 PVE/API2/Ceph/MDS.pm | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/PVE/API2/Ceph/MDS.pm b/PVE/API2/Ceph/MDS.pm
index f7188cee..15f13b17 100644
--- a/PVE/API2/Ceph/MDS.pm
+++ b/PVE/API2/Ceph/MDS.pm
@@ -115,8 +115,18 @@ __PACKAGE__->register_method({
                 optional => 1,
                 default => '0',
                 description =>
-                    "Determines whether a ceph-mds daemon should poll and replay the log of an active MDS. "
-                    . "Faster switch on MDS failure, but needs more idle resources.",
+                    "Use together with 'filesystem' to enable standby-replay "
+                    . "for the given CephFS. Keeps a standby MDS replaying the "
+                    . "active MDS journal for faster failover.",
+            },
+            filesystem => {
+                type => 'string',
+                optional => 1,
+                default => 'cephfs',
+                pattern => qr|^[^:/\s]+$|,
+                description =>
+                    "The name of the CephFS filesystem to enable standby replay "
+                    . "for when 'hotstandby' is set. Defaults to 'cephfs'.",
             },
         },
     },
@@ -158,11 +168,6 @@ __PACKAGE__->register_method({
             }
 
             $cfg->{$section}->{host} = $nodename;
-            $cfg->{$section}->{'mds_standby_for_name'} = 'pve';
-
-            if ($param->{hotstandby}) {
-                $cfg->{$section}->{'mds_standby_replay'} = 'true';
-            }
 
             cfs_write_file('ceph.conf', $cfg);
 
@@ -178,6 +183,26 @@ __PACKAGE__->register_method({
 
                 die "$err\n";
             }
+
+            if ($param->{hotstandby}) {
+                my $fs_name = $param->{filesystem} // 'cephfs';
+                print "Enabling standby replay for filesystem '$fs_name'...\n";
+                eval {
+                    $rados->mon_command({
+                        prefix => 'fs set',
+                        fs_name => $fs_name,
+                        var => 'allow_standby_replay',
+                        val => 'true',
+                        format => 'plain',
+                    });
+                };
+                if (my $err = $@) {
+                    chomp $err;
+                    warn "Could not enable standby replay for '$fs_name': $err\n"
+                        . "Run 'ceph fs set $fs_name allow_standby_replay true'"
+                        . " manually once the filesystem exists.\n";
+                }
+            }
         };
 
         return $rpcenv->fork_worker('cephcreatemds', "mds.$mds_id", $authuser, $worker);
-- 
2.47.3