From: Fabian Grünbichler
To: pve-devel@lists.proxmox.com
Date: Wed, 13 Jul 2022 15:13:59 +0200
Message-Id: <20220713131359.2771787-1-f.gruenbichler@proxmox.com>
X-Mailer: git-send-email 2.30.2
Subject: [pve-devel] [PATCH access-control] auth key: fix double rotation in clusters
List-Id: Proxmox VE development discussion

there is a (hard to trigger) race that can cause a double rotation of
the auth key, with potentially confusing fallout (various processes on
different nodes having an inconsistent view of the current and previous
auth keys, resulting in "random" invalid ticket errors until the next
proper key rotation 24h later).

the underlying cause is that `stat()` calls are exempt from our
otherwise non-cached/direct_io handling of pmxcfs I/O, which allows the
following sequence of events to take place:

LAST: mtime of current auth key

- current epoch advances to LAST + 24h

the following can be arbitrarily interleaved between node A and B:

- LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
- LAST+24h node A: it returns 0 (rotation required)
- LAST+24h node A: rotate_authkey() is called
- LAST+24h node A: cfs_lock_authkey is called
- LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
- LAST+24h node B: key is not yet cached in-memory by current process
- LAST+24h node B: key file is opened, stat-ed, read, parsed, and
  content+mtime is cached (the kernel will now cache this stat result
  for 1s unless the path is opened)
- LAST+24h node B: it returns 0 (rotation required)
- LAST+24h node B: rotate_authkey() is called
- LAST+24h node B: cfs_lock_authkey is called

the following is mutex-ed via a cfs_lock:

- LAST+24h node A: lock is obtained
- LAST+24h node A: check_authkey() is called
- LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
  cached mtime and content are returned
- LAST+24h node A: it returns 0 (rotation still required)
- LAST+24h node A: get_pubkey() is called and returns current auth key
- LAST+24h node A: new keypair is generated and persisted
- LAST+24h node A: cfs_lock is released
- LAST+24h node B: changes by node A are processed by pmxcfs
- LAST+24h node B: lock is obtained
- LAST+24h node B: check_authkey() is called
- LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
  since the stat call is handled by the kernel/page cache, not by
  pmxcfs, cached mtime and content are returned
- LAST+24h node B: it returns 0 (rotation still required)
- LAST+24h node B: get_pubkey() is called and returns either the
  previous key or the key written by node A (depending on whether page
  cache or pmxcfs answers the stat call)
- LAST+24h node B: new keypair is generated, key returned by last
  get_pubkey call is written as old key

the end result is that some nodes and processes will treat the key
generated by node A as "current", while others will treat the one
generated by node B as "current". both have the same mtime, so the
in-memory cache hash won't be updated unless the service is restarted or
another rotation happens. depending on who generated the ticket and who
attempts validating it, a ticket might be rejected as invalid even
though the generating party would treat it as valid, and time on all
nodes is properly synced.

there seems to be no way for pmxcfs to pro-actively invalidate the page
cache entry safely (since we'd need to do so while writes to the same
path can happen concurrently), so work around by forcing an open/close
at the (stat) call site which does the work for us. regular reads are
not affected since those already bypass the page cache entirely anyway.

thankfully, in almost all cases the following sequence has enough
synchronization overhead baked in to avoid triggering the issue:

- cfs_lock
- generate key
- create tmp file for old key
- write tmp file
- rename tmp file into proper place
- create tmp file for new pub key
- write tmp file
- rename tmp file into proper place
- create tmp file for new priv key
- write tmp file
- rename tmp file into proper place
- release lock

that being said, there has been at least one report where this was
triggered in the wild from time to time.

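for illustration only, the open/close-before-stat workaround described
above can be sketched as a small standalone C helper. this is a sketch
under assumptions and not part of the patch: `fresh_mtime()` and
`rotation_due()` are hypothetical names, and the `>=` comparison is an
assumption derived from the "LAST + 24h" step in the sequence above. on a
regular (non-FUSE) file system the extra open/close has no effect on
caching; the idea, taken from the description above, is that on pmxcfs
opening the path stops the kernel from reusing a cached stat result.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the mtime of `path`, opening and closing
 * the file first so that a previously cached stat result is not reused
 * (the behaviour assumed for pmxcfs in the text above).
 * Returns (time_t)-1 if the file cannot be opened or stat-ed.
 */
time_t fresh_mtime(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return (time_t)-1;
    close(fd);

    struct stat st;
    if (stat(path, &st) != 0)
        return (time_t)-1;

    return st.st_mtime;
}

/*
 * Hypothetical helper: a key is due for rotation once at least
 * `lifetime` seconds have passed since its mtime.
 */
bool rotation_due(time_t mtime, time_t now, time_t lifetime)
{
    return now - mtime >= lifetime;
}
```

a caller holding the cluster-wide lock would re-check with something like
`rotation_due(fresh_mtime(path), time(NULL), lifetime)` before generating
a new key. in the patch itself the same effect is achieved by calling
get_pubkey(), which opens and reads the file, before check_authkey().
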
it is easy to reproduce by increasing the attr_timeout and entry_timeout
fuse settings inside pmxcfs to increase the time stat results are
treated as valid/retained in the page cache:

-----8<-----
diff --git a/data/src/pmxcfs.c b/data/src/pmxcfs.c
index d78a248..e3e807b 100644
--- a/data/src/pmxcfs.c
+++ b/data/src/pmxcfs.c
@@ -935,7 +935,7 @@ int main(int argc, char *argv[])
 
 	mkdir(CFSDIR, 0755);
 
-	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", NULL};
+	char *fa[] = { "-f", "-odefault_permissions", "-oallow_other", "-oentry_timeout=5", "-oattr_timeout=5", NULL};
 
 	struct fuse_args fuse_args = FUSE_ARGS_INIT(sizeof (fa)/sizeof(gpointer) - 1, fa);
 
----->8-----

in which case it's even easy to trigger more than double rotation in a
bigger test cluster (stopping all PVE services except for pve-cluster
helps avoid interference):

on a single node:

$ touch --date yesterday /etc/pve/authkey.pub

in parallel (i.e., via tmux synchronized panes):

-----8<-----
use strict;
use warnings;

use PVE::Cluster;
use PVE::AccessControl;

PVE::Cluster::cfs_update();

# ensure page cache entry is there
PVE::AccessControl::check_authkey(1);
PVE::AccessControl::check_authkey(1);
# now attempt rotation
PVE::AccessControl::rotate_authkey();
----->8-----

Thanks to Wolfgang Bumiller for assistance in triaging and exploring
various avenues of fixing.

Signed-off-by: Fabian Grünbichler
---
apologies for the wall of text - but probably better to have too much
info preserved than too little ;)

 src/PVE/AccessControl.pm | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/PVE/AccessControl.pm b/src/PVE/AccessControl.pm
index 3725a7d..953f135 100644
--- a/src/PVE/AccessControl.pm
+++ b/src/PVE/AccessControl.pm
@@ -203,7 +203,16 @@ sub rotate_authkey {
     return if $authkey_lifetime == 0;
 
     PVE::Cluster::cfs_lock_authkey(undef, sub {
-	# re-check with lock to avoid double rotation in clusters
+	# stat() calls might be answered from the kernel page cache for up to
+	# 1s, so this special dance is needed to avoid a double rotation in
+	# clusters *despite* the cfs_lock context..
+
+	# drop in-process cache hash
+	$pve_auth_key_cache = {};
+	# force open/close of file to invalidate page cache entry
+	get_pubkey();
+	# now re-check with lock held and page cache invalidated so that stat()
+	# does the right thing, and any key updates by other nodes are visible.
 	return if check_authkey();
 
 	my $old = get_pubkey();
-- 
2.30.2