From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <t.lamprecht@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id BCE837C514
 for <pve-devel@lists.proxmox.com>; Fri, 15 Jul 2022 13:40:07 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id ACE382791A
 for <pve-devel@lists.proxmox.com>; Fri, 15 Jul 2022 13:39:37 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Fri, 15 Jul 2022 13:39:36 +0200 (CEST)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id BBBEB4231C
 for <pve-devel@lists.proxmox.com>; Fri, 15 Jul 2022 13:39:36 +0200 (CEST)
Message-ID: <f83363a2-74b3-6fe2-82f7-ef6d7932f066@proxmox.com>
Date: Fri, 15 Jul 2022 13:39:35 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101
 Thunderbird/103.0
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
 =?UTF-8?Q?Fabian_Gr=c3=bcnbichler?= <f.gruenbichler@proxmox.com>
References: <20220713131359.2771787-1-f.gruenbichler@proxmox.com>
Content-Language: en-GB
In-Reply-To: <20220713131359.2771787-1-f.gruenbichler@proxmox.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.004 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
Subject: [pve-devel] applied: [PATCH access-control] auth key: fix double
 rotation in clusters
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Fri, 15 Jul 2022 11:40:07 -0000

On 13/07/2022 15:13, Fabian Grünbichler wrote:
> there is a (hard to trigger) race that can cause a double rotation of
> the auth key, with potentially confusing fallout (various processes on
> different nodes having an inconsistent view of the current and previous
> auth keys, resulting in "random" invalid ticket errors until the next
> proper key rotation 24h later).
>
> the underlying cause is that `stat()` calls are exempt from our
> otherwise non-cached/direct_io handling of pmxcfs I/O, which allows
> the following sequence of events to take place:
>
> LAST: mtime of current auth key
>
> - current epoch advances to LAST + 24h
>
> the following can be arbitrarily interleaved between node A and B:
> - LAST+24h node A: pvedaemon/pvestatd on node A calls check_authkey(1)
> - LAST+24h node A: it returns 0 (rotation required)
> - LAST+24h node A: rotate_key() is called
> - LAST+24h node A: cfs_lock_authkey is called
> - LAST+24h node B: pvedaemon/pvestatd calls check_authkey(1)
> - LAST+24h node B: key is not yet cached in-memory by current process
> - LAST+24h node B: key file is opened, stat-ed, read, parsed, and
>   content+mtime is cached (the kernel will now cache this stat result
>   for 1s unless the path is opened)
> - LAST+24h node B: it returns 0 (rotation required)
> - LAST+24h node B: rotate_key() is called
> - LAST+24h node B: cfs_lock_authkey is called
>
> the following is mutex-ed via a cfs_lock:
> - LAST+24h node A: lock is obtained
> - LAST+24h node A: check_authkey() is called
> - LAST+24h node A: key is stat-ed, mtime is still (correctly) LAST,
>   cached mtime and content are returned
> - LAST+24h node A: it returns 0 (rotation still required)
> - LAST+24h node A: get_pubkey() is called and returns current auth key
> - LAST+24h node A: new keypair is generated and persisted
> - LAST+24h node A: cfs_lock is released
> - LAST+24h node B: changes by node A are processed by pmxcfs
> - LAST+24h node B: lock is obtained
> - LAST+24h node B: check_authkey() is called
> - LAST+24h node B: key is stat-ed, mtime is (incorrectly!) still LAST
>   since the stat call is handled by the kernel/page cache, not by
>   pmxcfs, cached mtime and content are returned
> - LAST+24h node B: it returns 0 (rotation still required)
> - LAST+24h node B: get_pubkey() is called and returns either the
>   previous key or the key written by node A (depending on whether the
>   page cache or pmxcfs answers the stat call)
> - LAST+24h node B: new keypair is generated, key returned by last
>   get_pubkey call is written as old key
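
for reference, the stale-stat window described above can be modelled
with a tiny stand-alone simulation (illustrative Python only; the cache
model, the names and the 1s timeout mirror the description above, they
are not actual pmxcfs code):

```python
ATTR_TIMEOUT = 1.0   # seconds the kernel keeps serving a cached stat answer
attr_cache = {}      # node B's kernel attribute cache: path -> (mtime, valid_until)
real_mtime = {"authkey.pub": 0.0}   # mtime as pmxcfs itself would report it

def stat_mtime(path, now):
    """stat() as answered by the kernel attribute cache: a cached entry
    is returned as long as it is still valid, even though pmxcfs already
    knows a newer mtime."""
    cached = attr_cache.get(path)
    if cached and now < cached[1]:
        return cached[0]                                # possibly stale!
    attr_cache[path] = (real_mtime[path], now + ATTR_TIMEOUT)
    return real_mtime[path]

def rotation_required(path, now, last):
    return stat_mtime(path, now) <= last    # mtime unchanged -> rotate

LAST = 0.0  # mtime of the current key; the epoch just reached LAST + 24h

# node B checks first and thereby populates its kernel attribute cache
assert rotation_required("authkey.pub", now=0.0, last=LAST)
# node A rotates under the cfs lock; the real mtime advances
real_mtime["authkey.pub"] = 0.5
# node B re-checks under the lock: the cached (stale) mtime still says
# rotation is required -> the key is rotated a second time
assert rotation_required("authkey.pub", now=0.5, last=LAST)
```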
>
> the end result is that some nodes and processes will treat the key
> generated by node A as "current", while others will treat the one
> generated by node B as "current". both have the same mtime, so the
> in-memory cache hash won't be updated unless the service is restarted
> or another rotation happens. depending on who generated the ticket and
> who attempts validating it, a ticket might be rejected as invalid even
> though the generating party would treat it as valid, even with time on
> all nodes properly synced.
>
> there seems to be no way for pmxcfs to pro-actively invalidate the
> page cache entry safely (since we'd need to do so while writes to the
> same path can happen concurrently), so we work around it by forcing an
> open/close at the (stat) call site, which does the work for us.
> regular reads are not affected since those already bypass the page
> cache entirely anyway.
>
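i.e. the fix boils down to forcing an open before the stat; a rough
Python sketch of that idea (the helper name is invented for
illustration, this is not the actual access-control code):

```python
import os

def uncached_mtime(path):
    # Opening the path forces FUSE to answer a fresh lookup/getattr
    # instead of the kernel re-using its cached stat result; fstat() on
    # the open fd then reflects the current mtime.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.fstat(fd).st_mtime
    finally:
        os.close(fd)
```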
> thankfully in almost all cases, the following sequence has enough
> synchronization overhead baked in to avoid triggering the issue almost
> entirely:
>
> - cfs_lock
> - generate key
> - create tmp file for old key
> - write tmp file
> - rename tmp file into proper place
> - create tmp file for new pub key
> - write tmp file
> - rename tmp file into proper place
> - create tmp file for new priv key
> - write tmp file
> - rename tmp file into proper place
> - release lock
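
each of the write/rename pairs above is the usual atomic-replace
pattern, roughly like the following (generic sketch, not the actual PVE
file helpers):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write a temp file in the same directory, then rename() it into
    # place: readers observe either the complete old or the complete new
    # file, never a partial write (rename is atomic on POSIX).
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    except BaseException:
        os.unlink(tmp)
        raise
    os.rename(tmp, path)
```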
>
> that being said, there has been at least one report where this was
> triggered in the wild from time to time.
>
> it is easy to reproduce by increasing the attr_timeout and
> entry_timeout fuse settings inside pmxcfs to increase the time stat
> results are treated as valid/retained in the page cache:
>
> -----8<-----
>

applied, thanks!