From: "Fabian Grünbichler" <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>,
	Friedrich Weber <f.weber@proxmox.com>
Subject: Re: [pve-devel] [PATCH storage/zfsonlinux v2 0/3] fix #4997: lvm: avoid autoactivating (new) LVs after boot
Date: Fri, 7 Mar 2025 13:14:53 +0100 (CET)
Message-ID: <1233088344.6300.1741349693091@webmail.proxmox.com>
In-Reply-To: <20250307095245.65698-1-f.weber@proxmox.com>

> Friedrich Weber <f.weber@proxmox.com> hat am 07.03.2025 10:52 CET geschrieben:
> # Summary
> 
> With default settings, LVM autoactivates LVs when it sees a new VG, e.g. after
> boot or iSCSI login. In a cluster with guest disks on a shared LVM VG (e.g. on
> top of iSCSI/Fibre Channel (FC)/direct-attached SAS), this can indirectly cause
> guest creation or migration to fail. See bug #4997 [1] and patch #3 for
> details.
> 
> The goal of this series is to avoid autoactivating LVs that hold guest disks in
> order to fix #4997. With this series applied, (newly created) LVs should only
> be active on the node that hosts the corresponding guest.
> 
> # Difference to v1
> 
> v1 [0] went with an approach that creates new LVs with the "activation skip"
> flag set, which means lvchange will only activate the LV if the `-K` flag is
> passed. This approach requires a two-step patch apply process to not break
> mixed-version clusters: With PVE 8.4, we would add a change that passes the
> `-K` flag when activating, and only with PVE 9 would we also create new LVs
> with the "activation skip" flag. This approach works but is relatively complex.
> 
> # This v2 approach
> 
> This v2 goes a different route: Instead of the "activation skip" flag, we use
> the `--setautoactivation n` flag for LVs. This flag is available in Debian's
> LVM since Bookworm and only concerns autoactivation, not normal activation as
> performed by our LVM plugin. Hence, it is enough to create new LVs with
> `--setautoactivation n`; we don't require any two-step patch apply process. We
> may still need to wait for PVE 9 to apply this though (see next section).
> Considering this, the advantage over v1 may appear quite small, though I still
> find v2's approach preferable because the `--setautoactivation` flag seems less
> invasive than the activation skip flag.

thanks for the clear description and analysis! :)
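
for reference, on the CLI the flag boils down to something like the following
(untested sketch from memory, VG/LV names made up):

    # create a guest disk LV with autoactivation disabled
    lvcreate --setautoactivation n -L 32G -n vm-100-disk-0 sharedvg

    # the flag shows up in the lvs/vgs reports ...
    lvs -o lv_name,autoactivation sharedvg

    # ... and explicit (de)activation as done by the storage plugin is
    # unaffected
    lvchange -ay sharedvg/vm-100-disk-0
    lvchange -an sharedvg/vm-100-disk-0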

> # Mixed 7/8 cluster
> 
> Unfortunately, we need to consider the mixed-version cluster between PVE 7 and
> PVE 8 because PVE 7/Bullseye's LVM does not know `--setautoactivation`. A user
> upgrading from PVE 7 will temporarily have a mixed 7/8 cluster. Once this
> series is applied, the PVE 8 nodes will create new LVs with
> `--setautoactivation n`, which the PVE 7 nodes do not know. In my tests, the
> PVE 7 nodes can read/interact with such LVs just fine, *but*: As soon as a PVE
> 7 node creates a new (unrelated) LV, the `--setautoactivation n` flag is reset
> to default `y` on *all* LVs of the VG. I presume this is because creating a new
> LV rewrites metadata, and the PVE 7 LVM doesn't write out the
> `--setautoactivation n` flag. I imagine (have not tested) this will cause
> problems on a mixed cluster.
> 
> Hence, it may be safer to hold off on applying patches #2/#3 until PVE 9,
> because then we can be sure all nodes run at least PVE 8.

wow - I wonder what other similar timebombs are lurking there??

e.g., from the changelog for bullseye -> bookworm:

2.03.15: "Increase some hash table size to better support large device sets." (hopefully those are in memory?)
2.03.14: various VDO things (not used by us at least)
2.03.12: "Allow attaching cache to thin data volume." and other thin-related changes (not relevant for shared thankfully), auto activation, more VDO things

bookworm -> trixie:

2.03.28: various things using radix_trees now (hopefully in memory?)
2.03.25: same
2.03.24: "Support external origin between different thin-pool." & "Support creation of thin-pool with VDO use for its data volume" (local only thankfully)
2.03.23: "Set the first lv_attr flag for raid integrity images to i or I." (not used by us, maybe in memory only)
2.03.22: "Support parsing of vdo geometry format version 4.", "Fix parsing of VDO metadata.", "Allow snapshots of raid+integrity LV." (all not applicable to regular PVE setups)
2.03.18: "Add support for writecache metadata_only and pause_writeback settings." (?)

so at least nothing that jumps out as definitely problematic for our common setups..

> # Interaction with zfs-initramfs
> 
> One complication is that zfs-initramfs ships an initramfs-tools script that
> unconditionally activates *all* VGs that are visible at boot time, see [2].
> This is the case for local VGs, but also for shared VGs on FC/SAS. Patch #2 of
> this series fixes this by making the script perform autoactivation instead,
> which honors the `--setautoactivation` flag. The patch is necessary to fix
> #4997 on FC/SAS-attached LUNs too.
> 
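
IIUC the gist of that patch is switching the script from plain activation to
autoactivation, i.e. roughly the following (sketch, not the literal diff):

    # before: activate all LVs of every visible VG, ignoring the flag
    lvm vgchange -ay --sysinit
    # after: only autoactivate VGs/LVs without --setautoactivation n
    lvm vgchange -aay --sysinit
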
> # Bonus fix for FC/SAS multipath+LVM issue
> 
> As it turns out, this series seems to additionally fix an issue on hosts with
> LVM on FC/SAS-attached LUNs *with multipath* where LVM would report "Device
> mismatch detected" warnings because the LVs are activated too early in the boot
> process before multipath is available. Our current suggested workaround is to
> install multipath-tools-boot [2]. With this series applied, this shouldn't be
> necessary anymore, as (newly created) LVs are not auto-activated after boot.
> 
> # LVM-thick/LVM-thin
> 
> Note that this change affects all LVs on LVM-thick, not just ones on shared
> storage. As a result, even on single-node hosts, local guest disk LVs on
> LVM-thick will no longer be automatically active after boot (after applying
> all patches of this series). Guest disk LVs on LVM-thin will still be
> auto-activated, but since LVM-thin storage is necessarily local, we don't run
> into #4997 here.

we could check the shared property, but I don't think having them not
auto-activated hurts as long as it is documented..
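
users who do rely on local LVs coming up at boot can always opt back in per
LV, e.g. (made-up names):

    lvchange --setautoactivation y pve/vm-100-disk-0
    # or simply activate the LV once from a unit/hook
    lvchange -ay pve/vm-100-disk-0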

> 
> # Transition to LVs with `--setautoactivation n`
> 
> Both v1 and v2 approaches only take effect for new LVs, so we should probably
> also have pve8to9 check for guest disks on (shared?) LVM that have
> autoactivation enabled, and suggest that the user manually disable
> autoactivation on the LVs, or even the entire VG if it holds only PVE-managed
> LVs.

if we want to wait for PVE 9 anyway to start enabling (disabling? ;)) it, then
the upgrade script would be a nice place to tell users to fix up their volumes?
OTOH, setting the flag automatically starting with PVE 9 also for existing
volumes should have no downsides, we need to document anyway that the behaviour
there changed (so that people that rely on them becoming auto-activated on boot
can adapt whatever is relying on that).. or we could provide a script that does
it post-upgrade..
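
such a post-upgrade script wouldn't need to be much more than the following
(sketch only, made-up VG name, would need the usual safety checks and
filtering for PVE-owned LVs):

    # disable autoactivation on the VG itself and on all existing LVs in it
    vgchange --setautoactivation n sharedvg
    for lv in $(lvs --noheadings -o lv_name sharedvg); do
        lvchange --setautoactivation n "sharedvg/$lv"
    done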

> We could implement something on top to make the transition smoother, some ideas:
> 
> - When activating existing LVs, check the auto activation flag, and if auto
>   activation is still enabled, disable it.

the only question is whether we want to "pay" for that on each activate_volume?
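
the "price" would essentially be one extra LVM invocation per activation to
query the flag, plus a one-time lvchange if it is still enabled, e.g.
something along these lines (made-up names):

    # query the current autoactivation setting for the LV
    lvs --noheadings -o autoactivation sharedvg/vm-100-disk-0
    # ... and disable it if it is still enabled
    lvchange --setautoactivation n sharedvg/vm-100-disk-0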

> - When creating a new VG via the LVM storage plugin or the /disks/lvm API
>   endpoints, disable autoactivation for the whole VG. But this may become
>   confusing as then we'd deal with the flag on both LVs and VGs. Also, especially
>   for FC/SAS-attached LUNs, users may not use the GUI/API and instead create the
>   VG via the command line. We could adjust our guides to use `vgcreate
>   --setautoactivation n` [4], but not all users may follow these guides.
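
FWIW, on the command line that part would just be something like the
following (untested, example device/VG names):

    # create the shared VG with autoactivation disabled from the start
    vgcreate --setautoactivation n sharedvg /dev/mapper/mpatha
    vgs -o vg_name,autoactivation sharedvg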


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Thread overview: 13+ messages
2025-03-07  9:52 Friedrich Weber
2025-03-07  9:52 ` [pve-devel] [PATCH zfsonlinux v2 1/3] zfs-initramfs: add patch to avoid activating all LVs at boot Friedrich Weber
2025-03-07 10:57   ` Friedrich Weber
2025-03-07 11:49   ` Fabian Grünbichler
2025-03-07 17:01     ` Friedrich Weber
2025-03-16 19:24   ` [pve-devel] applied: " Thomas Lamprecht
2025-03-07  9:52 ` [pve-devel] [PATCH pve-storage v2 2/3] lvm: create: use multiple lines for lvcreate commandline definition Friedrich Weber
2025-03-07  9:52 ` [pve-devel] [PATCH pve-storage v2 3/3] fix #4997: lvm: create: disable autoactivation for new logical volumes Friedrich Weber
2025-03-07 12:14 ` Fabian Grünbichler [this message]
2025-03-10 14:01   ` [pve-devel] [PATCH storage/zfsonlinux v2 0/3] fix #4997: lvm: avoid autoactivating (new) LVs after boot Friedrich Weber
2025-03-11 10:40     ` Fabian Grünbichler
2025-03-12  8:39       ` Friedrich Weber
2025-04-29 11:43         ` Friedrich Weber
