Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node

From: Dominik Csapak <d.csapak@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
Date: Tue, 1 Apr 2025 12:19:20 +0200	[thread overview]
Message-ID: <52ef2b59-21ec-4a6c-b528-47f1e11c691e@proxmox.com> (raw)
In-Reply-To: <ce2544c2-d467-4eff-ba4f-6df6c3350fe2@proxmox.com>

On 4/1/25 11:57, Thomas Lamprecht wrote:
> Am 01.04.25 um 11:52 schrieb Fabian Grünbichler:
>>> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> hat am 24.03.2025 12:15 CET geschrieben:
>>> verify that node is dead from corosync && ssh
>>> and move config file from /etc/pve directly
>> there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI..
>>
>> the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were already started before might still be running as well. any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), this can easily completely corrupt guest volumes if you attempt to recover and start such a guest. HA protects against this - node X will fence itself before node A will attempt recovery, so there is never a situation where both nodes will try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not enough by any stretch to make this safe.
>>
>> the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node where the config is currently on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios with careful consideration of the implications:
>> - when doing a migration, which is initiated by the source node that is currently owning the guest, so it willingly hands over control to the new node which is safe by definition (no stealing involved and proper locking in place)
>> - when doing a HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks).
>> your code below is not protected by the HA stack, so there is a race involved - your node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config, anything else doesn't make sense/doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it. there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.
>>
>> I understand that this would be convenient to expose, but it is also really dangerous without understanding the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button and mess up and blame PVE. at the same time there is an actual implementation that safely implements it - it's called HA 😉 so I'd rather spend some time focusing on improving the robustness of our HA stack, rather than adding such a footgun.
>>
> 
> +1 to all of the above.
> 
while i also agree to all said here, I have one counter point to offer:

In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
for what ever reason), the user will fall back to do it manually (iow. 'mv source target')
which is at least as dangerous as exposing over the API, since

* now the admins sharing the system must share root@pam credentials (ssh/console access)
   (alternatively setup sudo, which has it's own problems)

* it promotes manually modifying /etc/pve/ content

* any error could be even more fatal than if done via the API
   (e.g. mv of the wrong file, from the wrong node, etc.)

IMHO ways forward for this scenario could be:

* use cluster level locking only for config move? (not sure if performance is still
   a concern for this action, since parallel moves don't happen too much?)

* provide a special CLI tool/cmd to deal with that -> would minimize potential
   errors but is still contained to root equivalent users

* link to the doc section for it from the UI with a big caveat
   https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery

just my 2c

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel