From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 1 Apr 2025 11:52:36 +0200 (CEST)
From: Fabian Grünbichler <f.gruenbichler@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Message-ID: <502327008.3598.1743501156797@webmail.proxmox.com>
In-Reply-To: <mailman.127.1742814976.359.pve-devel@lists.proxmox.com>
References: <20250324111529.338025-1-alexandre.derumier@groupe-cyllene.com>
 <mailman.127.1742814976.359.pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node

> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 24.03.2025 12:15 CET:
> verify that node is dead from corosync && ssh
> and move config file from /etc/pve directly

there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI:
the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were already started before might still be running as well, and any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), this can easily corrupt guest volumes completely if you attempt to recover and start such a guest. HA protects against this - node X will fence itself before node A attempts recovery, so there is never a situation where both nodes try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not nearly enough to make this safe.

the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node the config currently resides on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios, with careful consideration of the implications:

- when doing a migration, which is initiated by the source node currently owning the guest, so it willingly hands over control to the new node - safe by definition (no stealing involved and proper locking in place)
- when doing an HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens, and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks)

your code below is not protected by the HA stack, so there is a race involved - the node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config, anything else doesn't make sense and doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it. there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.

I understand that this would be convenient to expose, but it is also really dangerous without understanding the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button, mess up and blame PVE. at the same time, there already is an implementation that handles this safely - it's called HA ;) so I'd rather spend some time focusing on improving the robustness of our HA stack, rather than adding such a footgun.
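to illustrate the locking point above, here is a minimal, self-contained sketch (not actual PVE code - the lock directories and the take_local_lock helper are made up for the example) of why a node-local flock-style lock cannot protect anything across nodes: each node only ever locks a file on its own filesystem, so both sides happily "acquire" the lock for the same VMID at the same time.

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl ':flock';

# hypothetical helper mimicking a per-VMID, node-local lock file
sub take_local_lock {
    my ($lockdir, $vmid) = @_;
    mkdir $lockdir; # ignore failure if the directory already exists
    my $path = "$lockdir/lock-$vmid.conf";
    open(my $fh, '>>', $path) or die "cannot open $path: $!\n";
    flock($fh, LOCK_EX | LOCK_NB) or die "lock for VMID $vmid is busy: $!\n";
    return $fh; # the lock is held as long as $fh stays open
}

# node X and node A each have their own local lock directory - simulated
# here with two different directories on one machine
my $lock_on_node_x = take_local_lock('/tmp/nodeX-locks', 100);
my $lock_on_node_a = take_local_lock('/tmp/nodeA-locks', 100);

# both calls succeed: the two locks live on different inodes, so neither
# "node" can see or block the other
print "both nodes hold the lock for VMID 100 at the same time\n";

the real locking helpers are of course more involved than this, but the failure mode is the same: without fencing there is no lock node A could take that the supposedly dead node X is guaranteed to observe.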
> 
> Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
> ---
>  PVE/API2/Qemu.pm | 56 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index 156b1c7b..58c454a6 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -4764,6 +4764,9 @@ __PACKAGE__->register_method({
>              description => "Target node.",
>              completion => \&PVE::Cluster::complete_migration_target,
>          }),
> +        deadnode => get_standard_option('pve-node', {
> +            description => "Dead source node.",
> +        }),
>          online => {
>              type => 'boolean',
>              description => "Use online/live migration if VM is running. Ignored if VM is stopped.",
> @@ -4813,8 +4816,9 @@ __PACKAGE__->register_method({
>          my $authuser = $rpcenv->get_user();
> 
>          my $target = extract_param($param, 'target');
> +        my $deadnode = extract_param($param, 'deadnode');
> 
> -        my $localnode = PVE::INotify::nodename();
> +        my $localnode = $deadnode ? $deadnode : PVE::INotify::nodename();
>          raise_param_exc({ target => "target is local node."}) if $target eq $localnode;
> 
>          PVE::Cluster::check_cfs_quorum();
> @@ -4835,14 +4839,43 @@ __PACKAGE__->register_method({
>          raise_param_exc({ migration_network => "Only root may use this option." })
>              if $param->{migration_network} && $authuser ne 'root@pam';
> 
> +        raise_param_exc({ deadnode => "Only root may use this option." })
> +            if $param->{deadnode} && $authuser ne 'root@pam';
> +
>          # test if VM exists
> -        my $conf = PVE::QemuConfig->load_config($vmid);
> +        my $conf = $deadnode ? PVE::QemuConfig->load_config($vmid, $deadnode) : PVE::QemuConfig->load_config($vmid);
> 
>          # try to detect errors early
> 
>          PVE::QemuConfig->check_lock($conf);
> 
> -        if (PVE::QemuServer::check_running($vmid)) {
> +        if ($deadnode) {
> +            die "Can't do online migration of a dead node.\n" if $param->{online};
> +            my $members = PVE::Cluster::get_members();
> +            die "The deadnode $deadnode seem to be alive" if $members->{$deadnode} && $members->{$deadnode}->{online};
> +
> +            print "test if deadnode $deadnode respond to ping\n";
> +            eval {
> +                PVE::Tools::run_command("/usr/bin/ping -c 1 $members->{$deadnode}->{ip}");
> +            };
> +            if(!$@){
> +                die "error: ping to target $deadnode is still working. Node seem to be alive.";
> +            }
> +
> +            #make an extra ssh connection to double check that it's not just a corosync crash
> +            my $sshinfo = PVE::SSHInfo::get_ssh_info($deadnode);
> +            my $sshcmd = PVE::SSHInfo::ssh_info_to_command($sshinfo);
> +            push @$sshcmd, 'hostname';
> +            print "test if deadnode $deadnode respond to ssh\n";
> +            eval {
> +                PVE::Tools::run_command($sshcmd, timeout => 1);
> +            };
> +            if(!$@){
> +                die "error: ssh connection to target $deadnode is still working. Node seem to be alive.";
> +            }
> +
> +
> +        } elsif (PVE::QemuServer::check_running($vmid)) {
>              die "can't migrate running VM without --online\n" if !$param->{online};
> 
>              my $repl_conf = PVE::ReplicationConfig->new();
> @@ -4881,7 +4914,22 @@ __PACKAGE__->register_method({
>              PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
>          }
> 
> -        if (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> +        if ($deadnode) {
> +            my $realcmd = sub {
> +                my $config_fn = PVE::QemuConfig->config_file($vmid, $deadnode);
> +                my $new_config_fn = PVE::QemuConfig->config_file($vmid, $target);
> +
> +                rename($config_fn, $new_config_fn)
> +                    or die "failed to move config file to node '$target': $!\n";
> +            };
> +
> +            my $worker = sub {
> +                return PVE::GuestHelpers::guest_migration_lock($vmid, 10, $realcmd);
> +            };
> +
> +            return $rpcenv->fork_worker('qmigrate', $vmid, $authuser, $worker);
> +
> +        } elsif (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> 
>              my $hacmd = sub {
>                  my $upid = shift;
> --
> 2.39.5

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel