public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH pve-manager 1/1] migrate: allow migration from dead node
       [not found] <20250324111529.338025-1-alexandre.derumier@groupe-cyllene.com>
@ 2025-03-24 11:15 ` Alexandre Derumier via pve-devel
  2025-03-24 11:15 ` [pve-devel] [PATCH qemu-server 1/1] qemu: add offline " Alexandre Derumier via pve-devel
  1 sibling, 0 replies; 13+ messages in thread
From: Alexandre Derumier via pve-devel @ 2025-03-24 11:15 UTC (permalink / raw)
  To: pve-devel; +Cc: Alexandre Derumier

From: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH pve-manager 1/1] migrate: allow migration from dead node
Date: Mon, 24 Mar 2025 12:15:28 +0100
Message-ID: <20250324111529.338025-2-alexandre.derumier@groupe-cyllene.com>

Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
---
 www/manager6/window/Migrate.js | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/www/manager6/window/Migrate.js b/www/manager6/window/Migrate.js
index 78d03921..db63e484 100644
--- a/www/manager6/window/Migrate.js
+++ b/www/manager6/window/Migrate.js
@@ -34,7 +34,9 @@ Ext.define('PVE.window.Migrate', {
 
 	formulas: {
 	    setMigrationMode: function(get) {
-		if (get('running')) {
+		if (get('migration.deadnode')) {
+		    return gettext('Dead Node Mode. Please double check that node is really down and vm is really shutdown !!!!!!!!!');
+		} else if (get('running')) {
 		    if (get('vmtype') === 'qemu') {
 			return gettext('Online');
 		    } else {
@@ -117,6 +119,12 @@ Ext.define('PVE.window.Migrate', {
 		target: values.target,
 	    };
 
+	    var node = vm.get('nodename');
+	    if (vm.get('migration.deadnode')) {
+		node = params.target; //send query to target node
+		params.deadnode = vm.get('nodename');
+	    }
+
 	    if (vm.get('migration.mode')) {
 		params[vm.get('migration.mode')] = 1;
 	    }
@@ -135,7 +143,7 @@ Ext.define('PVE.window.Migrate', {
 
 	    Proxmox.Utils.API2Request({
 		params: params,
-		url: '/nodes/' + vm.get('nodename') + '/' + vm.get('vmtype') + '/' + vm.get('vmid') + '/migrate',
+		url: '/nodes/' + node + '/' + vm.get('vmtype') + '/' + vm.get('vmid') + '/migrate',
 		waitMsgTarget: view,
 		method: 'POST',
 		failure: function(response, opts) {
@@ -185,6 +193,13 @@ Ext.define('PVE.window.Migrate', {
 		vm = me.getViewModel(),
 		migrateStats;
 
+	    //check if the source node is dead/offline
+	    const nodeInfo = PVE.data.ResourceStore.getNodes().find(node => node.node === vm.get('nodename'));
+	    if (nodeInfo.status === 'offline') {
+		vm.set('migration.deadnode', 1);
+		return;
+	    }
+
 	    if (vm.get('running')) {
 		vm.set('migration.mode', 'online');
 	    }
-- 
2.39.5




* [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
       [not found] <20250324111529.338025-1-alexandre.derumier@groupe-cyllene.com>
  2025-03-24 11:15 ` [pve-devel] [PATCH pve-manager 1/1] migrate: allow migration from dead node Alexandre Derumier via pve-devel
@ 2025-03-24 11:15 ` Alexandre Derumier via pve-devel
  2025-04-01  9:52   ` Fabian Grünbichler
  1 sibling, 1 reply; 13+ messages in thread
From: Alexandre Derumier via pve-devel @ 2025-03-24 11:15 UTC (permalink / raw)
  To: pve-devel; +Cc: Alexandre Derumier

From: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH qemu-server 1/1] qemu: add offline migration from dead node
Date: Mon, 24 Mar 2025 12:15:29 +0100
Message-ID: <20250324111529.338025-3-alexandre.derumier@groupe-cyllene.com>

verify that node is dead from corosync && ssh
and move config file from /etc/pve directly

Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
---
 PVE/API2/Qemu.pm | 56 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
index 156b1c7b..58c454a6 100644
--- a/PVE/API2/Qemu.pm
+++ b/PVE/API2/Qemu.pm
@@ -4764,6 +4764,9 @@ __PACKAGE__->register_method({
 		description => "Target node.",
 		completion =>  \&PVE::Cluster::complete_migration_target,
             }),
+            deadnode => get_standard_option('pve-node', {
+                description => "Dead source node.",
+            }),
 	    online => {
 		type => 'boolean',
 		description => "Use online/live migration if VM is running. Ignored if VM is stopped.",
@@ -4813,8 +4816,9 @@ __PACKAGE__->register_method({
 	my $authuser = $rpcenv->get_user();
 
 	my $target = extract_param($param, 'target');
+	my $deadnode = extract_param($param, 'deadnode');
 
-	my $localnode = PVE::INotify::nodename();
+	my $localnode = $deadnode ? $deadnode : PVE::INotify::nodename();
 	raise_param_exc({ target => "target is local node."}) if $target eq $localnode;
 
 	PVE::Cluster::check_cfs_quorum();
@@ -4835,14 +4839,43 @@ __PACKAGE__->register_method({
 	raise_param_exc({ migration_network => "Only root may use this option." })
 	    if $param->{migration_network} && $authuser ne 'root@pam';
 
+	raise_param_exc({ deadnode => "Only root may use this option." })
+	    if $param->{deadnode} && $authuser ne 'root@pam';
+
 	# test if VM exists
-	my $conf = PVE::QemuConfig->load_config($vmid);
+	my $conf = $deadnode ? PVE::QemuConfig->load_config($vmid, $deadnode) : PVE::QemuConfig->load_config($vmid);
 
 	# try to detect errors early
 
 	PVE::QemuConfig->check_lock($conf);
 
-	if (PVE::QemuServer::check_running($vmid)) {
+        if ($deadnode) {
+	    die "Can't do online migration of a dead node.\n" if $param->{online};
+	    my $members = PVE::Cluster::get_members();
+	    die "The deadnode $deadnode seem to be alive" if $members->{$deadnode} && $members->{$deadnode}->{online};
+
+	    print "test if deadnode $deadnode respond to ping\n";
+	    eval {
+		PVE::Tools::run_command("/usr/bin/ping -c 1 $members->{$deadnode}->{ip}");
+	    };
+	    if(!$@){
+		die "error: ping to target $deadnode is still working. Node seem to be alive.";
+	    }
+
+	    #make an extra ssh connection to double check that it's not just a corosync crash
+	    my $sshinfo = PVE::SSHInfo::get_ssh_info($deadnode);
+	    my $sshcmd = PVE::SSHInfo::ssh_info_to_command($sshinfo);
+	    push @$sshcmd, 'hostname';
+	    print "test if deadnode $deadnode respond to ssh\n";
+	    eval {
+		PVE::Tools::run_command($sshcmd, timeout => 1);
+	    };
+	    if(!$@){
+		die "error: ssh connection to target $deadnode is still working. Node seem to be alive.";
+	    }
+
+
+	} elsif (PVE::QemuServer::check_running($vmid)) {
 	    die "can't migrate running VM without --online\n" if !$param->{online};
 
 	    my $repl_conf = PVE::ReplicationConfig->new();
@@ -4881,7 +4914,22 @@ __PACKAGE__->register_method({
 	    PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
 	}
 
-	if (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
+	if ($deadnode) {
+	    my $realcmd = sub {
+		my $config_fn = PVE::QemuConfig->config_file($vmid, $deadnode);
+		my $new_config_fn = PVE::QemuConfig->config_file($vmid, $target);
+
+		rename($config_fn, $new_config_fn)
+		    or die "failed to move config file to node '$target': $!\n";
+	    };
+
+	    my $worker = sub {
+		return PVE::GuestHelpers::guest_migration_lock($vmid, 10, $realcmd);
+	    };
+
+	    return $rpcenv->fork_worker('qmigrate', $vmid, $authuser, $worker);
+
+        } elsif (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
 
 	    my $hacmd = sub {
 		my $upid = shift;
-- 
2.39.5




* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-03-24 11:15 ` [pve-devel] [PATCH qemu-server 1/1] qemu: add offline " Alexandre Derumier via pve-devel
@ 2025-04-01  9:52   ` Fabian Grünbichler
  2025-04-01  9:57     ` Thomas Lamprecht
  0 siblings, 1 reply; 13+ messages in thread
From: Fabian Grünbichler @ 2025-04-01  9:52 UTC (permalink / raw)
  To: Proxmox VE development discussion

> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 24.03.2025 12:15 CET:
> verify that node is dead from corosync && ssh
> and move config file from /etc/pve directly

there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI..

the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were already started before might still be running as well. any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), this can easily completely corrupt guest volumes if you attempt to recover and start such a guest. HA protects against this - node X will fence itself before node A will attempt recovery, so there is never a situation where both nodes will try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not enough by any stretch to make this safe.

the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node where the config is currently on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios with careful consideration of the implications:
- when doing a migration, which is initiated by the source node that is currently owning the guest, so it willingly hands over control to the new node which is safe by definition (no stealing involved and proper locking in place)
- when doing a HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks).
your code below is not protected by the HA stack, so there is a race involved - your node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config, anything else doesn't make sense/doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it. there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.

I understand that this would be convenient to expose, but it is also really dangerous without understanding the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button and mess up and blame PVE. at the same time there is an actual implementation that safely implements it - it's called HA ;) so I'd rather spend some time focusing on improving the robustness of our HA stack, rather than adding such a footgun. 

> 
> Signed-off-by: Alexandre Derumier <alexandre.derumier@groupe-cyllene.com>
> ---
>  PVE/API2/Qemu.pm | 56 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index 156b1c7b..58c454a6 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -4764,6 +4764,9 @@ __PACKAGE__->register_method({
>  		description => "Target node.",
>  		completion =>  \&PVE::Cluster::complete_migration_target,
>              }),
> +            deadnode => get_standard_option('pve-node', {
> +                description => "Dead source node.",
> +            }),
>  	    online => {
>  		type => 'boolean',
>  		description => "Use online/live migration if VM is running. Ignored if VM is stopped.",
> @@ -4813,8 +4816,9 @@ __PACKAGE__->register_method({
>  	my $authuser = $rpcenv->get_user();
>  
>  	my $target = extract_param($param, 'target');
> +	my $deadnode = extract_param($param, 'deadnode');
>  
> -	my $localnode = PVE::INotify::nodename();
> +	my $localnode = $deadnode ? $deadnode : PVE::INotify::nodename();
>  	raise_param_exc({ target => "target is local node."}) if $target eq $localnode;
>  
>  	PVE::Cluster::check_cfs_quorum();
> @@ -4835,14 +4839,43 @@ __PACKAGE__->register_method({
>  	raise_param_exc({ migration_network => "Only root may use this option." })
>  	    if $param->{migration_network} && $authuser ne 'root@pam';
>  
> +	raise_param_exc({ deadnode => "Only root may use this option." })
> +	    if $param->{deadnode} && $authuser ne 'root@pam';
> +
>  	# test if VM exists
> -	my $conf = PVE::QemuConfig->load_config($vmid);
> +	my $conf = $deadnode ? PVE::QemuConfig->load_config($vmid, $deadnode) : PVE::QemuConfig->load_config($vmid);
>  
>  	# try to detect errors early
>  
>  	PVE::QemuConfig->check_lock($conf);
>  
> -	if (PVE::QemuServer::check_running($vmid)) {
> +        if ($deadnode) {
> +	    die "Can't do online migration of a dead node.\n" if $param->{online};
> +	    my $members = PVE::Cluster::get_members();
> +	    die "The deadnode $deadnode seem to be alive" if $members->{$deadnode} && $members->{$deadnode}->{online};
> +
> +	    print "test if deadnode $deadnode respond to ping\n";
> +	    eval {
> +		PVE::Tools::run_command("/usr/bin/ping -c 1 $members->{$deadnode}->{ip}");
> +	    };
> +	    if(!$@){
> +		die "error: ping to target $deadnode is still working. Node seem to be alive.";
> +	    }
> +
> +	    #make an extra ssh connection to double check that it's not just a corosync crash
> +	    my $sshinfo = PVE::SSHInfo::get_ssh_info($deadnode);
> +	    my $sshcmd = PVE::SSHInfo::ssh_info_to_command($sshinfo);
> +	    push @$sshcmd, 'hostname';
> +	    print "test if deadnode $deadnode respond to ssh\n";
> +	    eval {
> +		PVE::Tools::run_command($sshcmd, timeout => 1);
> +	    };
> +	    if(!$@){
> +		die "error: ssh connection to target $deadnode is still working. Node seem to be alive.";
> +	    }
> +
> +
> +	} elsif (PVE::QemuServer::check_running($vmid)) {
>  	    die "can't migrate running VM without --online\n" if !$param->{online};
>  
>  	    my $repl_conf = PVE::ReplicationConfig->new();
> @@ -4881,7 +4914,22 @@ __PACKAGE__->register_method({
>  	    PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
>  	}
>  
> -	if (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
> +	if ($deadnode) {
> +	    my $realcmd = sub {
> +		my $config_fn = PVE::QemuConfig->config_file($vmid, $deadnode);
> +		my $new_config_fn = PVE::QemuConfig->config_file($vmid, $target);
> +
> +		rename($config_fn, $new_config_fn)
> +		    or die "failed to move config file to node '$target': $!\n";
> +	    };
> +
> +	    my $worker = sub {
> +		return PVE::GuestHelpers::guest_migration_lock($vmid, 10, $realcmd);
> +	    };
> +
> +	    return $rpcenv->fork_worker('qmigrate', $vmid, $authuser, $worker);
> +
> +        } elsif (PVE::HA::Config::vm_is_ha_managed($vmid) && $rpcenv->{type} ne 'ha') {
>  
>  	    my $hacmd = sub {
>  		my $upid = shift;
> -- 
> 2.39.5



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01  9:52   ` Fabian Grünbichler
@ 2025-04-01  9:57     ` Thomas Lamprecht
  2025-04-01 10:19       ` Dominik Csapak
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Lamprecht @ 2025-04-01  9:57 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

On 01.04.25 at 11:52, Fabian Grünbichler wrote:
>> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 24.03.2025 12:15 CET:
>> verify that node is dead from corosync && ssh
>> and move config file from /etc/pve directly
> there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI..
> 
> the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were already started before might still be running as well. any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), this can easily completely corrupt guest volumes if you attempt to recover and start such a guest. HA protects against this - node X will fence itself before node A will attempt recovery, so there is never a situation where both nodes will try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not enough by any stretch to make this safe.
> 
> the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node where the config is currently on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios with careful consideration of the implications:
> - when doing a migration, which is initiated by the source node that is currently owning the guest, so it willingly hands over control to the new node which is safe by definition (no stealing involved and proper locking in place)
> - when doing a HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks).
> your code below is not protected by the HA stack, so there is a race involved - your node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config, anything else doesn't make sense/doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it. there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.
> 
> I understand that this would be convenient to expose, but it is also really dangerous without understanding the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button and mess up and blame PVE. at the same time there is an actual implementation that safely implements it - it's called HA 😉 so I'd rather spend some time focusing on improving the robustness of our HA stack, rather than adding such a footgun. 
> 

+1 to all of the above.



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01  9:57     ` Thomas Lamprecht
@ 2025-04-01 10:19       ` Dominik Csapak
  2025-04-01 10:46         ` Thomas Lamprecht
  2025-04-01 16:13         ` DERUMIER, Alexandre via pve-devel
  0 siblings, 2 replies; 13+ messages in thread
From: Dominik Csapak @ 2025-04-01 10:19 UTC (permalink / raw)
  To: pve-devel

On 4/1/25 11:57, Thomas Lamprecht wrote:
> On 01.04.25 at 11:52, Fabian Grünbichler wrote:
>>> Alexandre Derumier via pve-devel <pve-devel@lists.proxmox.com> wrote on 24.03.2025 12:15 CET:
>>> verify that node is dead from corosync && ssh
>>> and move config file from /etc/pve directly
>> there are two reasons why this is dangerous and why we haven't exposed anything like this in the API or UI..
>>
>> the first one is the risk of corruption - just because a (supposedly dead) node X is not reachable from your local node A doesn't mean it isn't still running. if it is still running, any guests that were already started before might still be running as well. any guests still running might still be able to talk to shared storage. unless there are other safeguards in place (like MMP, which is not a given for all storages), this can easily completely corrupt guest volumes if you attempt to recover and start such a guest. HA protects against this - node X will fence itself before node A will attempt recovery, so there is never a situation where both nodes will try to write to the same volume. just checking whether other cluster nodes can still connect to node X is not enough by any stretch to make this safe.
>>
>> the second one is ownership of a VM/CT - PVE relies on node-local locking of guests to avoid contention. this only works because each guest/VMID has a clear owner - the node where the config is currently on. if you steal a config by moving it, you are violating this assumption. we only change the owner of a VMID in two scenarios with careful consideration of the implications:
>> - when doing a migration, which is initiated by the source node that is currently owning the guest, so it willingly hands over control to the new node which is safe by definition (no stealing involved and proper locking in place)
>> - when doing a HA recovery, which is protected by the HA locks and the watchdog - we know that the original node has been fenced before the recovery happens and we know it cannot do anything with the guest before it has been informed about the recovery (this is ensured by the design of the HA locks).
>> your code below is not protected by the HA stack, so there is a race involved - your node where the "deadnode migration" is initiated cannot lock the VMID in a way that the supposedly "dead" node knows about (config locking for guests is node-local, so it can only happen on the node that "owns" the config, anything else doesn't make sense/doesn't protect anything). if the "dead" node rejoins the cluster at the right moment, it still owns the VMID/config and can start it, while the other node thinks it can still steal it. there's also no protection against initiating multiple deadnode migrations in parallel for the same VMID, although of course all but one will fail because pmxcfs ensures the VMID.conf only exists under a single node. we'd need to give up node-local guest locking to close this gap, which is a no-go for performance reasons.
>>
>> I understand that this would be convenient to expose, but it is also really dangerous without understanding the implications - and once there is an option to trigger it via the UI, no matter how many disclaimers you put on it, people will press that button and mess up and blame PVE. at the same time there is an actual implementation that safely implements it - it's called HA 😉 so I'd rather spend some time focusing on improving the robustness of our HA stack, rather than adding such a footgun.
>>
> 
> +1 to all of the above.
> 
While I also agree to all said here, I have one counterpoint to offer:

In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
for whatever reason), the user will fall back to doing it manually (iow. 'mv source target')
which is at least as dangerous as exposing over the API, since

* now the admins sharing the system must share root@pam credentials (ssh/console access)
   (alternatively set up sudo, which has its own problems)

* it promotes manually modifying /etc/pve/ content

* any error could be even more fatal than if done via the API
   (e.g. mv of the wrong file, from the wrong node, etc.)


IMHO ways forward for this scenario could be:

* use cluster level locking only for config move? (not sure if performance is still
   a concern for this action, since parallel moves don't happen too much?)

* provide a special CLI tool/cmd to deal with that -> would minimize potential
   errors but is still contained to root equivalent users

* link to the doc section for it from the UI with a big caveat
   https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery

just my 2c



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 10:19       ` Dominik Csapak
@ 2025-04-01 10:46         ` Thomas Lamprecht
  2025-04-01 11:13           ` Fabian Grünbichler
  2025-04-01 11:37           ` Dominik Csapak
  2025-04-01 16:13         ` DERUMIER, Alexandre via pve-devel
  1 sibling, 2 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2025-04-01 10:46 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak

On 01.04.25 at 12:19, Dominik Csapak wrote:
> while i also agree to all said here, I have one counter point to offer:
> 
> In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
> for what ever reason), the user will fall back to do it manually (iow. 'mv source target')
> which is at least as dangerous as exposing over the API, since
> 
> * now the admins sharing the system must share root@pam credentials (ssh/console access)
>    (alternatively setup sudo, which has it's own problems)

Setups with many admins already need to handle how they can log in as root, be
it through a jump user (`doas` is a thing if sudo is deemed too complex), some
identity provider (LDAP, OIDC, ... with PAM configuration), as root operations
are required for other things too.

> * it promotes manually modifying /etc/pve/ content

Yeah, as that's what's actually required after manual assessment, abstracting
that away won't really bring a big benefit IMO.

> 
> * any error could be even more fatal than if done via the API
>    (e.g. mv of the wrong file, from the wrong node, etc.)

This cannot be said for sure, these are unknown unknowns. FWIW, the API could
make it worse too compared to an admin carefully fixing this according to the
needs of a specific situation at hand.

> IMHO ways forward for this scenario could be:
> 
> * use cluster level locking only for config move? (not sure if performance is still
>    a concern for this action, since parallel moves don't happen too much?)

What does this solve? The old node is still in an unknown state and does not
see any pmxcfs changes at all. The VM can still run and cause issues with
duplicate unsynchronized resource access and all the other woes that can
happen if the same guest runs twice.

> 
> * provide a special CLI tool/cmd to deal with that -> would minimize potential
>    errors but is still contained to root equivalent users

This would still have your own arguments w.r.t. root login speaking against
that. And it would not be that big of a difference as for local involved
resources the tool cannot work if the source node cannot be talked with and
for all-shared resources the simple config move is as safe as such a tool
would get in the context of a dead source node, as for either the admin must
ensure it's actually dead.

> * link to the doc section for it from the UI with a big caveat
>    https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery

As Fabian wrote, such disclaimers might be nice for shifting the blame
but are not enough in practice for such an operation.

And Fabian's point wasn't that doing it on the CLI is less dangerous, it's
about the same either way, but that exposing this as a well-integrated feature
makes it seem much less dangerous to the user, especially those that are
less experienced and should be stumped and ask some support channel for
help.

That said, the actual first step to move this forward would IMO be to create
an extensive documentation/how-to for how such things can be resolved and what
one needs to watch out for, sort of check-list style might be a good format.
As that alone should help users a lot already, and that would also make it
much clearer what a more integrated (semi-automated) way could look like.
Which could be a check tool that helps with assessing the recovery depending
on config, storage (types), network, mappings, ... which would ensure that
common issues/blockers are not missed and will even help experienced admins.
If that cannot be first documented and then optionally transformed into a
hands-off evaluation checker tool, or if that's deemed to not help users, I
really do not see how an API-integrated solution can do so without just
hand-waving away all actual and real issues for why this does not already
exist.
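
To make the check-tool idea a bit more concrete, here is a minimal sketch of
what such an assessment helper could look like. The sub name and the exact
set of checks are assumptions rather than an existing tool; it only reuses
helpers that already appear in the patch earlier in this thread, and it
deliberately reports findings instead of moving anything:

use strict;
use warnings;

use PVE::Cluster;
use PVE::QemuConfig;
use PVE::QemuServer;
use PVE::Storage;
use PVE::HA::Config;

# report blockers for recovering $vmid from $deadnode onto $target -
# nothing is moved or modified here, assessment only
sub assess_dead_node_recovery {
    my ($vmid, $deadnode, $target) = @_;
    my @issues;

    PVE::Cluster::check_cfs_quorum();

    my $members = PVE::Cluster::get_members();
    push @issues, "node '$deadnode' still shows up as online in corosync"
        if $members->{$deadnode} && $members->{$deadnode}->{online};

    push @issues, "guest $vmid is HA-managed - let HA recover it instead"
        if PVE::HA::Config::vm_is_ha_managed($vmid);

    # the config is still owned by the dead node, so load it from there
    my $conf = PVE::QemuConfig->load_config($vmid, $deadnode);
    eval { PVE::QemuConfig->check_lock($conf) };
    push @issues, "config is locked: $@" if $@;

    # all referenced storages must be reachable from the target node
    my $storecfg = PVE::Storage::config();
    eval {
        PVE::QemuServer::check_storage_availability($storecfg, $conf, $target);
    };
    push @issues, "storage check failed: $@" if $@;

    return \@issues;
}

Such a helper would only assist with the assessment, exactly as argued above;
the actual config move would stay a separate, deliberate step.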





* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 10:46         ` Thomas Lamprecht
@ 2025-04-01 11:13           ` Fabian Grünbichler
  2025-04-01 12:38             ` Thomas Lamprecht
  2025-04-01 11:37           ` Dominik Csapak
  1 sibling, 1 reply; 13+ messages in thread
From: Fabian Grünbichler @ 2025-04-01 11:13 UTC (permalink / raw)
  To: Proxmox VE development discussion, Thomas Lamprecht, Dominik Csapak


> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 01.04.2025 12:46 CEST:
> 
>  
> On 01.04.25 at 12:19, Dominik Csapak wrote:
> > while i also agree to all said here, I have one counter point to offer:
> > 
> > In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
> > for what ever reason), the user will fall back to do it manually (iow. 'mv source target')
> > which is at least as dangerous as exposing over the API, since
> > 
> > * now the admins sharing the system must share root@pam credentials (ssh/console access)
> >    (alternatively setup sudo, which has it's own problems)
> 
> Setups with many admins need to handle already how they can log in as root, be
> it through a jump user (`doas` is a thing if sudo is deemed to complex), some
> identity provider (LDAP, OIDC, ... with PAM configuration), as root operations
> are required for other things too.

and the same feature on the API also requires root@pam anyway ;)

> [..]

> > * link to the doc section for it from the UI with a big caveat
> >    https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery
> 
> As Fabian wrote, such disclaimers might be nice for shifting the blame
> but are not enough in practice for such an operation.
> 
> And Fabians point wasn't that doing it on the CLI is less dangerous, its
> about the same either way, but that exposing this as well-integrated feature
> makes it seem much less dangerous to the user, especially those that are
> less experienced and should be stumped and ask some support channel for
> help.
> 
> That said, the actual first step to move this forward would IMO be to create
> an extensive documentation/how-to for how such things can be resolved and what
> one needs to watch out for, sort of check-list style might be a good format.
> As that alone should help users a lot already, and that would also make it
> much clearer what a more integrated (semi-automated) way could look like.
> Which could be a check tool that helps with assessing the recovery depending
> on config, storage (types), network, mappings, ... which would ensure that
> common issues/blockers are not missed and will even help experienced admins.
> If that cannot be first documented and then optionally transformed into a
> hands-off evaluation checker tool, or if that's deemed to not help users, I
> really do not see how an API integrated solution can do so without just
> hand-waving all actual and real issues for why this does not already exists
> away.

(improving) such docs would be nice - we do have a little bit here:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovering_moving_guests_from_failed_nodes 

the only way to technically improve what is possible IMHO would be to implement
some kind of reliable STONITH mechanism in addition to fencing, and base an
integrated "guest stealing" mechanism on that (with some additional component
that ensures that if the "shot" comes back up right away it won't do anything
with the "stolen" guest before the theft is over).

e.g., if you have a (set of) remote-manageable power strip(s) configured that
allows:
- removing all power from node
- query power state of a node

you could use that to reduce HA failover times (you can shoot the other node
if you want to make it fenced, irrespective of watchdog timeouts/..), and to
implement a guest stealing mechanism:
- put a file/entry in /etc/pve marking a guest as "currently being stolen"
- shoot the other node and verify it is down
- steal config
- remove marker file/entry

no matter at which point after the shooting the other node comes back up, it
must first sync up /etc/pve, which means it can check for markers on VM
locking. if a marker is found, it's not allowed to lock, else it can proceed
(checking doesn't require locking cluster wide, just setting the mark would).
if no marker is found, the config is not there anymore either or it hasn't
been stolen and can be locked and used normally.

if no stonith mechanism is configured, stealing is not available.
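
A rough sketch of how that flow could be stitched together follows. The
power-control helpers are purely hypothetical stubs (no such fence API is
exposed here today) and the marker path is only an assumed location, not an
agreed-upon format:

use strict;
use warnings;

use PVE::Tools;
use PVE::QemuConfig;
use PVE::GuestHelpers;

# hypothetical power-control hooks - a real implementation would talk to a
# configured fence device (PDU, IPMI, ...); stubs only so the sketch compiles
sub stonith_power_off   { die "no fence device configured\n" }
sub stonith_power_state { die "no fence device configured\n" }

# steal $vmid from $deadnode after shooting it - sketch only
sub steal_guest_config {
    my ($vmid, $deadnode, $target) = @_;

    my $marker = "/etc/pve/priv/stolen/$vmid"; # assumed marker location

    PVE::GuestHelpers::guest_migration_lock($vmid, 10, sub {
        # 1. mark the guest as "currently being stolen" so a rejoining
        #    node has to back off before locking/starting it
        PVE::Tools::file_set_contents($marker, "target=$target\n");

        # 2. shoot the node and verify it really lost power
        stonith_power_off($deadnode);
        die "node '$deadnode' still powered on\n"
            if stonith_power_state($deadnode) ne 'off';

        # 3. steal the config by moving it to the target node
        my $old = PVE::QemuConfig->config_file($vmid, $deadnode);
        my $new = PVE::QemuConfig->config_file($vmid, $target);
        rename($old, $new) or die "failed to move config: $!\n";

        # 4. drop the marker - ownership is unambiguous again
        unlink($marker);
    });
}

# on a rejoining node, guest config locking would first have to honour the
# marker before it may lock or start the guest
sub may_lock_config {
    my ($vmid) = @_;
    return !-e "/etc/pve/priv/stolen/$vmid";
}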



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 10:46         ` Thomas Lamprecht
  2025-04-01 11:13           ` Fabian Grünbichler
@ 2025-04-01 11:37           ` Dominik Csapak
  2025-04-01 12:54             ` Thomas Lamprecht
  1 sibling, 1 reply; 13+ messages in thread
From: Dominik Csapak @ 2025-04-01 11:37 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion

On 4/1/25 12:46, Thomas Lamprecht wrote:
> On 01.04.25 at 12:19, Dominik Csapak wrote:
>> while i also agree to all said here, I have one counter point to offer:
>>
>> In the case that such an operation is necessary (e.g. HA is not wanted/needed/possible
>> for what ever reason), the user will fall back to do it manually (iow. 'mv source target')
>> which is at least as dangerous as exposing over the API, since
>>
>> * now the admins sharing the system must share root@pam credentials (ssh/console access)
>>     (alternatively setup sudo, which has it's own problems)
> 
> Setups with many admins need to handle already how they can log in as root, be
> it through a jump user (`doas` is a thing if sudo is deemed to complex), some
> identity provider (LDAP, OIDC, ... with PAM configuration), as root operations
> are required for other things too.

You're right.

> 
>> * it promotes manually modifying /etc/pve/ content
> 
> Yeah, as that's what's actually required after manual assessment, abstracting
> that away won't really bring a big benefit IMO.

It would reduce the necessity to do things via the CLI, which is IMO
a strong point of PVE (but you're right, the assessment part
can't be removed anyway)

> 
>>
>> * any error could be even more fatal than if done via the API
>>     (e.g. mv of the wrong file, from the wrong node, etc.)
> 
> This cannot be said for sure, these are unknown unknowns. FWIW, the API could
> make it worse too compared to an admin carefully fixing this according to the
> needs of a specific situation at hand.

Mhmm, what I meant here is that instructing the user to manually
do 'mv some-path some-other-path' has more error potential (e.g.
typos, misremembering nodenames/vmids/etc.) than e.g. clicking
the vm on the offline node and pressing a button (or
following a CLI tool output/options)

> 
>> IMHO ways forward for this scenario could be:
>>
>> * use cluster level locking only for config move? (not sure if performance is still
>>     a concern for this action, since parallel moves don't happen too much?)
> 
> What does this solve? The old node is still in an unknown state and does not
> sees any pmxcfs changes at all. The VM can still run and cause issues with
> duplicate unsynchronized resource access and all the other woes that can
> happen if the same guest runs twice.

I mentioned it because Fabian wrote we could maybe solve it with a
cluster wide VM lock, I think restricting the moving to such a lock
could work, under the assumption that the admin makes sure the offline
node is and stays offline. (Which he has to do anyway)

> 
>>
>> * provide a special CLI tool/cmd to deal with that -> would minimize potential
>>     errors but is still contained to root equivalent users
> 
> This would still have your own arguments w.r.t. root login speaking against
> that. And it would not be that big of a difference as for local involved
> resources the tool cannot work if the source node cannot be talked with and
> for all-shared resources the simple config move is as safe as such a tool
> would get in the context of a dead source node, as for either the admin must
> ensure it's actually dead.

It still improves the UX for that situation since it's then a
provided/guided way vs. mv'ing files on the filesystem.

(e.g. such a tool could check if the source node is reachable, etc.)

> 
>> * link to the doc section for it from the UI with a big caveat
>>     https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_recovery
> 
> As Fabian wrote, such disclaimers might be nice for shifting the blame
> but are not enough in practice for such an operation.
> 
> And Fabians point wasn't that doing it on the CLI is less dangerous, its
> about the same either way, but that exposing this as well-integrated feature
> makes it seem much less dangerous to the user, especially those that are
> less experienced and should be stumped and ask some support channel for
> help.
> 
> That said, the actual first step to move this forward would IMO be to create
> an extensive documentation/how-to for how such things can be resolved and what
> one needs to watch out for, sort of check-list style might be a good format.
> As that alone should help users a lot already, and that would also make it
> much clearer what a more integrated (semi-automated) way could look like.
> Which could be a check tool that helps with assessing the recovery depending
> on config, storage (types), network, mappings, ... which would ensure that
> common issues/blockers are not missed and will even help experienced admins.
> If that cannot be first documented and then optionally transformed into a
> hands-off evaluation checker tool, or if that's deemed to not help users, I
> really do not see how an API integrated solution can do so without just
> hand-waving all actual and real issues for why this does not already exists
> away.
> 
> 

Yes, e.g. that's what I meant - tooling on the CLI is one possibility
to improve it.

Also, as Fabian wrote in the other message, STONITH can improve that (but
comes with its own set of difficulties & complexity)

Just to clarify, I'm not for blindly implementing such an API call/CLI tool/etc.
but wanted to argue that we probably want to improve the UX of that situation
as well as we can, and offered my thoughts on how we could do it.



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 11:13           ` Fabian Grünbichler
@ 2025-04-01 12:38             ` Thomas Lamprecht
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2025-04-01 12:38 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox VE development discussion,
	Dominik Csapak

On 01.04.25 at 13:13, Fabian Grünbichler wrote:
> the only way to technically improve what is possible IMHO would be to implement
> some kind of reliable STONITH mechanism in addition to fencing, and base an
> integrated "guest stealing" mechanism on that (with some additional component
> that ensures that if the "shot" comes back up right away it won't do anything
> with the "stolen" guest before the theft is over).
> 
> e.g., if you have a (set of) remote-manageable power strip(s) configured that
> allows:
> - removing all power from node
> - query power state of a node
> 
> you could use that to reduce HA failover times (you can shoot the other node
> if you want to make it fenced, irrespective of watchdog timeouts/..), and to
> implement a guest stealing mechanism:
> - put a file/entry in /etc/pve marking a guest as "currently being stolen"
> - shoot the other node and verify it is down
> - steal config
> - remove marker file/entry
> 
> no matter at which point after the shooting the other node comes back up, it
> must first sync up /etc/pve, which means it can check for markers on VM
> locking. if a marker is found, it's not allowed to lock, else it can proceed
> (checking doesn't require locking cluster wide, just setting the mark would).
> if no marker is found, the config is not there anymore either or it hasn't
> been stolen and can be locked and used normally.
> 
> if no stonith mechanism is configured, stealing is not available.

That's basically exactly what the HW fencing series I worked on years ago does,
including lower timeouts and so on. It was only integrated in HA, exposing the
HW fencing (which is STONITH) separately would be possible though.
That said adding STONITH and external fence devices to the mix is not a trivial
thing and hardly simplifies setups IMO, so while a possibility I'd not see it
as something to promote to inexperienced users.



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 11:37           ` Dominik Csapak
@ 2025-04-01 12:54             ` Thomas Lamprecht
  2025-04-01 13:20               ` Dominik Csapak
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Lamprecht @ 2025-04-01 12:54 UTC (permalink / raw)
  To: Dominik Csapak, Proxmox VE development discussion

On 01.04.25 at 13:37, Dominik Csapak wrote:
> Mhmm, what I meant here is that instructing the user to manually
> do 'mv some-path some-other-path' has more error potential (e.g.
> typos, misremembering nodenames/vmids/etc.) than e.g. clicking
> the vm on the offline node and pressing a button (or
> following a CLI tool output/options)

Which all have their error potential too, especially with hostnames
being free-form and not exclusive.

> I mentioned it because fabian wrote we could maybe solve it with a
> cluster wide VM lock, I think restricting the moving to such a lock
> could work, under the assumption that the admin makes sure the offline
> node is and stays offline. (Which he has to do anyway)

Still not sure what this would provide; pmxcfs guarantees that the VMID
config can exist only once already anyway, so only one node can do a
move and such moves can only happen if they would be equal to a file
rename as any resource must be shared already to make this work.
Well replication could be fixed up I guess, but that can be handled on
VM start too. Cannot think of anything else (without an in-depth
evaluation though) that an API can/should do differently for the actual
move itself. Doing some up-front checks is a different story, but that
could also result in a false sense of safety.

> It still improves the UX for that situation since it's then a
> provided/guided way vs. mv'ing files on the filesystem.

I'd not touch the move part though, at least for starters; just like the
upgrade checker scripts, it should only assist.

> Just to clarify, I'm not for blindly implementing such an API call/CLI tool/etc.
> but wanted to argue that we probably want to improve the UX of that situation
> as good as we can and offered my thoughts on how we could do it.
 
That's certainly fine; having it improved would be good, but I'm very wary
of hot takes and hand waving (not meaning you here, just in general), this
isn't a purge/remove/wipe of some resource on a working system, like wiping
disks or removing guests, as that can present the information to the admin
from a known good node that manages its state itself.
An unknown/dead node is literally breaking a core clustering assumption that
we build upon in a lot of places, IMO a very different thing. Mentioning this
as it might be easy to question why other destructive actions are exposed in
the UI.

And FWIW, if I should reconsider this it would be much easier to argue for
further integration if the basic assistant/checker guide/tool already
existed for some time and was somewhat battle tested, as that would allow a
much more confident evaluation of options, whatever those then look like;
some "scary" hint in the UI with lots of exclamation marks does not cut it
for me though, no offense to anybody.



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 12:54             ` Thomas Lamprecht
@ 2025-04-01 13:20               ` Dominik Csapak
  2025-04-01 15:08                 ` Thomas Lamprecht
  0 siblings, 1 reply; 13+ messages in thread
From: Dominik Csapak @ 2025-04-01 13:20 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion

On 4/1/25 14:54, Thomas Lamprecht wrote:
> On 01.04.25 at 13:37, Dominik Csapak wrote:
>> Mhmm, what I meant here is that instructing the user to manually
>> do 'mv some-path some-other-path' has more error potential (e.g.
>> typos, misremembering nodenames/vmids/etc.) than e.g. clicking
>> the vm on the offline node and pressing a button (or
>> following a CLI tool output/options)
> 
> Which all have their error potential too, especially with hostnames
> being free-form and not exclusive.
> 
>> I mentioned it because fabian wrote we could maybe solve it with a
>> cluster wide VM lock, I think restricting the moving to such a lock
>> could work, under the assumption that the admin makes sure the offline
>> node is and stays offline. (Which he has to do anyway)
> 
> Still not sure what this would provide, pmxcfs gurantees that the VMID
> config can exist only once already anyway, so only one node can do a
> move and such moves can only happen if they would be equal to a file
> rename as any resource must be shared already to make this work.
> Well replication could be fixed up I guess, but that can be handled on
> VM start too. Cannot think of anything else (without an in-depth
> evaluation though) that an API can/should do different for the actual
> move itself. Doing some up-front checks is a different story, but that
> could also result in a false sense of safety.
> 
>> It still improves the UX for that situation since it's then a
>> provided/guided way vs. mv'ing files on the filesystem.
> 
> I'd not touch the move part though, at least for starters, just like the
> upgrade checker scripts it should only assist.
> 
>> Just to clarify, I'm not for blindly implementing such an API call/CLI tool/etc.
>> but wanted to argue that we probably want to improve the UX of that situation
>> as good as we can and offered my thoughts on how we could do it.
>   
> That's certainly fine; having it improved would be good, but I'm very wary
> of hot takes and hand waving (not meaning you here, just in general), this
> isn't a purge/remove/wipe of some resource on a working system, like wiping
> disks or removing guests, as that can present the information to the admin
> from a known good node that manages its state itself.
> An unknown/dead node is literally breaking core clustering assumption that
> we build upon on a lot of places, IMO a very different thing. Mentioning this
> as it might be easy to question why other destructive actions are exposed in
> the UI.
> 
> And FWIW, if I should reconsider this it would be much easier to argue for
> further integration if the basic assistant/checker guide/tool already
> existed for some time and was somewhat battle tested, as that would allow a
> much more confident evaluation of options, whatever those then look like;
> some "scary" hint in the UI with lots of exclamation marks does not cut it
> for me though, no offense to anybody.


I agree with all of your points, so I think the best and easiest way to improve the current
situation would be to:

* Improve the docs to emphasize more that this situation should be an exception
   and that working around cluster assumptions can have severe consequences.
   (Maybe nudge users towards HA if this is a common situation for them)
   Also it would be good for it to be in a (like you suggested) check-list style
   manner, so that admins have a guided way to check for things like
   storage, running nodes, etc.

* Change the migration UI to show a warning that the node is offline
   and provide a direct link to the above-mentioned improved docs

What do you think?



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 13:20               ` Dominik Csapak
@ 2025-04-01 15:08                 ` Thomas Lamprecht
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Lamprecht @ 2025-04-01 15:08 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak

On 01.04.25 at 15:20, Dominik Csapak wrote:
> I agree with all of your points, so I think the best and easiest way to improve the current
> situation would be to:
> 
> * Improve the docs to emphasize more that this situation should be an exception
>    and that working around cluster assumptions can have severe consequences.
>    (Maybe nudge users towards HA if this is a common situation for them)
>    Also it be good for it to be in a (like you suggested) check-list style
>    manner, so that admins have an guided way to check for things like
>    storage, running nodes, etc.
> 
> * Change the migration UI to show a warning that the node is offline
>    and provide a direct link to above mentioned improved docs
> 
> What do you think?

The hint to the docs in the UI is a good idea, so yes, that should have
the best ratio of being productive while staying safe for now.



* Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
  2025-04-01 10:19       ` Dominik Csapak
  2025-04-01 10:46         ` Thomas Lamprecht
@ 2025-04-01 16:13         ` DERUMIER, Alexandre via pve-devel
  1 sibling, 0 replies; 13+ messages in thread
From: DERUMIER, Alexandre via pve-devel @ 2025-04-01 16:13 UTC (permalink / raw)
  To: pve-devel; +Cc: DERUMIER, Alexandre

From: "DERUMIER, Alexandre" <alexandre.derumier@groupe-cyllene.com>
To: "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH qemu-server 1/1] qemu: add offline migration from dead node
Date: Tue, 1 Apr 2025 16:13:23 +0000
Message-ID: <368d42667a23554c6f7d51697c89eb43f5544cf7.camel@groupe-cyllene.com>

Hi! (Sorry to disturb the mailing list.)


>>(iow. 'mv source target')
>>which is at least as dangerous as exposing over the API, since

>>* now the admins sharing the system must share root@pam credentials
>>(ssh/console access)
>>   (alternatively setup sudo, which has it's own problems)
>>
>>* it promotes manually modifying /etc/pve/ content
>>
>>* any error could be even more fatal than if done via the API
>>   (e.g. mv of the wrong file, from the wrong node, etc.)

That was more or less the idea of the patch series. (OK, the GUI is
not the best part ^_^).

I would like to avoid doing the mv manually in /etc/pve. (I know a
lot of people doing it, and generally when you need to do it, it's a
crash during the night when your brain is off and mistakes can occur
(Murphy's law).)



So yes, maybe an extra manual STONITH through IPMI or power devices could
help for manual action. (Maybe declare the node as dead in the GUI,
calling the STONITH devices to be sure that the node is really dead.)


Being able to do it with root access could be a plus (I remember a
SuperAdmin patch series some years ago), as sometimes tech support night
admins don't always have root permission or sudo for compliance, or it
needs escalation just to restart VMs stuck on a dead node.


Alexandre




