public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
@ 2021-11-12  8:45 Fabian Grünbichler
  2021-11-12 10:04 ` [pve-devel] applied: " Thomas Lamprecht
  2021-11-12 11:50 ` [pve-devel] " Fabian Ebner
  0 siblings, 2 replies; 7+ messages in thread
From: Fabian Grünbichler @ 2021-11-12  8:45 UTC (permalink / raw)
  To: pve-devel

the recommended way is to first shutdown, then delnode, and never let it
come back online, in which case corosync-cfgtool won't be able to kill
the removed (offline) node.

also, the order was wrong - if we first update corosync.conf to remove
the node entry from the nodelist, corosync doesn't know about the nodeid
anymore, so killing will fail even if the node is still online.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
 data/PVE/API2/ClusterConfig.pm | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
index 8f4a5bb..5a6a1ac 100644
--- a/data/PVE/API2/ClusterConfig.pm
+++ b/data/PVE/API2/ClusterConfig.pm
@@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
 
 	    delete $nodelist->{$node};
 
-	    PVE::Corosync::update_nodelist($conf, $nodelist);
+	    # allowed to fail when node is already shut down!
+	    eval {
+		PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
+		    if defined($nodeid);
+	    };
 
-	    PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid]) if defined($nodeid);
+	    PVE::Corosync::update_nodelist($conf, $nodelist);
 	};
 
 	$config_change_lock->($code);
-- 
2.30.2





^ permalink raw reply	[flat|nested] 7+ messages in thread

* [pve-devel] applied: [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12  8:45 [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node Fabian Grünbichler
@ 2021-11-12 10:04 ` Thomas Lamprecht
  2021-11-12 11:50 ` [pve-devel] " Fabian Ebner
  1 sibling, 0 replies; 7+ messages in thread
From: Thomas Lamprecht @ 2021-11-12 10:04 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

On 12.11.21 09:45, Fabian Grünbichler wrote:
> the recommended way is to first shutdown, then delnode, and never let it
> come back online, in which case corosync-cfgtool won't be able to kill
> the removed (offline) node.
> 
> also, the order was wrong - if we first update corosync.conf to remove
> the node entry from the nodelist, corosync doesn't know about the nodeid
> anymore, so killing will fail even if the node is still online.
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>  data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
>

applied, thanks!




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12  8:45 [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node Fabian Grünbichler
  2021-11-12 10:04 ` [pve-devel] applied: " Thomas Lamprecht
@ 2021-11-12 11:50 ` Fabian Ebner
  2021-11-12 12:14   ` Thomas Lamprecht
  1 sibling, 1 reply; 7+ messages in thread
From: Fabian Ebner @ 2021-11-12 11:50 UTC (permalink / raw)
  To: pve-devel, Fabian Grünbichler

Am 12.11.21 um 09:45 schrieb Fabian Grünbichler:
> the recommended way is to first shutdown, then delnode, and never let it
> come back online, in which case corosync-cfgtool won't be able to kill
> the removed (offline) node.
> 
> also, the order was wrong - if we first update corosync.conf to remove
> the node entry from the nodelist, corosync doesn't know about the nodeid
> anymore, so killing will fail even if the node is still online.
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
>   data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
> index 8f4a5bb..5a6a1ac 100644
> --- a/data/PVE/API2/ClusterConfig.pm
> +++ b/data/PVE/API2/ClusterConfig.pm
> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>   
>   	    delete $nodelist->{$node};
>   
> -	    PVE::Corosync::update_nodelist($conf, $nodelist);
> +	    # allowed to fail when node is already shut down!
> +	    eval {
> +		PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
> +		    if defined($nodeid);
> +	    };
>   

But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? 
Shouldn't we match the error?

> -	    PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid]) if defined($nodeid);
> +	    PVE::Corosync::update_nodelist($conf, $nodelist);
>   	};
>   
>   	$config_change_lock->($code);
> 




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12 11:50 ` [pve-devel] " Fabian Ebner
@ 2021-11-12 12:14   ` Thomas Lamprecht
  2021-11-12 12:46     ` Fabian Ebner
  2021-11-12 12:59     ` Fabian Grünbichler
  0 siblings, 2 replies; 7+ messages in thread
From: Thomas Lamprecht @ 2021-11-12 12:14 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Ebner, Fabian Grünbichler

On 12.11.21 12:50, Fabian Ebner wrote:
> Am 12.11.21 um 09:45 schrieb Fabian Grünbichler:
>> the recommended way is to first shutdown, then delnode, and never let it
>> come back online, in which case corosync-cfgtool won't be able to kill
>> the removed (offline) node.
>>
>> also, the order was wrong - if we first update corosync.conf to remove
>> the node entry from the nodelist, corosync doesn't know about the nodeid
>> anymore, so killing will fail even if the node is still online.
>>
>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>> ---
>>   data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>> index 8f4a5bb..5a6a1ac 100644
>> --- a/data/PVE/API2/ClusterConfig.pm
>> +++ b/data/PVE/API2/ClusterConfig.pm
>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>             delete $nodelist->{$node};
>>   -        PVE::Corosync::update_nodelist($conf, $nodelist);
>> +        # allowed to fail when node is already shut down!
>> +        eval {
>> +        PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>> +            if defined($nodeid);
>> +        };
>>   
> 
> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?

at least that examples is like ENOENT on unlink, an OK error (user could
have -k'illed it before that).




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12 12:14   ` Thomas Lamprecht
@ 2021-11-12 12:46     ` Fabian Ebner
  2021-11-12 13:03       ` Thomas Lamprecht
  2021-11-12 12:59     ` Fabian Grünbichler
  1 sibling, 1 reply; 7+ messages in thread
From: Fabian Ebner @ 2021-11-12 12:46 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion,
	Fabian Grünbichler

Am 12.11.21 um 13:14 schrieb Thomas Lamprecht:
> On 12.11.21 12:50, Fabian Ebner wrote:
>> Am 12.11.21 um 09:45 schrieb Fabian Grünbichler:
>>> the recommended way is to first shutdown, then delnode, and never let it
>>> come back online, in which case corosync-cfgtool won't be able to kill
>>> the removed (offline) node.
>>>
>>> also, the order was wrong - if we first update corosync.conf to remove
>>> the node entry from the nodelist, corosync doesn't know about the nodeid
>>> anymore, so killing will fail even if the node is still online.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>> ---
>>>    data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>>> index 8f4a5bb..5a6a1ac 100644
>>> --- a/data/PVE/API2/ClusterConfig.pm
>>> +++ b/data/PVE/API2/ClusterConfig.pm
>>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>>              delete $nodelist->{$node};
>>>    -        PVE::Corosync::update_nodelist($conf, $nodelist);
>>> +        # allowed to fail when node is already shut down!
>>> +        eval {
>>> +        PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>>> +            if defined($nodeid);
>>> +        };
>>>    
>>
>> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?
> 
> at least that examples is like ENOENT on unlink, an OK error (user could
> have -k'illed it before that).
> 

My example is when it's *not* that error ;)
With the patch we treat all errors as OK.




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12 12:14   ` Thomas Lamprecht
  2021-11-12 12:46     ` Fabian Ebner
@ 2021-11-12 12:59     ` Fabian Grünbichler
  1 sibling, 0 replies; 7+ messages in thread
From: Fabian Grünbichler @ 2021-11-12 12:59 UTC (permalink / raw)
  To: Fabian Ebner, Proxmox VE development discussion, Thomas Lamprecht

On November 12, 2021 1:14 pm, Thomas Lamprecht wrote:
> On 12.11.21 12:50, Fabian Ebner wrote:
>> Am 12.11.21 um 09:45 schrieb Fabian Grünbichler:
>>> the recommended way is to first shutdown, then delnode, and never let it
>>> come back online, in which case corosync-cfgtool won't be able to kill
>>> the removed (offline) node.
>>>
>>> also, the order was wrong - if we first update corosync.conf to remove
>>> the node entry from the nodelist, corosync doesn't know about the nodeid
>>> anymore, so killing will fail even if the node is still online.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>> ---
>>>   data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>>> index 8f4a5bb..5a6a1ac 100644
>>> --- a/data/PVE/API2/ClusterConfig.pm
>>> +++ b/data/PVE/API2/ClusterConfig.pm
>>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>>             delete $nodelist->{$node};
>>>   -        PVE::Corosync::update_nodelist($conf, $nodelist);
>>> +        # allowed to fail when node is already shut down!
>>> +        eval {
>>> +        PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>>> +            if defined($nodeid);
>>> +        };
>>>   
>> 
>> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?
> 
> at least that examples is like ENOENT on unlink, an OK error (user could
> have -k'illed it before that).
>

IMHO it's okay to treat all errors as warnings here - if you follow the 
instructions killing is not possible. if you didn't follow them, and the 
node is online, but killing fails for some reason you still get the 
output, the node is removed from corosync.conf on all nodes, and thus no 
traffic is possible anymore between the cluster and the separated node 
(knet will reject traffic from unknown -i.e., not contained in the 
nodelist- nodes). no traffic means the separated node is kicked out of 
the quorum, so it can't do any harm anymore ;)




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node
  2021-11-12 12:46     ` Fabian Ebner
@ 2021-11-12 13:03       ` Thomas Lamprecht
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Lamprecht @ 2021-11-12 13:03 UTC (permalink / raw)
  To: Fabian Ebner, Proxmox VE development discussion, Fabian Grünbichler

On 12.11.21 13:46, Fabian Ebner wrote:
> Am 12.11.21 um 13:14 schrieb Thomas Lamprecht:
>> On 12.11.21 12:50, Fabian Ebner wrote:
>>> Am 12.11.21 um 09:45 schrieb Fabian Grünbichler:
>>>> the recommended way is to first shutdown, then delnode, and never let it
>>>> come back online, in which case corosync-cfgtool won't be able to kill
>>>> the removed (offline) node.
>>>>
>>>> also, the order was wrong - if we first update corosync.conf to remove
>>>> the node entry from the nodelist, corosync doesn't know about the nodeid
>>>> anymore, so killing will fail even if the node is still online.
>>>>
>>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>>> ---
>>>>    data/PVE/API2/ClusterConfig.pm | 8 ++++++--
>>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/data/PVE/API2/ClusterConfig.pm b/data/PVE/API2/ClusterConfig.pm
>>>> index 8f4a5bb..5a6a1ac 100644
>>>> --- a/data/PVE/API2/ClusterConfig.pm
>>>> +++ b/data/PVE/API2/ClusterConfig.pm
>>>> @@ -485,9 +485,13 @@ __PACKAGE__->register_method ({
>>>>              delete $nodelist->{$node};
>>>>    -        PVE::Corosync::update_nodelist($conf, $nodelist);
>>>> +        # allowed to fail when node is already shut down!
>>>> +        eval {
>>>> +        PVE::Tools::run_command(['corosync-cfgtool','-k', $nodeid])
>>>> +            if defined($nodeid);
>>>> +        };
>>>>    
>>>
>>> But what if it fails for a different reason than 'CS_ERR_NOT_EXIST'? Shouldn't we match the error?
>>
>> at least that examples is like ENOENT on unlink, an OK error (user could
>> have -k'illed it before that).
>>
> 
> My example is when it's *not* that error ;)
> With the patch we treat all errors as OK.

ah misread, sorry. But anyhow, the other error I see in exec/cfg.c's
message_handler_req_lib_cfg_killnode is CS_ERR_LIBRARY, and that is rather
unlikely.

maybe a warn $@ that silences the "ERR_NOT_EXISTS" one could be nice though




^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-11-12 13:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-12  8:45 [pve-devel] [PATCH cluster] fix #3596: handle delnode of offline node Fabian Grünbichler
2021-11-12 10:04 ` [pve-devel] applied: " Thomas Lamprecht
2021-11-12 11:50 ` [pve-devel] " Fabian Ebner
2021-11-12 12:14   ` Thomas Lamprecht
2021-11-12 12:46     ` Fabian Ebner
2021-11-12 13:03       ` Thomas Lamprecht
2021-11-12 12:59     ` Fabian Grünbichler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal