* [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error
@ 2020-09-25 12:53 Fabian Grünbichler
  2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2020-09-25 12:53 UTC (permalink / raw)
  To: pve-devel

dfsm_send_state_message_full always returns != 0, since it returns
cs_error_t which starts with CS_OK at 1, with values >1 representing
errors.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
unfortunately not the cause of Alexandre's shutdown/restart issue, but it
might have caused some hangs as well, since we would be stuck in
START_SYNC in that case..

 data/src/dfsm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..172d877 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -1190,7 +1190,7 @@ dfsm_cpg_confchg_callback(
 
 		dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC);
 		if (lowest_nodeid == dfsm->nodeid) {
-			if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) {
+			if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) {
 				cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message");
 				goto leave;
 			}
-- 
2.20.1






* [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 12:53 [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error Fabian Grünbichler
@ 2020-09-25 13:23 ` Thomas Lamprecht
  2020-09-25 13:36   ` Fabian Grünbichler
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Lamprecht @ 2020-09-25 13:23 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

On 25.09.20 14:53, Fabian Grünbichler wrote:
> dfsm_send_state_message_full always returns != 0, since it returns
> cs_error_t which starts with CS_OK at 1, with values >1 representing
> errors.
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> unfortunately not the cause of Alexandre's shutdown/restart issue, but it
> might have caused some hangs as well since we would be stuck in
> START_SYNC in that case..
> 
>  data/src/dfsm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
>

applied, thanks! But as the old wrong code showed up as a critical error
"failed to send SYNC_START message" if it worked, it either (almost) never
works here or is not a probable case, else we'd have seen this earlier.

(still a valid and appreciated fix, just noting)






* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
@ 2020-09-25 13:36   ` Fabian Grünbichler
  2020-09-25 13:48     ` Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2020-09-25 13:36 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion


> Thomas Lamprecht <t.lamprecht@proxmox.com> hat am 25.09.2020 15:23 geschrieben:
> 
>  
> On 25.09.20 14:53, Fabian Grünbichler wrote:
> > dfsm_send_state_message_full always returns != 0, since it returns
> > cs_error_t which starts with CS_OK at 1, with values >1 representing
> > errors.
> > 
> > Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> > ---
> > unfortunately not the cause of Alexandre's shutdown/restart issue, but it
> > might have caused some hangs as well since we would be stuck in
> > START_SYNC in that case..
> > 
> >  data/src/dfsm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> >
> 
> applied, thanks! But as the old wrong code showed up as a critical error
> "failed to send SYNC_START message" if it worked, it either (almost) never
> works here or is not a probable case, else we'd have seen this earlier.
> 
> (still a valid and appreciated fix, just noting)

No, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value cannot be 0, so the condition is never true. If the send failed, the code assumed the state machine was now in START_SYNC mode and waited for STATE messages, which would never come, since the other nodes hadn't switched to START_SYNC..

It would still show up in the logs, since a cpg_mcast_joined failure is always logged verbosely, but I think it would not be obvious that it caused the state machine to take a wrong turn.





* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 13:36   ` Fabian Grünbichler
@ 2020-09-25 13:48     ` Thomas Lamprecht
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2020-09-25 13:48 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox VE development discussion

On 25.09.20 15:36, Fabian Grünbichler wrote:
> 
>> Thomas Lamprecht <t.lamprecht@proxmox.com> hat am 25.09.2020 15:23 geschrieben:
>>
>>  
>> On 25.09.20 14:53, Fabian Grünbichler wrote:
>>> dfsm_send_state_message_full always returns != 0, since it returns
>>> cs_error_t which starts with CS_OK at 1, with values >1 representing
>>> errors.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>> ---
>>> unfortunately not the cause of Alexandre's shutdown/restart issue, but it
>>> might have caused some hangs as well since we would be stuck in
>>> START_SYNC in that case..
>>>
>>>  data/src/dfsm.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>>
>>
>> applied, thanks! But as the old wrong code showed up as a critical error
>> "failed to send SYNC_START message" if it worked, it either (almost) never
>> works here or is not a probable case, else we'd have seen this earlier.
>>
>> (still a valid and appreciated fix, just noting)
> 
> No, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value cannot be 0, so the condition is never true. If the send failed, the code assumed the state machine was now in START_SYNC mode and waited for STATE messages, which would never come, since the other nodes hadn't switched to START_SYNC..
> 

ah yeah, was confused about the CS_OK value for a moment


> It would still show up in the logs, since a cpg_mcast_joined failure is always logged verbosely, but I think it would not be obvious that it caused the state machine to take a wrong turn.
> 






