public inbox for pve-devel@lists.proxmox.com
* [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error
@ 2020-09-25 12:53 Fabian Grünbichler
  2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2020-09-25 12:53 UTC (permalink / raw)
  To: pve-devel

dfsm_send_state_message_full always returns != 0, since it returns
cs_error_t which starts with CS_OK at 1, with values >1 representing
errors.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
unfortunately not that cause of Alexandre's shutdown/restart issue, but
might have caused some hangs as well since we would be stuck in
START_SYNC in that case..

 data/src/dfsm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..172d877 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -1190,7 +1190,7 @@ dfsm_cpg_confchg_callback(
 
 		dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC);
 		if (lowest_nodeid == dfsm->nodeid) {
-			if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) {
+			if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) {
 				cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message");
 				goto leave;
 			}
-- 
2.20.1






* [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 12:53 [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error Fabian Grünbichler
@ 2020-09-25 13:23 ` Thomas Lamprecht
  2020-09-25 13:36   ` Fabian Grünbichler
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Lamprecht @ 2020-09-25 13:23 UTC (permalink / raw)
  To: Proxmox VE development discussion, Fabian Grünbichler

On 25.09.20 14:53, Fabian Grünbichler wrote:
> dfsm_send_state_message_full always returns != 0, since it returns
> cs_error_t which starts with CS_OK at 1, with values >1 representing
> errors.
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> unfortunately not that cause of Alexandre's shutdown/restart issue, but
> might have caused some hangs as well since we would be stuck in
> START_SYNC in that case..
> 
>  data/src/dfsm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
>

applied, thanks! But as the old wrong code showed up as a critical error
"failed to send SYNC_START message" if it worked, it either (almost) never
works here or is not a probable case, else we'd have seen this earlier.

(still a valid and appreciated fix, just noting)






* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
@ 2020-09-25 13:36   ` Fabian Grünbichler
  2020-09-25 13:48     ` Thomas Lamprecht
  0 siblings, 1 reply; 4+ messages in thread
From: Fabian Grünbichler @ 2020-09-25 13:36 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion


> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 25.09.2020 at 15:23:
> 
>  
> On 25.09.20 14:53, Fabian Grünbichler wrote:
> > dfsm_send_state_message_full always returns != 0, since it returns
> > cs_error_t which starts with CS_OK at 1, with values >1 representing
> > errors.
> > 
> > Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> > ---
> > unfortunately not that cause of Alexandre's shutdown/restart issue, but
> > might have caused some hangs as well since we would be stuck in
> > START_SYNC in that case..
> > 
> >  data/src/dfsm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> >
> 
> applied, thanks! But as the old wrong code showed up as a critical error
> "failed to send SYNC_START message" if it worked, it either (almost) never
> works here or is not a probable case, else we'd have seen this earlier.
> 
> (still a valid and appreciated fix, just noting)

no, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value cannot be 0, so the condition is never true. if the send failed, the code assumed the state machine is now in START_SYNC mode and waits for STATE messages, which will never come since the other nodes haven't switched to START_SYNC..

it would still show up in the logs since cpg_mcast_joined failure is always verbose in the logs, but it would not be obvious that it caused the state machine to take a wrong turn I think.





* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
  2020-09-25 13:36   ` Fabian Grünbichler
@ 2020-09-25 13:48     ` Thomas Lamprecht
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Lamprecht @ 2020-09-25 13:48 UTC (permalink / raw)
  To: Fabian Grünbichler, Proxmox VE development discussion

On 25.09.20 15:36, Fabian Grünbichler wrote:
> 
>> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 25.09.2020 at 15:23:
>>
>>  
>> On 25.09.20 14:53, Fabian Grünbichler wrote:
>>> dfsm_send_state_message_full always returns != 0, since it returns
>>> cs_error_t which starts with CS_OK at 1, with values >1 representing
>>> errors.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>> ---
>>> unfortunately not that cause of Alexandre's shutdown/restart issue, but
>>> might have caused some hangs as well since we would be stuck in
>>> START_SYNC in that case..
>>>
>>>  data/src/dfsm.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>>
>>
>> applied, thanks! But as the old wrong code showed up as critical error
>> "failed to send SYNC_START message" if it worked, it either (almost) never
>> works here or is not a probable case, else we'd saw this earlier.
>>
>> (still a valid and appreciated fix, just noting)
> 
> no, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value cannot be 0, so the condition is never true. if the send failed, the code assumed the state machine is now in START_SYNC mode and waits for STATE messages, which will never come since the other nodes haven't switched to START_SYNC..
> 

ah yeah, was confused about the CS_OK value for a moment


> it would still show up in the logs since cpg_mcast_joined failure is always verbose in the logs, but it would not be obvious that it caused the state machine to take a wrong turn I think.
> 






