* [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error
@ 2020-09-25 12:53 Fabian Grünbichler
2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
From: Fabian Grünbichler @ 2020-09-25 12:53 UTC
To: pve-devel
dfsm_send_state_message_full always returns != 0, since it returns
cs_error_t which starts with CS_OK at 1, with values >1 representing
errors.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
unfortunately not the cause of Alexandre's shutdown/restart issue, but it
might have caused some hangs as well, since we would be stuck in
START_SYNC in that case.
data/src/dfsm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/data/src/dfsm.c b/data/src/dfsm.c
index 529c7f9..172d877 100644
--- a/data/src/dfsm.c
+++ b/data/src/dfsm.c
@@ -1190,7 +1190,7 @@ dfsm_cpg_confchg_callback(
dfsm_set_mode(dfsm, DFSM_MODE_START_SYNC);
if (lowest_nodeid == dfsm->nodeid) {
- if (!dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0)) {
+ if (dfsm_send_state_message_full(dfsm, DFSM_MESSAGE_SYNC_START, NULL, 0) != CS_OK) {
cfs_dom_critical(dfsm->log_domain, "failed to send SYNC_START message");
goto leave;
}
--
2.20.1
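
A minimal standalone sketch of the issue - using a simplified stand-in for
corosync's cs_error_t and a hypothetical fake_send() helper rather than the
real dfsm code - shows why the old !ret check can never fire, while the
ret != CS_OK check from the diff above does:

#include <stdio.h>

/* Simplified stand-in for corosync's cs_error_t: CS_OK is 1 and error codes
 * are greater than 1, so the returned value is never 0. CS_ERR_EXAMPLE is
 * purely illustrative. */
typedef enum { CS_OK = 1, CS_ERR_EXAMPLE = 2 } cs_error_example_t;

/* Hypothetical sender standing in for dfsm_send_state_message_full(). */
static cs_error_example_t fake_send(int fail)
{
    return fail ? CS_ERR_EXAMPLE : CS_OK;
}

int main(void)
{
    for (int fail = 0; fail <= 1; fail++) {
        cs_error_example_t ret = fake_send(fail);
        /* old check: !ret is always 0, the error branch is unreachable */
        /* new check: ret != CS_OK is 1 exactly when the send failed    */
        printf("fail=%d  old check fires: %d  new check fires: %d\n",
               fail, !ret, ret != CS_OK);
    }
    return 0;
}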
* [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
2020-09-25 12:53 [pve-devel] [PATCH cluster] pmxcfs sync: properly check for corosync error Fabian Grünbichler
@ 2020-09-25 13:23 ` Thomas Lamprecht
2020-09-25 13:36 ` Fabian Grünbichler
From: Thomas Lamprecht @ 2020-09-25 13:23 UTC
To: Proxmox VE development discussion, Fabian Grünbichler
On 25.09.20 14:53, Fabian Grünbichler wrote:
> dfsm_send_state_message_full always returns != 0, since it returns
> cs_error_t which starts with CS_OK at 1, with values >1 representing
> errors.
>
> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> ---
> unfortunately not the cause of Alexandre's shutdown/restart issue, but it
> might have caused some hangs as well, since we would be stuck in
> START_SYNC in that case.
>
> data/src/dfsm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>
applied, thanks! But as the old wrong code showed up as a critical error
"failed to send SYNC_START message" if it worked, it either (almost) never
works here or is not a probable case, else we'd have seen this earlier.
(still a valid and appreciated fix, just noting)
* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
2020-09-25 13:23 ` [pve-devel] applied: " Thomas Lamprecht
@ 2020-09-25 13:36 ` Fabian Grünbichler
2020-09-25 13:48 ` Thomas Lamprecht
From: Fabian Grünbichler @ 2020-09-25 13:36 UTC
To: Thomas Lamprecht, Proxmox VE development discussion
> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 25.09.2020 15:23:
>
>
> On 25.09.20 14:53, Fabian Grünbichler wrote:
> > dfsm_send_state_message_full always returns != 0, since it returns
> > cs_error_t which starts with CS_OK at 1, with values >1 representing
> > errors.
> >
> > Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
> > ---
> > unfortunately not the cause of Alexandre's shutdown/restart issue, but it
> > might have caused some hangs as well, since we would be stuck in
> > START_SYNC in that case.
> >
> > data/src/dfsm.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >
>
> applied, thanks! But as the old wrong code showed up as a critical error
> "failed to send SYNC_START message" if it worked, it either (almost) never
> works here or is not a probable case, else we'd have seen this earlier.
>
> (still a valid and appreciated fix, just noting)
No, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value can never be 0, so the condition is never true. If the send failed, the code still assumed the state machine was now in START_SYNC mode and waited for STATE messages, which would never come since the other nodes hadn't switched to START_SYNC.
It would still show up in the logs, since a cpg_mcast_joined failure is always logged verbosely, but it would not be obvious that it caused the state machine to take a wrong turn, I think.
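
A rough sketch of that control flow, with purely illustrative names (not the
actual dfsm.c code) and the same simplified cs_error_t stand-in as in the
sketch further up:

#include <stdio.h>

/* Stand-ins, same simplification as in the earlier sketch. */
typedef enum { CS_OK = 1, CS_ERR_EXAMPLE = 2 } cs_error_example_t;
typedef enum { MODE_START_SYNC_EXAMPLE, MODE_LEAVE_EXAMPLE } dfsm_mode_example_t;

static dfsm_mode_example_t after_confchg(cs_error_example_t send_result, int old_check)
{
    /* old check: "send_result == 0" can never hold, so a failed send is
     * silently ignored and the node stays in START_SYNC, waiting for STATE
     * messages that the peers (which never saw SYNC_START) will not send.
     * new check: "send_result != CS_OK" catches the failure and leaves. */
    int send_failed = old_check ? (send_result == 0)
                                : (send_result != CS_OK);
    if (send_failed)
        return MODE_LEAVE_EXAMPLE;   /* log critical error + leave */
    return MODE_START_SYNC_EXAMPLE;  /* wait for STATE messages from peers */
}

int main(void)
{
    printf("failed send, old check -> %s\n",
           after_confchg(CS_ERR_EXAMPLE, 1) == MODE_LEAVE_EXAMPLE
               ? "leave" : "stuck in START_SYNC");
    printf("failed send, new check -> %s\n",
           after_confchg(CS_ERR_EXAMPLE, 0) == MODE_LEAVE_EXAMPLE
               ? "leave" : "stuck in START_SYNC");
    return 0;
}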
* Re: [pve-devel] applied: [PATCH cluster] pmxcfs sync: properly check for corosync error
2020-09-25 13:36 ` Fabian Grünbichler
@ 2020-09-25 13:48 ` Thomas Lamprecht
From: Thomas Lamprecht @ 2020-09-25 13:48 UTC
To: Fabian Grünbichler, Proxmox VE development discussion
On 25.09.20 15:36, Fabian Grünbichler wrote:
>
>> Thomas Lamprecht <t.lamprecht@proxmox.com> wrote on 25.09.2020 15:23:
>>
>>
>> On 25.09.20 14:53, Fabian Grünbichler wrote:
>>> dfsm_send_state_message_full always returns != 0, since it returns
>>> cs_error_t which starts with CS_OK at 1, with values >1 representing
>>> errors.
>>>
>>> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
>>> ---
>>> unfortunately not the cause of Alexandre's shutdown/restart issue, but it
>>> might have caused some hangs as well, since we would be stuck in
>>> START_SYNC in that case.
>>>
>>> data/src/dfsm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>>
>>
>> applied, thanks! But as the old wrong code showed up as a critical error
>> "failed to send SYNC_START message" if it worked, it either (almost) never
>> works here or is not a probable case, else we'd have seen this earlier.
>>
>> (still a valid and appreciated fix, just noting)
>
> No, the old wrong code never triggered the error handling (log + leave), no matter whether the send worked or failed - the return value can never be 0, so the condition is never true. If the send failed, the code still assumed the state machine was now in START_SYNC mode and waited for STATE messages, which would never come since the other nodes hadn't switched to START_SYNC.
>
Ah yeah, I was confused about the CS_OK value for a moment.
> It would still show up in the logs, since a cpg_mcast_joined failure is always logged verbosely, but it would not be obvious that it caused the state machine to take a wrong turn, I think.
>