* [pve-devel] [PATCH corosync-pve 1/2] cherry-pick fixes
From: Fabian Grünbichler @ 2021-11-09 12:07 UTC
To: pve-devel
patch #3 should reduce network load in recovery situations
patch #4 fixes a cpg corruption issue discovered while investigating the
knet sequence number bug
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
...cel_hold_on_retransmit-config-option.patch | 132 ++++++++++++++++++
...ch-totempg-buffers-at-the-right-time.patch | 113 +++++++++++++++
debian/patches/series | 2 +
3 files changed, 247 insertions(+)
create mode 100644 debian/patches/0003-totem-Add-cancel_hold_on_retransmit-config-option.patch
create mode 100644 debian/patches/0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch
diff --git a/debian/patches/0003-totem-Add-cancel_hold_on_retransmit-config-option.patch b/debian/patches/0003-totem-Add-cancel_hold_on_retransmit-config-option.patch
new file mode 100644
index 0000000..7fd66cf
--- /dev/null
+++ b/debian/patches/0003-totem-Add-cancel_hold_on_retransmit-config-option.patch
@@ -0,0 +1,132 @@
+From cdf72925db5a81e546ca8e8d7d8291ee1fc77be4 Mon Sep 17 00:00:00 2001
+From: Jan Friesse <jfriesse@redhat.com>
+Date: Wed, 11 Aug 2021 17:34:05 +0200
+Subject: [PATCH 3/4] totem: Add cancel_hold_on_retransmit config option
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Previously, the existence of retransmit messages canceled holding
+of the token (and never allowed the representative to enter the token
+hold state).
+
+This makes the token rotate at maximum speed and keeps the processor
+resending messages over and over again, overloading the network
+and reducing the chance of successfully delivering the messages.
+
+There were also reports of various Antivirus / IPS / IDS products slowing
+down delivery of packets of certain sizes (packets bigger than the token),
+which makes Corosync retransmit messages over and over again.
+
+The proposed solution is to allow the representative to enter the token
+hold state when there are only retransmit messages. This allows the network
+to handle overload and/or gives Antivirus/IPS/IDS enough time to scan and
+deliver packets without corosync entering the "FAILED TO RECEIVE" state and
+adding more load to the network.
+
+Signed-off-by: Jan Friesse <jfriesse@redhat.com>
+Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
+Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
+---
+ include/corosync/totem/totem.h | 2 ++
+ exec/totemconfig.c | 6 ++++++
+ exec/totemsrp.c | 5 +++--
+ man/corosync.conf.5 | 15 ++++++++++++++-
+ 4 files changed, 25 insertions(+), 3 deletions(-)
+
+diff --git a/include/corosync/totem/totem.h b/include/corosync/totem/totem.h
+index 8b166566..bdb6a15f 100644
+--- a/include/corosync/totem/totem.h
++++ b/include/corosync/totem/totem.h
+@@ -244,6 +244,8 @@ struct totem_config {
+
+ unsigned int block_unlisted_ips;
+
++ unsigned int cancel_token_hold_on_retransmit;
++
+ void (*totem_memb_ring_id_create_or_load) (
+ struct memb_ring_id *memb_ring_id,
+ unsigned int nodeid);
+diff --git a/exec/totemconfig.c b/exec/totemconfig.c
+index 57a1587a..46e09952 100644
+--- a/exec/totemconfig.c
++++ b/exec/totemconfig.c
+@@ -81,6 +81,7 @@
+ #define MAX_MESSAGES 17
+ #define MISS_COUNT_CONST 5
+ #define BLOCK_UNLISTED_IPS 1
++#define CANCEL_TOKEN_HOLD_ON_RETRANSMIT 0
+ /* This constant is not used for knet */
+ #define UDP_NETMTU 1500
+
+@@ -144,6 +145,8 @@ static void *totem_get_param_by_name(struct totem_config *totem_config, const ch
+ return totem_config->knet_compression_model;
+ if (strcmp(param_name, "totem.block_unlisted_ips") == 0)
+ return &totem_config->block_unlisted_ips;
++ if (strcmp(param_name, "totem.cancel_token_hold_on_retransmit") == 0)
++ return &totem_config->cancel_token_hold_on_retransmit;
+
+ return NULL;
+ }
+@@ -365,6 +368,9 @@ void totem_volatile_config_read (struct totem_config *totem_config, icmap_map_t
+
+ totem_volatile_config_set_boolean_value(totem_config, temp_map, "totem.block_unlisted_ips", deleted_key,
+ BLOCK_UNLISTED_IPS);
++
++ totem_volatile_config_set_boolean_value(totem_config, temp_map, "totem.cancel_token_hold_on_retransmit",
++ deleted_key, CANCEL_TOKEN_HOLD_ON_RETRANSMIT);
+ }
+
+ int totem_volatile_config_validate (
+diff --git a/exec/totemsrp.c b/exec/totemsrp.c
+index 949d367b..d24b11fa 100644
+--- a/exec/totemsrp.c
++++ b/exec/totemsrp.c
+@@ -3981,8 +3981,9 @@ static int message_handler_orf_token (
+ transmits_allowed = fcc_calculate (instance, token);
+ mcasted_retransmit = orf_token_rtr (instance, token, &transmits_allowed);
+
+- if (instance->my_token_held == 1 &&
+- (token->rtr_list_entries > 0 || mcasted_retransmit > 0)) {
++ if (instance->totem_config->cancel_token_hold_on_retransmit &&
++ instance->my_token_held == 1 &&
++ (token->rtr_list_entries > 0 || mcasted_retransmit > 0)) {
+ instance->my_token_held = 0;
+ forward_token = 1;
+ }
+diff --git a/man/corosync.conf.5 b/man/corosync.conf.5
+index 0588ad1e..a3771ea7 100644
+--- a/man/corosync.conf.5
++++ b/man/corosync.conf.5
+@@ -32,7 +32,7 @@
+ .\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+ .\" * THE POSSIBILITY OF SUCH DAMAGE.
+ .\" */
+-.TH COROSYNC_CONF 5 2021-07-23 "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
++.TH COROSYNC_CONF 5 2021-08-11 "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
+ .SH NAME
+ corosync.conf - corosync executive configuration file
+
+@@ -584,6 +584,19 @@ with an old configuration.
+
+ The default value is yes.
+
++.TP
++cancel_token_hold_on_retransmit
++Allows Corosync to hold token by representative when there is too much
++retransmit messages. This allows network to process increased load without
++overloading it. Used mechanism is same as described for
++.B hold
++directive.
++
++Some deployments may prefer to never hold token when there is
++retransmit messages. If so, option should be set to yes.
++
++The default value is no.
++
+ .PP
+ Within the
+ .B logging
+--
+2.30.2
+
diff --git a/debian/patches/0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch b/debian/patches/0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch
new file mode 100644
index 0000000..2ef9215
--- /dev/null
+++ b/debian/patches/0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch
@@ -0,0 +1,113 @@
+From 7dce8bc0066c7c76eeb26cc8f6fe4de3221d6798 Mon Sep 17 00:00:00 2001
+From: Jan Friesse <jfriesse@redhat.com>
+Date: Tue, 26 Oct 2021 18:17:59 +0200
+Subject: [PATCH 4/4] totemsrp: Switch totempg buffers at the right time
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Commit 92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc added switching of
+totempg buffers in the sync phase. But because the buffers got switched
+too early, there was a problem when delivering recovered messages (messages
+got corrupted and/or lost). The solution is to switch the buffers after the
+recovered messages have been delivered.
+
+I think it is worth describing the complete history with reproducers so it
+doesn't get lost.
+
+It all started with 402638929e5045ef520a7339696c687fbed0b31b (more info
+about the original problem is in
+https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch
+solves a problem which can be reproduced as follows:
+- 2 nodes
+- Both nodes running corosync and testcpg
+- Pause node 1 (SIGSTOP of corosync)
+- On node 1, send some messages with testcpg
+ (it's not answering, but this doesn't matter; simply hitting the ENTER
+ key a few times is enough)
+- Wait till node 2 detects that node 1 left
+- Unpause node 1 (SIGCONT of corosync)
+
+On node 1, newly mcasted cpg messages then got sent before the sync barrier,
+so node 2 logs "Unknown node -> we will not deliver message".
+
+The solution was to add a switch of the totemsrp new messages buffer.
+
+That patch was not enough, so a new one
+(92e0f9c7bb9b4b6a0da8d64bdf3b2e47ae55b1cc) was created. The reproducer
+was similar, just with cpgverify used instead of testcpg.
+Occasionally, when node 1 was unpaused, it hung in the sync phase because
+there was a partial message in the totempg buffers. The new sync message had
+a different frag cont, so it was thrown away and never delivered.
+
+After many years, the problem solved by this patch was found
+(original issue described in
+https://github.com/corosync/corosync/issues/660).
+The reproducer is more complex:
+- 2 nodes
+- Node 1 is rate-limited (used script on the hypervisor side):
+ ```
+ iface=tapXXXX
+ # ~0.1MB/s in bit/s
+ rate=838856
+ # 1mb/s
+ burst=1048576
+ tc qdisc add dev $iface root handle 1: htb default 1
+ tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
+ burst ${burst}b
+ tc qdisc add dev $iface handle ffff: ingress
+ tc filter add dev $iface parent ffff: prio 50 basic police rate \
+ ${rate}bps burst ${burst}b mtu 64kb "drop"
+ ```
+- Node 2 is running corosync and cpgverify
+- Node 1 keeps restarting corosync and running cpgverify in a loop
+ - Console 1: while true; do corosync; sleep 20; \
+ kill $(pidof corosync); sleep 20; done
+ - Console 2: while true; do ./cpgverify;done
+
+From time to time (usually reproduced in less than 5 minutes)
+cpgverify reports a corrupted message.
+
+Signed-off-by: Jan Friesse <jfriesse@redhat.com>
+Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
+Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
+---
+ exec/totemsrp.c | 16 +++++++++++++++-
+ 1 file changed, 15 insertions(+), 1 deletion(-)
+
+diff --git a/exec/totemsrp.c b/exec/totemsrp.c
+index d24b11fa..fd71771b 100644
+--- a/exec/totemsrp.c
++++ b/exec/totemsrp.c
+@@ -1989,13 +1989,27 @@ static void memb_state_operational_enter (struct totemsrp_instance *instance)
+ trans_memb_list_totemip, instance->my_trans_memb_entries,
+ left_list, instance->my_left_memb_entries,
+ 0, 0, &instance->my_ring_id);
++ /*
++ * Switch new totemsrp messages queue. Messages sent from now on are stored
++ * in different queue so synchronization messages are delivered first. Totempg
++ * buffers will be switched later.
++ */
+ instance->waiting_trans_ack = 1;
+- instance->totemsrp_waiting_trans_ack_cb_fn (1);
+
+ // TODO we need to filter to ensure we only deliver those
+ // messages which are part of instance->my_deliver_memb
+ messages_deliver_to_app (instance, 1, instance->old_ring_state_high_seq_received);
+
++ /*
++ * Switch totempg buffers. This used to be right after
++ * instance->waiting_trans_ack = 1;
++ * line. This was causing problem, because there may be not yet
++ * processed parts of messages in totempg buffers.
++ * So when buffers were switched and recovered messages
++ * got delivered it was not possible to assemble them.
++ */
++ instance->totemsrp_waiting_trans_ack_cb_fn (1);
++
+ instance->my_aru = aru_save;
+
+ /*
+--
+2.30.2
+
diff --git a/debian/patches/series b/debian/patches/series
index fd3a2f0..74c8c39 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -1,2 +1,4 @@
0001-Enable-PrivateTmp-in-the-systemd-service-files.patch
0002-only-start-corosync.service-if-conf-exists.patch
+0003-totem-Add-cancel_hold_on_retransmit-config-option.patch
+0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch
--
2.30.2
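
A minimal sketch of the decision that patch 0003 changes in message_handler_orf_token, pulled out into a standalone predicate. The field names mirror the ones in the diff; `struct hold_state` and `must_cancel_hold` are hypothetical names used only for illustration and do not exist in corosync.

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the state the token handler consults. */
struct hold_state {
	bool cancel_token_hold_on_retransmit; /* config option, default: no */
	bool my_token_held;                   /* representative is holding the token */
	unsigned int rtr_list_entries;        /* retransmit requests carried by the token */
	unsigned int mcasted_retransmit;      /* retransmits this node just multicast */
};

/*
 * Returns true when the hold has to be cancelled and the token forwarded
 * immediately. With the option off (the new default), pending retransmits
 * alone no longer cancel the hold.
 */
bool must_cancel_hold(const struct hold_state *s)
{
	return s->cancel_token_hold_on_retransmit &&
	       s->my_token_held &&
	       (s->rtr_list_entries > 0 || s->mcasted_retransmit > 0);
}
```

With the option left at its default, a held token with three queued retransmit requests stays held; setting totem.cancel_token_hold_on_retransmit to yes restores the previous behaviour.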
* [pve-devel] [PATCH corosync-pve 2/2] bump version to 3.1.5-pve2
From: Fabian Grünbichler @ 2021-11-09 12:07 UTC
To: pve-devel
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
debian/changelog | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/debian/changelog b/debian/changelog
index 018d175..21c7a19 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,11 @@
+corosync (3.1.5-pve2) bullseye; urgency=medium
+
+ * cherry-pick fix for high retransmit load
+
+ * cherry-pick fix for CPG corruption during membership change bug
+
+ -- Proxmox Support Team <support@proxmox.com> Tue, 9 Nov 2021 11:50:52 +0100
+
corosync (3.1.5-pve1) bullseye; urgency=medium
* update to v3.1.5 upstream release
--
2.30.2
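
The CPG corruption fix listed in the changelog comes down to ordering: recovered messages have to be delivered before the reassembly buffers are switched. The sketch below is a toy model of that idea, not totempg's real data structures; `struct reassembler`, `switch_buffers` and `deliver_fragment` are hypothetical names, and fragments are plain C strings for brevity.

```c
#include <stddef.h>
#include <string.h>

#define FRAG_BUF_SIZE 64

struct frag_buf {
	char data[FRAG_BUF_SIZE];
	size_t len;
};

/* Two reassembly buffers, only one of which receives fragments at a time. */
struct reassembler {
	struct frag_buf bufs[2];
	int active;
};

/* Make the other buffer active and start it empty. */
void switch_buffers(struct reassembler *r)
{
	r->active = !r->active;
	r->bufs[r->active].len = 0;
}

/*
 * Append one fragment to the active buffer. On the last fragment the
 * assembled message is copied to out (NUL-terminated) and its length is
 * returned; otherwise 0 is returned.
 */
size_t deliver_fragment(struct reassembler *r, const char *frag, int last, char *out)
{
	struct frag_buf *b = &r->bufs[r->active];
	size_t n = strlen(frag);

	if (b->len + n >= FRAG_BUF_SIZE) {
		b->len = 0;	/* oversized: drop what was collected */
		return 0;
	}
	memcpy(b->data + b->len, frag, n);
	b->len += n;
	if (!last) {
		return 0;
	}
	memcpy(out, b->data, b->len);
	out[b->len] = '\0';
	n = b->len;
	b->len = 0;
	return n;
}
```

In this model, switching between the first fragment and the recovered remainder leaves the remainder alone in an empty buffer, so only a truncated message comes out; delivering the remainder first and switching afterwards yields the complete message.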
* [pve-devel] [PATCH kronosnet 1/2] fix #3672: cherry-pick knet fixes
From: Fabian Grünbichler @ 2021-11-09 12:07 UTC
To: pve-devel
see https://github.com/corosync/corosync/issues/660 as well. these are
already queued for 1.23 and taken straight from stable1-proposed.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
...eq_num-initialization-race-condition.patch | 53 +++++++++++
...or-messages-to-trigger-faster-link-d.patch | 92 +++++++++++++++++++
debian/patches/series | 3 +-
3 files changed, 147 insertions(+), 1 deletion(-)
create mode 100644 debian/patches/0001-host-fix-dst_seq_num-initialization-race-condition.patch
create mode 100644 debian/patches/0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch
diff --git a/debian/patches/0001-host-fix-dst_seq_num-initialization-race-condition.patch b/debian/patches/0001-host-fix-dst_seq_num-initialization-race-condition.patch
new file mode 100644
index 0000000..d01e0d4
--- /dev/null
+++ b/debian/patches/0001-host-fix-dst_seq_num-initialization-race-condition.patch
@@ -0,0 +1,53 @@
+From 7eebe93c5039dad432bdd40101287e7fc04b3d10 Mon Sep 17 00:00:00 2001
+From: "Fabio M. Di Nitto" <fdinitto@redhat.com>
+Date: Mon, 8 Nov 2021 09:14:22 +0100
+Subject: [PATCH 1/2] [host] fix dst_seq_num initialization race condition
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+There is a potential race condition where the sender
+is overloaded, sending data packets before pings
+can kick in and set the correct dst_seq_num.
+
+If this node is starting up (dst_seq_num = 0),
+it can start rejecting valid packets and get stuck.
+
+Set the dst_seq_num to the first seen packet and
+use that as reference instead.
+
+Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
+Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
+---
+ libknet/host.c | 15 +++++++++++++++
+ 1 file changed, 15 insertions(+)
+
+diff --git a/libknet/host.c b/libknet/host.c
+index ec73c0df..6fca01f8 100644
+--- a/libknet/host.c
++++ b/libknet/host.c
+@@ -573,6 +573,21 @@ int _seq_num_lookup(struct knet_host *host, seq_num_t seq_num, int defrag_buf, i
+ char *dst_cbuf_defrag = host->circular_buffer_defrag;
+ seq_num_t *dst_seq_num = &host->rx_seq_num;
+
++ /*
++ * There is a potential race condition where the sender
++ * is overloaded, sending data packets before pings
++ * can kick in and set the correct dst_seq_num.
++ *
++ * if this node is starting up (dst_seq_num = 0),
++ * it can start rejecting valid packets and get stuck.
++ *
++ * Set the dst_seq_num to the first seen packet and
++ * use that as reference instead.
++ */
++ if (!*dst_seq_num) {
++ *dst_seq_num = seq_num;
++ }
++
+ if (clear_buf) {
+ _clear_cbuffers(host, seq_num);
+ }
+--
+2.30.2
+
diff --git a/debian/patches/0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch b/debian/patches/0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch
new file mode 100644
index 0000000..c8a9990
--- /dev/null
+++ b/debian/patches/0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch
@@ -0,0 +1,92 @@
+From 1d52003ae7814ebf2b47c1ac3463c7d82486a5fd Mon Sep 17 00:00:00 2001
+From: "Fabio M. Di Nitto" <fdinitto@redhat.com>
+Date: Sun, 7 Nov 2021 17:02:05 +0100
+Subject: [PATCH 2/2] [udp] use ICMP error messages to trigger faster link down
+ detection
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+this solves a possible race condition when:
+
+- node1 is running
+- node2 very fast
+- node1 does NOT have enough time to detect that node2 has gone
+ and reset the local seq numbers / buffers
+- node1 will start rejecting valid packets from node2
+
+There is still a potential minor race condition where app
+can restart so fast that kernel / network don't have time
+to generate an ICMP error. This will be addressed using
+instance id in onwire v2 protocol, as suggested by Jan F.
+
+Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
+Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
+---
+ libknet/transport_udp.c | 44 +++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 44 insertions(+)
+
+diff --git a/libknet/transport_udp.c b/libknet/transport_udp.c
+index 482c99b1..a1419c89 100644
+--- a/libknet/transport_udp.c
++++ b/libknet/transport_udp.c
+@@ -364,6 +364,46 @@ static int read_errs_from_sock(knet_handle_t knet_h, int sockfd)
+ log_debug(knet_h, KNET_SUB_TRANSP_UDP, "Received ICMP error from %s: %s destination unknown", addr_str, strerror(sock_err->ee_errno));
+ } else {
+ log_debug(knet_h, KNET_SUB_TRANSP_UDP, "Received ICMP error from %s: %s %s", addr_str, strerror(sock_err->ee_errno), addr_remote_str);
++ if ((sock_err->ee_errno == ECONNREFUSED) || /* knet is not running on the other node */
++ (sock_err->ee_errno == ECONNABORTED) || /* local kernel closed the socket */
++ (sock_err->ee_errno == ENONET) || /* network does not exist */
++ (sock_err->ee_errno == ENETUNREACH) || /* network unreachable */
++ (sock_err->ee_errno == EHOSTUNREACH) || /* host unreachable */
++ (sock_err->ee_errno == EHOSTDOWN) || /* host down (from kernel/net/ipv4/icmp.c */
++ (sock_err->ee_errno == ENETDOWN)) { /* network down */
++ struct knet_host *host = NULL;
++ struct knet_link *kn_link = NULL;
++ int link_idx, found = 0;
++
++ for (host = knet_h->host_head; host != NULL; host = host->next) {
++ for (link_idx = 0; link_idx < KNET_MAX_LINK; link_idx++) {
++ kn_link = &host->link[link_idx];
++ if (kn_link->outsock == sockfd) {
++ if (!cmpaddr(&remote, &kn_link->dst_addr)) {
++ found = 1;
++ break;
++ }
++ }
++ }
++ if (found) {
++ break;
++ }
++ }
++
++ if ((host) && (kn_link) &&
++ (kn_link->status.connected)) {
++ log_debug(knet_h, KNET_SUB_TRANSP_UDP, "Setting down host %u link %i", host->host_id, kn_link->link_id);
++ /*
++ * setting transport_connected = 0 will trigger
++ * thread_heartbeat link_down process.
++ *
++ * the process terminates calling into transport_link_down
++ * below that will set transport_connected = 1
++ */
++ kn_link->transport_connected = 0;
++ }
++
++ }
+ }
+ }
+ break;
+@@ -436,5 +476,9 @@ int udp_transport_link_dyn_connect(knet_handle_t knet_h, int sockfd, struct knet
+
+ int udp_transport_link_is_down(knet_handle_t knet_h, struct knet_link *kn_link)
+ {
++ /*
++ * see comments about handling ICMP error messages
++ */
++ kn_link->transport_connected = 1;
+ return 0;
+ }
+--
+2.30.2
+
diff --git a/debian/patches/series b/debian/patches/series
index 8b13789..16fba19 100644
--- a/debian/patches/series
+++ b/debian/patches/series
@@ -1 +1,2 @@
-
+0001-host-fix-dst_seq_num-initialization-race-condition.patch
+0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch
--
2.30.2
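
A rough sketch of why adopting the first seen sequence number helps, as done in patch 0001. The acceptance window below is a deliberately simplified stand-in for knet's real lookup logic; `struct rx_state`, `SEQ_WINDOW` and `seq_accept` are hypothetical, and the 16-bit width of `seq_num_t` is an assumption of this sketch.

```c
#include <stdint.h>

typedef uint16_t seq_num_t;	/* assumption: 16-bit wrapping sequence numbers */

#define SEQ_WINDOW 128		/* hypothetical acceptance window */

struct rx_state {
	seq_num_t rx_seq_num;	/* 0 means "no reference yet" */
};

/*
 * Returns 1 if the packet is accepted, 0 if it is rejected.
 * adopt_first selects the patched behaviour: with no reference yet,
 * the first seen packet becomes the reference.
 */
int seq_accept(struct rx_state *st, seq_num_t seq, int adopt_first)
{
	seq_num_t ahead;

	if (adopt_first && st->rx_seq_num == 0) {
		st->rx_seq_num = seq;
	}

	/* distance ahead of the reference, modulo 2^16 */
	ahead = (seq_num_t)(seq - st->rx_seq_num);
	if (ahead < SEQ_WINDOW) {
		st->rx_seq_num = seq;
		return 1;
	}
	return 0;
}
```

A freshly started receiver that first sees sequence number 40000 rejects it when the reference is stuck at 0, but accepts it (and the following 40001) once the first packet is adopted as the reference.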
* [pve-devel] [PATCH kronosnet 2/2] bump version to 1.22-pve2
From: Fabian Grünbichler @ 2021-11-09 12:07 UTC
To: pve-devel
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
---
debian/changelog | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/debian/changelog b/debian/changelog
index b154415..2ef406a 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,9 @@
+kronosnet (1.22-pve2) bullseye; urgency=medium
+
+ * cherry pick fixes for membership change under high network load
+
+ -- Proxmox Support Team <support@proxmox.com> Tue, 9 Nov 2021 11:44:52 +0100
+
kronosnet (1.22-pve1) bullseye; urgency=medium
* update to v1.22 upstream release
--
2.30.2
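
The faster link-down detection among the cherry-picked knet fixes hinges on which socket errors are taken to mean "the peer is gone". A minimal sketch of that classification, using the errno list from the patch; the helper name `errno_means_link_down` is hypothetical, and ENONET / EHOSTDOWN are assumed to be available as on Linux.

```c
#include <errno.h>
#include <stdbool.h>

/* True for the error codes the patch treats as a reason to mark a link down. */
bool errno_means_link_down(int err)
{
	switch (err) {
	case ECONNREFUSED:	/* knet is not running on the other node */
	case ECONNABORTED:	/* local kernel closed the socket */
	case ENONET:		/* network does not exist */
	case ENETUNREACH:	/* network unreachable */
	case EHOSTUNREACH:	/* host unreachable */
	case EHOSTDOWN:		/* host down */
	case ENETDOWN:		/* network down */
		return true;
	default:
		return false;
	}
}
```

Anything else, for example EMSGSIZE, is only logged and leaves the link state untouched.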
* Re: [pve-devel] [PATCH corosync-pve/kronosnet 0/4] cherry-pick bug fixes
From: Thomas Lamprecht @ 2021-11-09 12:31 UTC
To: Proxmox VE development discussion, Fabian Grünbichler
On 09.11.21 13:07, Fabian Grünbichler wrote:
> culmination of 4 weeks of triaging together with the respective upstream
> devs and endless hours staring at corosync debug traces, this fixes the
> following issues:
>
> - knet losing join messages if network is overloaded, pushing corosync
> into a retransmit loop, potentially causing a full-cluster fence event
> with just a single node acting up
> - corosync potentially corrupting messages during membership changes
>
> and another one reported by someone else:
>
> - corosync causing high network load by not holding the token in case
> messages are queued for retransmission
>
> all of the fixes are taken from the respective stable queue with
> releases slated for later this week.
>
> corosync:
>
> Fabian Grünbichler (2):
> cherry-pick fixes
> bump version to 3.1.5-pve2
>
> ...cel_hold_on_retransmit-config-option.patch | 132 ++++++++++++++++++
> ...ch-totempg-buffers-at-the-right-time.patch | 113 +++++++++++++++
> debian/changelog | 8 ++
> debian/patches/series | 2 +
> 4 files changed, 255 insertions(+)
> create mode 100644 debian/patches/0003-totem-Add-cancel_hold_on_retransmit-config-option.patch
> create mode 100644 debian/patches/0004-totemsrp-Switch-totempg-buffers-at-the-right-time.patch
>
> kronosnet:
>
> Fabian Grünbichler (2):
> fix #3672: cherry-pick knet fixes
> bump version to 1.22-pve2
>
> ...eq_num-initialization-race-condition.patch | 53 +++++++++++
> ...or-messages-to-trigger-faster-link-d.patch | 92 +++++++++++++++++++
> debian/changelog | 6 ++
> debian/patches/series | 3 +-
> 4 files changed, 153 insertions(+), 1 deletion(-)
> create mode 100644 debian/patches/0001-host-fix-dst_seq_num-initialization-race-condition.patch
> create mode 100644 debian/patches/0002-udp-use-ICMP-error-messages-to-trigger-faster-link-d.patch
>
For all of this:
Acked-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Can you go ahead and push + upload packages?
* [pve-devel] applied-series: [PATCH corosync-pve/kronosnet 0/4] cherry-pick bug fixes
From: Fabian Grünbichler @ 2021-11-09 12:54 UTC
To: Proxmox VE development discussion, Thomas Lamprecht
On November 9, 2021 1:31 pm, Thomas Lamprecht wrote:
> For all of this:
>
> Acked-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
> Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
>
> Can you go ahead and push + upload packages?
>
done
* Re: [pve-devel] applied-series: [PATCH corosync-pve/kronosnet 0/4] cherry-pick bug fixes
From: Eneko Lacunza @ 2021-11-09 13:21 UTC
To: pve-devel
Hi,
Nice to see this here, I think we have been affected by this for the past
weeks (since we upgraded to PVE 7...); I was starting to think we had
a faulty network :)
How can I know when this gets to the community repo?
Thanks
On 9/11/21 at 13:54, Fabian Grünbichler wrote:
> done
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
* Re: [pve-devel] applied-series: [PATCH corosync-pve/kronosnet 0/4] cherry-pick bug fixes
From: Fabian Grünbichler @ 2021-11-09 17:39 UTC
To: Proxmox VE development discussion
On November 9, 2021 2:21 pm, Eneko Lacunza wrote:
> Hi,
>
> Nice to see this here, I think we have been affected by this for the past
> weeks (since we upgraded to PVE 7...); I was starting to think we had
> a faulty network :)
>
> How can I know when this gets to the community repo?
>
> Thanks
it's available on pvetest already