From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pve-devel-bounces@lists.proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
	by lore.proxmox.com (Postfix) with ESMTPS id 4A1AD1FF189
	for <inbox@lore.proxmox.com>; Fri,  4 Apr 2025 10:30:47 +0200 (CEST)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
	by firstgate.proxmox.com (Proxmox) with ESMTP id 48F7F197E8;
	Fri,  4 Apr 2025 10:30:35 +0200 (CEST)
References: <20250404075957.80057-1-f.weber@proxmox.com>
User-agent: mu4e 1.10.8; emacs 30.1
From: Maximiliano Sandoval <m.sandoval@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Date: Fri, 04 Apr 2025 10:14:17 +0200
In-reply-to: <20250404075957.80057-1-f.weber@proxmox.com>
Message-ID: <s8oa58w72hl.fsf@proxmox.com>
MIME-Version: 1.0
Subject: Re: [pve-devel] [PATCH corosync] corosync.service: add patch to
 reduce log spam in broken network setups
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
Reply-To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: pve-devel-bounces@lists.proxmox.com
Sender: "pve-devel" <pve-devel-bounces@lists.proxmox.com>


Friedrich Weber <f.weber@proxmox.com> writes:

> Since c761053 ("Check packets come from the correct interface
> https://github.com/corosync/corosync/issues/750") in kronosnet,
> corosync will produce log messages in certain broken network setups.
> See inner patch for details. Drawing attention to such setups is
> desirable because such setups may experience whole-cluster fences if
> the watchdog is active, see [1].
>
> However, the log volume in such broken setups can be inconveniently
> high. In such a setup, when running the following on node 1:
>
>   # for i in $(seq 100); do dd if=/dev/urandom bs=1M of=/etc/pve/test.bin count=1; done
>
> On node 2, bursts of ~1300 messages per second are observed:
>
>   # journalctl --since="1min ago" -u corosync.service \
>     | cut -d' ' -f 1-3 | uniq -c | sort -n | tail -n 10
>       8 Apr 04 09:51:20
>       8 Apr 04 09:51:24
>       8 Apr 04 09:51:30
>       8 Apr 04 09:51:34
>       8 Apr 04 09:51:40
>      12 Apr 04 09:51:00
>     196 Apr 04 09:51:46
>    1283 Apr 04 09:51:44
>    1329 Apr 04 09:51:43
>    1370 Apr 04 09:51:45
>
> To avoid cluttering the journal, rate-limit log messages to 200 per
> second. See inner patch for details.
>
> [1] https://github.com/corosync/corosync/issues/750
>
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
>
> Notes:
>     I'm a little confused about the rate limit, as with this patch I do
>     see that systemd suppresses messages:
>     
>     Apr 04 09:52:54 coro3 systemd-journald[303]: Suppressed 196 messages from corosync.service
>     
>     but I still see way more than 200 messages per second:
>     
>          11 Apr 04 09:52:00
>          11 Apr 04 09:52:45
>          13 Apr 04 09:52:13
>          14 Apr 04 09:52:12
>          19 Apr 04 09:52:07
>          67 Apr 04 09:52:08
>         400 Apr 04 09:52:54
>         695 Apr 04 09:52:52
>         715 Apr 04 09:52:53
>         835 Apr 04 09:52:51
>     
>     Any idea why?
>
>  ...-rate-limit-log-messages-to-200-per-.patch | 54 +++++++++++++++++++
>  debian/patches/series                         |  1 +
>  2 files changed, 55 insertions(+)
>  create mode 100644 debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
>
> diff --git a/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch b/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
> new file mode 100644
> index 0000000..0f91b42
> --- /dev/null
> +++ b/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
> @@ -0,0 +1,54 @@
> +From 5470f01296a3bd8f47fd4bd97939b3a68f00d309 Mon Sep 17 00:00:00 2001
> +From: Friedrich Weber <f.weber@proxmox.com>
> +Date: Fri, 4 Apr 2025 09:14:21 +0200
> +Subject: [PATCH] corosync.service: rate limit log messages to 200 per second
> +
> +Since c761053 ("Check packets come from the correct interface
> +https://github.com/corosync/corosync/issues/750") in kronosnet,
> +corosync will log a message like the following every time a packet is
> +received at the wrong interface, i.e., not the interface on which the
> +corresponding IP is configured:
> +
> +> [KNET  ] udp: Received packet from 10.8.1.1 to 10.8.1.3 on i/f ens20 when expected ens19
> +
> +This is to draw attention to broken network setups that
> +appear to work fine as long as all corosync links are online, but once
> +a link goes down, may go into a state of "asymmetric connectivity"
> +which is problematic for corosync. See [1] for more details.
> +
> +While it is desirable to draw attention to broken setups, the volume
> +of log messages in such clusters can get very high and clutter the
> +journal. In extreme scenarios, occasional bursts of more than 1000
> +messages per second were observed. If we approximate each message with
> +100 bytes, logging 1000 messages per second will produce ~8 GiB of raw
> +logs per day. While this should be a worst case scenario and the
> +logs probably compress well, the volume is still inconveniently high.
> +
> +Hence, use systemd log rate limiting to limit corosync log messages to
> +200 per second, which brings the logs in above scenario down to 1.6
> +GiB/day and should still provide enough headroom to avoid suppressing
> +benign log messages in non-broken setups.
> +
> +[1] https://github.com/corosync/corosync/issues/750
> +
> +Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> +---
> + init/corosync.service.in | 2 ++

An option that might require less maintenance than carrying a patch
against the upstream unit file would be to ship a systemd drop-in
override, e.g. at
/lib/systemd/system/corosync.service.d/set-log-rate-limit.conf with
contents:

```
[Service]
LogRateLimitIntervalSec=1s
LogRateLimitBurst=200
```

No strong feelings, it is just a matter of taste.
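
For trying the drop-in variant on a test node, a minimal sketch could
look like the following. The helper name `write_dropin` and the idea of
passing the target root as an argument are only illustrative
assumptions, not something the packaging does today; the file name and
contents are the ones suggested above.

```shell
# Hypothetical helper: write the suggested drop-in below a given unit
# directory (e.g. /etc/systemd/system for a local test; a package would
# ship it below /lib/systemd/system instead).
write_dropin() {
    root="$1"
    dir="$root/corosync.service.d"
    mkdir -p "$dir" || return 1
    # Same settings as in the patch, just moved into a drop-in file.
    cat > "$dir/set-log-rate-limit.conf" <<'EOF'
[Service]
LogRateLimitIntervalSec=1s
LogRateLimitBurst=200
EOF
}
```

After `write_dropin /etc/systemd/system` and a `systemctl daemon-reload`,
`systemctl cat corosync.service` should list the drop-in, and
`systemctl show corosync.service -p LogRateLimitBurst` should report the
new burst value.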

> + 1 file changed, 2 insertions(+)
> +
> +diff --git a/init/corosync.service.in b/init/corosync.service.in
> +index bd2a48a9..3d7ea2db 100644
> +--- a/init/corosync.service.in
> ++++ b/init/corosync.service.in
> +@@ -10,6 +10,8 @@ EnvironmentFile=-@INITCONFIGDIR@/corosync
> + ExecStart=@SBINDIR@/corosync -f $COROSYNC_OPTIONS
> + ExecStop=@SBINDIR@/corosync-cfgtool -H --force
> + Type=notify
> ++LogRateLimitIntervalSec=1s
> ++LogRateLimitBurst=200

200 messages per second might be a bit too many. Since we are not sure
how many messages an unlucky user might see, I would suggest lowering
it a bit for the time being; 100 is a good round number.
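
To put numbers on this, here is a rough sketch of the raw log volume
per day at different limits. It reuses the approximation of 100 bytes
per message from the patch description; the function name `gib_per_day`
is made up for illustration.

```shell
# Estimate raw log volume in GiB/day for a given message rate.
# Assumes ~100 bytes per message, as approximated in the patch
# description; real messages may be shorter or longer.
gib_per_day() {
    awk -v rate="$1" -v bytes=100 \
        'BEGIN { printf "%.2f\n", rate * bytes * 86400 / (1024 * 1024 * 1024) }'
}

for rate in 1000 200 100; do
    echo "$rate msg/s -> $(gib_per_day "$rate") GiB/day"
done
```

With a limit of 100 messages per second this comes out at roughly
0.8 GiB/day in the worst case, about half of what the 200 limit allows.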

> + 
> + # In typical systemd deployments, both standard outputs are forwarded to
> + # journal (stderr is what's relevant in the pristine corosync configuration),
> +-- 
> +2.39.5
> +
> diff --git a/debian/patches/series b/debian/patches/series
> index 147e793..7a796c4 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -1,3 +1,4 @@
>  0001-Enable-PrivateTmp-in-the-systemd-service-files.patch
>  0002-only-start-corosync.service-if-conf-exists.patch
>  0003-totemsrp-Check-size-of-orf_token-msg.patch
> +0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
