From: Maximiliano Sandoval <m.sandoval@proxmox.com>
To: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Date: Fri, 04 Apr 2025 10:14:17 +0200
Subject: Re: [pve-devel] [PATCH corosync] corosync.service: add patch to reduce log spam in broken network setups
In-reply-to: <20250404075957.80057-1-f.weber@proxmox.com>
Message-ID: <s8oa58w72hl.fsf@proxmox.com>

Friedrich Weber <f.weber@proxmox.com> writes:

> Since c761053 ("Check packets come from the correct interface
> https://github.com/corosync/corosync/issues/750") in kronosnet,
> corosync will produce log messages in certain broken network setups.
> See inner patch for details. Drawing attention to such setups is
> desirable because such setups may experience whole-cluster fences if
> the watchdog is active, see [1].
>
> However, the log volume in such broken setups can be inconveniently
> high. In such a setup, when running the following on node 1:
>
>     # for i in $(seq 100); do dd if=/dev/urandom bs=1M of=/etc/pve/test.bin count=1; done
>
> On node 2, bursts of ~1300 messages per second are observed:
>
>     # journalctl --since="1min ago" -u corosync.service \
>         | cut -d' ' -f 1-3 | uniq -c | sort -n | tail -n 10
>        8 Apr 04 09:51:20
>        8 Apr 04 09:51:24
>        8 Apr 04 09:51:30
>        8 Apr 04 09:51:34
>        8 Apr 04 09:51:40
>       12 Apr 04 09:51:00
>      196 Apr 04 09:51:46
>     1283 Apr 04 09:51:44
>     1329 Apr 04 09:51:43
>     1370 Apr 04 09:51:45
>
> To avoid cluttering the journal, rate-limit log messages to 200 per
> second. See inner patch for details.
>
> [1] https://github.com/corosync/corosync/issues/750
>
> Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> ---
>
> Notes:
>     I'm a little confused about the rate limit, as with this patch I do
>     see that systemd suppresses messages:
>
>     Apr 04 09:52:54 coro3 systemd-journald[303]: Suppressed 196 messages from corosync.service
>
>     but I still see way more than 200 messages per second:
>
>       11 Apr 04 09:52:00
>       11 Apr 04 09:52:45
>       13 Apr 04 09:52:13
>       14 Apr 04 09:52:12
>       19 Apr 04 09:52:07
>       67 Apr 04 09:52:08
>      400 Apr 04 09:52:54
>      695 Apr 04 09:52:52
>      715 Apr 04 09:52:53
>      835 Apr 04 09:52:51
>
>     Any idea why?
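
To compare runs against the configured limit, the counting pipeline
above could be wrapped in a small helper that marks the seconds exceeding
it. This is only a minimal sketch: `count_per_second` is a name made up
for this example, it assumes journalctl's default short output format
with a two-digit day of month, and the default of 200 simply mirrors
LogRateLimitBurst from the patch.

```shell
# Count journal lines per second and mark seconds above a limit.
# Reads journalctl short-format output on stdin.
# $1: limit in messages per second (hypothetical default: 200)
count_per_second() {
    limit="${1:-200}"
    # Fields 1-3 are month, day and HH:MM:SS; uniq -c counts lines per second.
    cut -d' ' -f 1-3 | uniq -c | awk -v limit="$limit" \
        '{ flag = ($1 > limit) ? " over limit" : ""; print $1, $2, $3, $4 flag }'
}

# Assumed invocation on an affected node:
#   journalctl --since="1min ago" -u corosync.service | count_per_second 200
```

The output stays in chronological order, so bursts can be lined up with
the "Suppressed N messages" notices from systemd-journald.
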
>
>  ...-rate-limit-log-messages-to-200-per-.patch | 54 +++++++++++++++++++
>  debian/patches/series                         |  1 +
>  2 files changed, 55 insertions(+)
>  create mode 100644 debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
>
> diff --git a/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch b/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
> new file mode 100644
> index 0000000..0f91b42
> --- /dev/null
> +++ b/debian/patches/0004-corosync.service-rate-limit-log-messages-to-200-per-.patch
> @@ -0,0 +1,54 @@
> +From 5470f01296a3bd8f47fd4bd97939b3a68f00d309 Mon Sep 17 00:00:00 2001
> +From: Friedrich Weber <f.weber@proxmox.com>
> +Date: Fri, 4 Apr 2025 09:14:21 +0200
> +Subject: [PATCH] corosync.service: rate limit log messages to 200 per second
> +
> +Since c761053 ("Check packets come from the correct interface
> +https://github.com/corosync/corosync/issues/750") in kronosnet,
> +corosync will log a message like the following every time a packet is
> +received at the wrong interface, i.e., not the interface on which the
> +corresponding IP is configured:
> +
> +> [KNET ] udp: Received packet from 10.8.1.1 to 10.8.1.3 on i/f ens20 when expected ens19
> +
> +This is to draw attention to broken network setups that
> +appear to work fine as long as all corosync links are online, but once
> +a link goes down, may go into a state of "asymmetric connectivity"
> +which is problematic for corosync. See [1] for more details.
> +
> +While it is desirable to draw attention to broken setups, the volume
> +of log messages in such clusters can get very high and clutter the
> +journal. In extreme scenarios, occasional bursts of more than 1000
> +messages per second were observed. If we approximate each message with
> +100 bytes, logging 1000 messages per second will produce ~8 GiB of raw
> +logs per day. While this should be a worst case scenario and the
> +logs probably compress well, the volume is still inconveniently high.
> +
> +Hence, use systemd log rate limiting to limit corosync log messages to
> +200 per second, which brings the logs in above scenario down to 1.6
> +GiB/day and should still provide enough headroom to avoid suppressing
> +benign log messages in non-broken setups.
> +
> +[1] https://github.com/corosync/corosync/issues/750
> +
> +Signed-off-by: Friedrich Weber <f.weber@proxmox.com>
> +---
> + init/corosync.service.in | 2 ++

An option that might require less maintenance would be to ship a
service file override, e.g. at
/lib/systemd/system/corosync.service.d/set-log-rate-limit.conf with
contents:

```
[Service]
LogRateLimitIntervalSec=1s
LogRateLimitBurst=200
```

No strong feelings, it is just a matter of taste.

> + 1 file changed, 2 insertions(+)
> +
> +diff --git a/init/corosync.service.in b/init/corosync.service.in
> +index bd2a48a9..3d7ea2db 100644
> +--- a/init/corosync.service.in
> ++++ b/init/corosync.service.in
> +@@ -10,6 +10,8 @@ EnvironmentFile=-@INITCONFIGDIR@/corosync
> + ExecStart=@SBINDIR@/corosync -f $COROSYNC_OPTIONS
> + ExecStop=@SBINDIR@/corosync-cfgtool -H --force
> + Type=notify
> ++LogRateLimitIntervalSec=1s
> ++LogRateLimitBurst=200

200 messages per second might be a bit too many. Since we are not sure
how many messages an unlucky user might see, I would suggest lowering it
a bit for the time being; 100 is a good round number.
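
For a rough sense of what the two limits mean in terms of volume, the
estimate from the commit message can be redone for 100 messages per
second. This is only a sketch under the same assumption of about 100
bytes per message; `gib_per_day` is a helper name made up for this
example.

```shell
# Estimate raw log volume per day for a given message rate.
# Assumption (taken from the quoted commit message): ~100 bytes per message.
bytes_per_msg=100

# $1: messages per second; prints GiB per day with two decimals.
gib_per_day() {
    awk -v rate="$1" -v size="$bytes_per_msg" \
        'BEGIN { printf "%.2f\n", rate * size * 86400 / (1024 ^ 3) }'
}

# Unlimited worst case, the proposed limit, and the lower suggestion.
for rate in 1000 200 100; do
    printf '%4d msg/s -> %s GiB/day\n' "$rate" "$(gib_per_day "$rate")"
done
```

Under that assumption, 100 messages per second comes out at roughly
0.8 GiB/day, compared with about 1.6 GiB/day at 200.
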
> +
> + # In typical systemd deployments, both standard outputs are forwarded to
> + # journal (stderr is what's relevant in the pristine corosync configuration),
> +--
> +2.39.5
> +
> diff --git a/debian/patches/series b/debian/patches/series
> index 147e793..7a796c4 100644
> --- a/debian/patches/series
> +++ b/debian/patches/series
> @@ -1,3 +1,4 @@
>  0001-Enable-PrivateTmp-in-the-systemd-service-files.patch
>  0002-only-start-corosync.service-if-conf-exists.patch
>  0003-totemsrp-Check-size-of-orf_token-msg.patch
> +0004-corosync.service-rate-limit-log-messages-to-200-per-.patch