From mboxrd@z Thu Jan  1 00:00:00 1970
From: Maximiliano Sandoval <m.sandoval@proxmox.com>
To: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [pve-devel] [PATCH ha-manager 3/3] watchdog: sync journal after sending expiration related messages
Date: Fri, 04 Jul 2025 14:32:52 +0200
Message-ID: <s8osejc2kql.fsf@proxmox.com>
In-reply-to: <199c147b-c7bb-40c4-99ab-3d4d09d51e5c@proxmox.com>
References: <20250519130935.365142-1-m.sandoval@proxmox.com>
 <20250519130935.365142-4-m.sandoval@proxmox.com>
 <199c147b-c7bb-40c4-99ab-3d4d09d51e5c@proxmox.com>
User-agent: mu4e 1.10.8; emacs 30.1
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>

Thomas Lamprecht <t.lamprecht@proxmox.com> writes:

> On 19.05.25 at 15:09, Maximiliano Sandoval wrote:
>> One sync comes after warning that the watchdog is about to expire, and a
>> second right after the watchdog expires.
>>
>> To maximize the chances the log will contain entries relevant to a fence
>> event. This would be extremely useful for detecting whether a node
>> fenced.
>>
>> Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
>> ---
>>  src/watchdog-mux.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
>> index e14c768..8669b10 100644
>> --- a/src/watchdog-mux.c
>> +++ b/src/watchdog-mux.c
>> @@ -268,11 +268,13 @@ main(void)
>>                      ) {
>>                          client_list[i].warning_state = WARNING_ISSUED;
>>                          fprintf(stderr, "client watchdog is about to expire\n");
>> +                        sync_journal_unsafe();
>
> The "unsafe" is there for a reason, on a loaded machine doing above
> might trigger a few times and create a zombie left over process for
> each of those.
>
> Simplest fix might be doing a double fork there so that the parent
> process does not exist anymore, in which case systemd collects the
> child process exit status, albeit that wouldn't be the most efficient
> solution.
>
>>                      }
>>
>>                      if ((ctime - client_list[i].time) > client_watchdog_timeout) {
>>                          update_watchdog = 0;
>>                          fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
>> +                        sync_journal_unsafe();
>
> This is basically useless compared to the status quo, there is already
> such a call a few (compiled) instructions after that branch hits anyway
> as we break the main loop then.

We do not (always) break out of the loop.

```c
for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, 1000);
    if (nfds == -1) {
        ...
    }

    if (nfds == 0) { // timeout
        // check for timeouts
        if (update_watchdog) {
            ...
        }

        if (update_watchdog) {
            ...
        }

        continue; // skips the update_watchdog check below
    }

    if (!update_watchdog) {
        break;
    }
    ...
}
```

If epoll_wait keeps timing out, then nfds is 0 on every pass and we
`continue` before ever reaching the `break`. This is what I observe
locally whenever I test a fence on my cluster by disconnecting all
corosync NICs on a host that runs an HA resource.
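
To make the control flow above concrete, here is a minimal, self-contained
sketch of it. It is not the watchdog-mux code: `run_loop`, `struct
loop_result`, `nfds_script` and `expire_at` are hypothetical names invented
for this illustration, and the scripted array merely stands in for the
values epoll_wait would return on each pass.

```c
#include <stddef.h>

/* Outcome of running the simplified loop over a scripted sequence. */
struct loop_result {
    size_t iterations; /* loop passes executed */
    int broke_out;     /* 1 if the `break` was reached, 0 if the script ran out */
};

/*
 * Hypothetical stand-in for the main loop. nfds_script[i] plays the role
 * of the epoll_wait() return value on pass i. The client watchdog is
 * treated as expired from pass expire_at onward, which clears
 * update_watchdog inside the timeout branch, as in the quoted diff.
 */
static struct loop_result run_loop(const int *nfds_script, size_t len,
                                   size_t expire_at)
{
    struct loop_result res = {0, 0};
    int update_watchdog = 1;

    for (size_t i = 0; i < len; i++) {
        int nfds = nfds_script[i];
        res.iterations++;

        if (nfds == 0) { /* timeout */
            if (update_watchdog && i >= expire_at) {
                update_watchdog = 0; /* client watchdog expired */
            }
            continue; /* the check below is skipped on this pass */
        }

        if (!update_watchdog) {
            res.broke_out = 1;
            break;
        }
    }

    return res;
}
```

With a script made only of timeouts, for example `{0, 0, 0, 0, 0}` and
`expire_at = 1`, the sketch runs all five passes and never reaches the
`break`. With `{0, 0, 1, 0, 0}` it breaks on the third pass, that is, only
once a pass returns a non-zero nfds after the expiry.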