* [pve-devel] [PATCH ha-manager v2 0/5] watchdog: sync log to disk before and after expiring
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
Without a clear-cut message in the log, it is very hard to provide a definitive
answer to whether a host was fenced or not. In some cases the journal on disk
can be missing up to 2 minutes between its last logged entry and the time at
which another node detects that the corosync link is down. With such a gap, the
fenced node would not even record that it lost connection, and it is not
possible to fully determine whether the node was fenced.
This series:
- adds a second warning 10 seconds before the watchdog expires
- syncs the journal to disk after the warning was issued
- syncs the journal to disk after the watchdog expires
Differences from v1:
- Define the warning cutoff based on the 60 second timeout
- Change log messages and constant names
- When not immediately fencing, run journal sync in double fork
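The warning logic described above can be sketched as a small per-client state
machine. This is only an illustration, not the code from the series: the
constants and enum warning_state_t mirror patch 3/5, while check_client() and
enum wd_event are hypothetical names introduced here so the transitions can be
exercised in isolation. Unlike the patch, which may log both the warning and
the expiration in the same pass, the sketch reports only the most severe event.

```c
#include <time.h>

#define CLIENT_WATCHDOG_TIMEOUT 60
#define CLIENT_WATCHDOG_TIMEOUT_WARNING (CLIENT_WATCHDOG_TIMEOUT - 10)

enum warning_state_t {
    NONE,
    WARNING_ISSUED,
    FENCE_AVERTED,
};

/* Hypothetical event type, not part of the series. */
enum wd_event {
    EV_NOTHING,
    EV_AVERTED, /* client updated again after a warning was issued */
    EV_WARNING, /* client crossed the warning threshold */
    EV_EXPIRED, /* client crossed the full timeout */
};

/* Hypothetical helper: evaluate one client and update its warning state. */
static enum wd_event check_client(enum warning_state_t *state, time_t now,
                                  time_t last_update) {
    time_t elapsed = now - last_update;
    enum wd_event ev = EV_NOTHING;

    /* A warned client that is back under the threshold averted the fence. */
    if (*state == WARNING_ISSUED && elapsed <= CLIENT_WATCHDOG_TIMEOUT_WARNING) {
        *state = FENCE_AVERTED;
        ev = EV_AVERTED;
    }

    /* Warn once per crossing of the threshold, not on every pass. */
    if (*state != WARNING_ISSUED && elapsed > CLIENT_WATCHDOG_TIMEOUT_WARNING) {
        *state = WARNING_ISSUED;
        ev = EV_WARNING;
    }

    /* Expiration takes precedence over any warning raised in this pass. */
    if (elapsed > CLIENT_WATCHDOG_TIMEOUT) {
        ev = EV_EXPIRED;
    }

    return ev;
}
```

A caller would invoke check_client() once per client on every timer tick and
sync the journal on EV_WARNING and EV_EXPIRED.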
Maximiliano Sandoval (5):
watchdog-mux: Use #define for 60s timeout
watchdog-mux: split if block in two if blocks
watchdog-mux: warn when about to expire
watchdog-mux: sync journal after logging expiration message
watchdog-mux: sync journal right after fencing warning
src/watchdog-mux.c | 52 +++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 47 insertions(+), 5 deletions(-)
--
2.39.5
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
* [pve-devel] [PATCH ha-manager v2 1/5] watchdog-mux: Use #define for 60s timeout
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
This change allows a second constant to be defined in terms of this
one.
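As a minimal sketch of why the macro helps: a preprocessor constant can feed
another constant expression at compile time, whereas a file-scope initializer
in C cannot refer to a plain int variable. CLIENT_WATCHDOG_TIMEOUT_WARNING is
the name used later in the series; client_expired() is a hypothetical helper
added here only for illustration.

```c
#include <time.h>

/* Seconds a client may go without an update before it counts as expired. */
#define CLIENT_WATCHDOG_TIMEOUT 60

/* Derived threshold, parenthesized so it expands safely inside expressions. */
#define CLIENT_WATCHDOG_TIMEOUT_WARNING (CLIENT_WATCHDOG_TIMEOUT - 10)

/* Both values are constant expressions, so they can be checked at build time. */
_Static_assert(CLIENT_WATCHDOG_TIMEOUT_WARNING > 0,
               "warning threshold must be positive");

/* Hypothetical helper: true once the client has exceeded the full timeout. */
static int client_expired(time_t now, time_t last_update) {
    return (now - last_update) > CLIENT_WATCHDOG_TIMEOUT;
}
```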
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index e324c20..d38116b 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -29,9 +29,10 @@
#define JOURNALCTL_BIN "/bin/journalctl"
+#define CLIENT_WATCHDOG_TIMEOUT 60
+
int watchdog_fd = -1;
int watchdog_timeout = 10;
-int client_watchdog_timeout = 60;
int update_watchdog = 1;
typedef struct {
@@ -234,7 +235,7 @@ int main(void) {
time_t ctime = time(NULL);
for (i = 0; i < MAX_CLIENTS; i++) {
if (client_list[i].fd != 0 && client_list[i].time != 0 &&
- ((ctime - client_list[i].time) > client_watchdog_timeout)) {
+ ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT)) {
update_watchdog = 0;
fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
}
--
2.39.5
* [pve-devel] [PATCH ha-manager v2 2/5] watchdog-mux: split if block in two if blocks
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
The sole purpose of this commit is to make the following commit's diff
easier to read.
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index d38116b..2b8cebf 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -234,10 +234,11 @@ int main(void) {
int i;
time_t ctime = time(NULL);
for (i = 0; i < MAX_CLIENTS; i++) {
- if (client_list[i].fd != 0 && client_list[i].time != 0 &&
- ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT)) {
- update_watchdog = 0;
- fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ if (client_list[i].fd != 0 && client_list[i].time != 0) {
+ if ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT) {
+ update_watchdog = 0;
+ fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ }
}
}
}
--
2.39.5
* [pve-devel] [PATCH ha-manager v2 3/5] watchdog-mux: warn when about to expire
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 2b8cebf..0518e86 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -30,15 +30,23 @@
#define JOURNALCTL_BIN "/bin/journalctl"
#define CLIENT_WATCHDOG_TIMEOUT 60
+#define CLIENT_WATCHDOG_TIMEOUT_WARNING (CLIENT_WATCHDOG_TIMEOUT - 10)
int watchdog_fd = -1;
int watchdog_timeout = 10;
int update_watchdog = 1;
+enum warning_state_t {
+ NONE,
+ WARNING_ISSUED,
+ FENCE_AVERTED,
+};
+
typedef struct {
int fd;
time_t time;
int magic_close;
+ enum warning_state_t warning_state;
} wd_client_t;
#define MAX_CLIENTS 100
@@ -53,6 +61,7 @@ static wd_client_t *alloc_client(int fd, time_t time) {
client_list[i].fd = fd;
client_list[i].time = time;
client_list[i].magic_close = 0;
+ client_list[i].warning_state = NONE;
return &client_list[i];
}
}
@@ -235,6 +244,18 @@ int main(void) {
time_t ctime = time(NULL);
for (i = 0; i < MAX_CLIENTS; i++) {
if (client_list[i].fd != 0 && client_list[i].time != 0) {
+ if (client_list[i].warning_state == WARNING_ISSUED &&
+ (ctime - client_list[i].time) <= CLIENT_WATCHDOG_TIMEOUT_WARNING) {
+ client_list[i].warning_state = FENCE_AVERTED;
+ fprintf(stderr, "client watchdog was updated before expiring\n");
+ }
+
+ if (client_list[i].warning_state != WARNING_ISSUED &&
+ (ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT_WARNING) {
+ client_list[i].warning_state = WARNING_ISSUED;
+ fprintf(stderr, "client watchdog is about to expire\n");
+ }
+
if ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT) {
update_watchdog = 0;
fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
--
2.39.5
* [pve-devel] [PATCH ha-manager v2 4/5] watchdog-mux: sync journal after logging expiration message
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
Sync the journal right after the watchdog expires. This would be extremely
useful for determining whether a node was fenced.
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 0518e86..2287625 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -259,6 +259,7 @@ int main(void) {
if ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT) {
update_watchdog = 0;
fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ sync_journal_unsafe();
}
}
}
--
2.39.5
* [pve-devel] [PATCH ha-manager v2 5/5] watchdog-mux: sync journal right after fencing warning
From: Maximiliano Sandoval @ 2025-06-25 13:23 UTC
To: pve-devel
Since this journal entry can be logged multiple times in the lifespan of
the process, we double fork to prevent accumulating zombie processes.
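For readers unfamiliar with the idiom, a standalone sketch of a double fork
follows. It is not the code from this patch: run_detached() is a hypothetical
helper, and it deviates from the diff by using waitpid() on the specific child
and _exit() in the forked processes, which is an assumption about what one
might prefer, not something this series does. The intermediate child exits
immediately, so the parent can reap it without blocking on the command, and
the grandchild is reparented to init (or a subreaper), which reaps it later.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` with `argv` in a grandchild process.
 * Returns 0 if the intermediate child was forked and reaped cleanly,
 * -1 otherwise. It does not report whether the command itself succeeded.
 */
static int run_detached(const char *path, char *const argv[]) {
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }

    if (child == 0) {
        /* Intermediate child: fork again and exit right away. */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            execv(path, argv);
            _exit(127); /* only reached if execv() failed */
        }
        _exit(grandchild < 0 ? 1 : 0);
    }

    /* Parent: reap the short-lived intermediate child, retrying on EINTR. */
    int status;
    while (waitpid(child, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With an argv of { "journalctl", "--sync", NULL } and "/bin/journalctl" as the
path, this would correspond roughly to the journal sync performed here.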
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 2287625..93f9bd9 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -12,6 +12,7 @@
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/un.h>
+#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
@@ -116,6 +117,22 @@ static void sync_journal_unsafe(void) {
}
}
+// Like sync_journal_unsafe but we double fork so we don't leave trailing zombie
+// processes.
+static void sync_journal_in_fork(void) {
+ pid_t child = fork();
+ if (child == 0) {
+ child = fork();
+ if (child == 0) {
+ execl(JOURNALCTL_BIN, JOURNALCTL_BIN, "--sync", NULL);
+ exit(-1);
+ }
+ exit(0);
+ } else if (child > 0) {
+ wait(NULL);
+ }
+}
+
int main(void) {
struct sockaddr_un my_addr, peer_addr;
socklen_t peer_addr_size;
@@ -254,6 +271,7 @@ int main(void) {
(ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT_WARNING) {
client_list[i].warning_state = WARNING_ISSUED;
fprintf(stderr, "client watchdog is about to expire\n");
+ sync_journal_in_fork();
}
if ((ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT) {
--
2.39.5