* [pve-devel] [PATCH ha-manager 0/3] watchdog: sync log to disk before and after expiring
@ 2025-05-19 13:09 Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 1/3] watchdog: separate if in two parts Maximiliano Sandoval
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Maximiliano Sandoval @ 2025-05-19 13:09 UTC (permalink / raw)
To: pve-devel
It is very hard to provide a definitive answer to whether a host fenced or not.
In some cases the journal on the disk can be missing up to 2 minutes since its
last logged entry and the time where another node detects the corosync link is
down, with such a gap, the fenced node would not even record that it lost
conenction and it is not possible to fully-determine if the node was fenced or
not.
This series:
- adds a second warning 10 seconds before the watchdog expires
- syncs the journal to disk after the warning was issued
- syncs the journal to disk after the watchdog expires
The variable names in the second commit could use some feedback. The way the
warning timeout is defined was arbitrary (10 seconds before the fence).
Maximiliano Sandoval (3):
watchdog: separate if in two parts
watchdog: warn when about to expire
watchdog: sync journal after sending expiration related messages
src/watchdog-mux.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
--
2.39.5
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
* [pve-devel] [PATCH ha-manager 1/3] watchdog: separate if in two parts
2025-05-19 13:09 [pve-devel] [PATCH ha-manager 0/3] watchdog: sync log to disk before and after expiring Maximiliano Sandoval
@ 2025-05-19 13:09 ` Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 2/3] watchdog: warn when about to expire Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 3/3] watchdog: sync journal after sending expiration related messages Maximiliano Sandoval
2 siblings, 0 replies; 4+ messages in thread
From: Maximiliano Sandoval @ 2025-05-19 13:09 UTC (permalink / raw)
To: pve-devel
The sole purpose of this commit is to make the following commit's diff
easier to read.
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index b4bcc0c..a9017b3 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -243,13 +243,11 @@ main(void)
int i;
time_t ctime = time(NULL);
for (i = 0; i < MAX_CLIENTS; i++) {
- if (
- client_list[i].fd != 0
- && client_list[i].time != 0
- && ((ctime - client_list[i].time) > client_watchdog_timeout)
- ) {
- update_watchdog = 0;
- fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ if (client_list[i].fd != 0 && client_list[i].time != 0) {
+ if ((ctime - client_list[i].time) > client_watchdog_timeout) {
+ update_watchdog = 0;
+ fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ }
}
}
}
--
2.39.5
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
* [pve-devel] [PATCH ha-manager 2/3] watchdog: warn when about to expire
2025-05-19 13:09 [pve-devel] [PATCH ha-manager 0/3] watchdog: sync log to disk before and after expiring Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 1/3] watchdog: separate if in two parts Maximiliano Sandoval
@ 2025-05-19 13:09 ` Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 3/3] watchdog: sync journal after sending expiration related messages Maximiliano Sandoval
2 siblings, 0 replies; 4+ messages in thread
From: Maximiliano Sandoval @ 2025-05-19 13:09 UTC (permalink / raw)
To: pve-devel
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index a9017b3..e14c768 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -29,15 +29,24 @@
#define JOURNALCTL_BIN "/bin/journalctl"
+#define CLIENT_WATCHDOG_TIMEOUT_WARNING 50
+
int watchdog_fd = -1;
int watchdog_timeout = 10;
int client_watchdog_timeout = 60;
int update_watchdog = 1;
+enum warning_state_t {
+ NONE,
+ WARNING_ISSUED,
+ CRISIS_AVERTED,
+};
+
typedef struct {
int fd;
time_t time;
int magic_close;
+ enum warning_state_t warning_state;
} wd_client_t;
#define MAX_CLIENTS 100
@@ -54,6 +63,7 @@ alloc_client(int fd, time_t time)
client_list[i].fd = fd;
client_list[i].time = time;
client_list[i].magic_close = 0;
+ client_list[i].warning_state = NONE;
return &client_list[i];
}
}
@@ -244,6 +254,22 @@ main(void)
time_t ctime = time(NULL);
for (i = 0; i < MAX_CLIENTS; i++) {
if (client_list[i].fd != 0 && client_list[i].time != 0) {
+ if (
+ client_list[i].warning_state == WARNING_ISSUED
+ && (ctime - client_list[i].time) <= CLIENT_WATCHDOG_TIMEOUT_WARNING
+ ) {
+ client_list[i].warning_state = CRISIS_AVERTED;
+ fprintf(stderr, "phew, client watchdog was updated before expiring\n");
+ }
+
+ if (
+ client_list[i].warning_state != WARNING_ISSUED
+ && (ctime - client_list[i].time) > CLIENT_WATCHDOG_TIMEOUT_WARNING
+ ) {
+ client_list[i].warning_state = WARNING_ISSUED;
+ fprintf(stderr, "client watchdog is about to expire\n");
+ }
+
if ((ctime - client_list[i].time) > client_watchdog_timeout) {
update_watchdog = 0;
fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
--
2.39.5
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
* [pve-devel] [PATCH ha-manager 3/3] watchdog: sync journal after sending expiration related messages
2025-05-19 13:09 [pve-devel] [PATCH ha-manager 0/3] watchdog: sync log to disk before and after expiring Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 1/3] watchdog: separate if in two parts Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 2/3] watchdog: warn when about to expire Maximiliano Sandoval
@ 2025-05-19 13:09 ` Maximiliano Sandoval
2 siblings, 0 replies; 4+ messages in thread
From: Maximiliano Sandoval @ 2025-05-19 13:09 UTC (permalink / raw)
To: pve-devel
One sync comes after warning that the watchdog is about to expire, and a
second right after the watchdog expires.
To maximize the chances the log will contain entries relevant to a fence
event. This would be extremely useful for detecting whether a node
fenced.
Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
---
src/watchdog-mux.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index e14c768..8669b10 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -268,11 +268,13 @@ main(void)
) {
client_list[i].warning_state = WARNING_ISSUED;
fprintf(stderr, "client watchdog is about to expire\n");
+ sync_journal_unsafe();
}
if ((ctime - client_list[i].time) > client_watchdog_timeout) {
update_watchdog = 0;
fprintf(stderr, "client watchdog expired - disable watchdog updates\n");
+ sync_journal_unsafe();
}
}
}
--
2.39.5
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-05-19 13:10 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-19 13:09 [pve-devel] [PATCH ha-manager 0/3] watchdog: sync log to disk before and after expiring Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 1/3] watchdog: separate if in two parts Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 2/3] watchdog: warn when about to expire Maximiliano Sandoval
2025-05-19 13:09 ` [pve-devel] [PATCH ha-manager 3/3] watchdog: sync journal after sending expiration related messages Maximiliano Sandoval
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal