public inbox for pve-devel@lists.proxmox.com
* [PATCH manager] fix #4911: external metric: better handle failed connections
@ 2026-02-18 17:44 Lukas Sichert
  2026-02-18 18:23 ` Thomas Lamprecht
  0 siblings, 1 reply; 2+ messages in thread
From: Lukas Sichert @ 2026-02-18 17:44 UTC (permalink / raw)
  To: pve-devel; +Cc: Lukas Sichert

When an external metric server configured to use TCP is unreachable, the
storage and VM indicators of the cluster nodes in the UI turn gray. This
is because, currently, a failed connection attempt raises an unhandled
exception, which aborts the status update flow. As the connection
attempts happen at the beginning of the update process, status
information is then not broadcast within the system or across the
cluster. After five minutes without updates, the frontend marks the
indicators as gray.

To catch connection errors, wrap connection establishment in an eval
block. The implementation ensures that other connections to external
metric servers are still established, even if one fails.

Signed-off-by: Lukas Sichert <l.sichert@proxmox.com>
---

Notes:
    Although this commit correctly catches the errors, it produces a lot of
    warnings in the system log. Every 10 seconds, 4 connection attempts
    (LXC, QEMU, Node and Storage) are made and therefore also 4 warnings are
    printed to the system log.
    
    Better implementations might be:
    	1. Establish connection once and then reuse it for all 4 data package
    	   types.
    	   This might be the most elegant solution, but if the connection is
    	   lost in between the messages, it could lead to other errors.
    	2. Use a Perl hash that is propagated through the function call
    	   layers. Check for uniqueness of the error, and if unique,
    	   add the error message to the hash. Then, at the end, print the
    	   messages in the hash to the system log.
    	   This solution is less elegant, but it is robust and does not
    	   require bigger changes in the code.
    	3. Use a timer and only print a warning if sufficient time has
    	   elapsed since the last print.
    	   This option might be slightly more elegant than option 2, however
    	   it could lead to nondeterministic behavior, because it depends on
    	   which function the timer runs out in.
    
    If anyone has any suggestions regarding this, please let me know.
    Thanks in advance!

 PVE/ExtMetric.pm | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/PVE/ExtMetric.pm b/PVE/ExtMetric.pm
index ebc2817b..bb36930e 100644
--- a/PVE/ExtMetric.pm
+++ b/PVE/ExtMetric.pm
@@ -52,17 +52,21 @@ sub transactions_start {
         $cfg,
         sub {
             my ($plugin, $id, $plugin_config) = @_;
-
-            my $connection = $plugin->_connect($plugin_config, $id);
-
-            push @$transactions,
-                {
-                    connection => $connection,
-                    cfg => $plugin_config,
-                    id => $id,
-                    data => '',
-                };
-        },
+            eval {
+              my $connection = $plugin->_connect($plugin_config, $id);
+
+              push @$transactions,
+                  {
+                      connection => $connection,
+                      cfg => $plugin_config,
+                      id => $id,
+                      data => '',
+                  };
+            };
+            if (my $err = $@) {
+                syslog( "warning", "connection for plugin '$id' failed: $err");
+            }    
+          },
     );
 
     return $transactions;
-- 
2.47.3





* Re: [PATCH manager] fix #4911: external metric: better handle failed connections
  2026-02-18 17:44 [PATCH manager] fix #4911: external metric: better handle failed connections Lukas Sichert
@ 2026-02-18 18:23 ` Thomas Lamprecht
  0 siblings, 0 replies; 2+ messages in thread
From: Thomas Lamprecht @ 2026-02-18 18:23 UTC (permalink / raw)
  To: Lukas Sichert, pve-devel

On 18.02.26 at 18:44, Lukas Sichert wrote:
> When an external metric server configured to use TCP is unreachable, the
> storage and VM indicators of the cluster nodes in the UI turn gray. This
> is because, currently, a failed connection attempt raises an unhandled
> exception, which aborts the status update flow. As the connection
> attempts happen at the beginning of the update process, status
> information is then not broadcast within the system or across the
> cluster. After five minutes without updates, the frontend marks the
> indicators as gray.
> 
> To catch connection errors, wrap connection establishment in an eval
> block. The implementation ensures that other connections to external
> metric servers are still established, even if one fails.
> 
> Signed-off-by: Lukas Sichert <l.sichert@proxmox.com>
> ---
> 
> Notes:
>     Although this commit correctly catches the errors, it produces a lot of
>     warnings in the system log. Every 10 seconds, 4 connection attempts
>     (LXC, QEMU, Node and Storage) are made and therefore also 4 warnings are
>     printed to the system log.
>     
>     Better implementations might be:
>     	1. Establish connection once and then reuse it for all 4 data package
>     	   types.
>     	   This might be the most elegant solution, but if the connection is
>     	   lost in between the messages, it could lead to other errors.

That would require sharing some more state, which might make the code slightly
uglier, but one would have to try it to see whether that is actually the case.

And yes, trying to reconnect at least once in that case might be warranted.
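
To make the "shared connection plus one reconnect" idea concrete, here is a minimal sketch. Everything in it is hypothetical: send_with_reconnect, the $state hash and the $connect/$send code refs are illustration-only names and not part of PVE::ExtMetric, and the fake connection simply breaks after its first message.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PVE::ExtMetric. $connect and $send are
# caller-supplied code refs; $state is a hash ref holding the shared connection.
sub send_with_reconnect {
    my ($state, $connect, $send, $data) = @_;

    # Open the shared connection lazily on first use.
    $state->{connection} //= $connect->();

    # First attempt on the (possibly stale) shared connection.
    return 1 if eval { $send->($state->{connection}, $data); 1 };

    # Reconnect exactly once; a second failure propagates to the caller.
    $state->{connection} = $connect->();
    $send->($state->{connection}, $data);
    return 1;
}

# Demo: the first connection breaks after one message has been sent.
my $connects = 0;
my $connect = sub { return { id => ++$connects, sent => 0 } };
my $send = sub {
    my ($conn, $data) = @_;
    die "broken pipe\n" if $conn->{id} == 1 && $conn->{sent} >= 1;
    $conn->{sent}++;
};

my $state = {};
send_with_reconnect($state, $connect, $send, $_) for qw(lxc qemu node storage);
print "connections opened: $connects\n";   # prints "connections opened: 2"
```

The extra state is limited to the one hash ref that has to be passed along, which is roughly the "slightly uglier" cost mentioned above.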

>     	2. Use a Perl hash that is propagated through the function call
>     	   layers. Check for uniqueness of the error, and if unique,
>     	   add the error message to the hash. Then, at the end, print the
>     	   messages in the hash to the system log.
>     	   This solution is less elegant, but it is robust and does not
>     	   require bigger changes in the code.
>     	3. Use a timer and only print a warning if sufficient time has
>     	   elapsed since the last print.
>     	   This option might be slightly more elegant than option 2, however
>     	   it could lead to nondeterministic behavior, because it depends on
>     	   which function the timer runs out in.
>     
>     If anyone has any suggestions regarding this, please let me know.
>     Thanks in advance!

The systemd journal compresses well, so in terms of size, or rather log
retention, the frequent logging won't really matter that much. I would maybe
ignore that completely.
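
For comparison, if the log volume did turn out to matter, option 3 could look roughly like the following. This is only a sketch under assumptions: should_warn and %last_warned are made-up names, the 60 second interval is arbitrary, and it presumes the calling process is long-lived so that the hash survives between update cycles.

```perl
use strict;
use warnings;

# Hypothetical state, not part of PVE::ExtMetric: remembers when a warning
# for a given plugin id was last emitted.
my %last_warned;

# Returns 1 if a warning for $id should be logged now, 0 if one was already
# logged less than $interval seconds ago. $now can be passed in for testing.
sub should_warn {
    my ($id, $interval, $now) = @_;
    $now //= time();

    my $last = $last_warned{$id};
    return 0 if defined($last) && ($now - $last) < $interval;

    $last_warned{$id} = $now;
    return 1;
}

# Four failed attempts in the same update cycle yield a single warning.
for my $type (qw(lxc qemu node storage)) {
    warn "connection for plugin 'influx' failed\n"
        if should_warn('influx', 60, 1000);
}
```

Keying the timestamp by plugin id means it does not matter which of the four calls hits the expired timer; whichever comes first logs, and the rest stay quiet.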

> 
>  PVE/ExtMetric.pm | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/PVE/ExtMetric.pm b/PVE/ExtMetric.pm
> index ebc2817b..bb36930e 100644
> --- a/PVE/ExtMetric.pm
> +++ b/PVE/ExtMetric.pm
> @@ -52,17 +52,21 @@ sub transactions_start {
>          $cfg,
>          sub {
>              my ($plugin, $id, $plugin_config) = @_;
> -
> -            my $connection = $plugin->_connect($plugin_config, $id);
> -
> -            push @$transactions,
> -                {
> -                    connection => $connection,
> -                    cfg => $plugin_config,
> -                    id => $id,
> -                    data => '',
> -                };
> -        },
> +            eval {
> +              my $connection = $plugin->_connect($plugin_config, $id);

You only care about catching the connection error here; the push below cannot
really cause one, so this might be nicer written as:

my $connection = eval { $plugin->_connect($plugin_config, $id) };

if (my $err = $@) {
    syslog("warning", "connection for plugin '$id' failed: $err");
    return;
}

And keep the push below as is.
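
The narrow-eval variant can be tried in isolation with a small stand-alone script. Note the assumptions: fake_connect is a stand-in for $plugin->_connect, a plain loop with `next` replaces the plugin callback with `return`, and `warn` replaces syslog so that the script has no dependencies.

```perl
use strict;
use warnings;

# Stand-in for $plugin->_connect(): dies for the unreachable server.
sub fake_connect {
    my ($id) = @_;
    die "connect to '$id' timed out\n" if $id eq 'down';
    return "socket-for-$id";
}

# Simplified transactions_start: one failing plugin is skipped with a
# warning, all others still get a transaction entry.
sub transactions_start {
    my (@ids) = @_;
    my $transactions = [];

    for my $id (@ids) {
        # Only the connection attempt is guarded by eval.
        my $connection = eval { fake_connect($id) };

        if (my $err = $@) {
            warn "connection for plugin '$id' failed: $err";
            next;
        }

        push @$transactions, { connection => $connection, id => $id, data => '' };
    }

    return $transactions;
}

my $transactions = transactions_start('influx', 'down', 'graphite');
print join(',', map { $_->{id} } @$transactions), "\n";   # prints "influx,graphite"
```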

Another variant might be to catch these at a higher, more central level in the
statd code path. But I'm not sure that's really better; it has been a while
since I looked at all the parts involved here.

> +
> +              push @$transactions,
> +                  {
> +                      connection => $connection,
> +                      cfg => $plugin_config,
> +                      id => $id,
> +                      data => '',
> +                  };
> +            };
> +            if (my $err = $@) {
> +                syslog( "warning", "connection for plugin '$id' failed: $err");
> +            }    
> +          },
>      );
>  
>      return $transactions;




