From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH v2 storage 06/13] multipath: broadcast per-node map health to the cluster KV store
Date: Fri, 3 Jul 2026 14:46:06 +0200
Message-ID: <20260703124707.1172980-8-t.lamprecht@proxmox.com>
In-Reply-To: <20260703124707.1172980-2-t.lamprecht@proxmox.com>
Map health is inherently per-node: each node has its own paths to the
same LUN, so whether a LUN has full path redundancy can only be
determined per node. To make a cluster-wide view possible, reduce the
local maps to a small per-WWID summary and push it under the cluster KV
key 'multipath' via pmxcfs.

A present value also signals that the node is actively multipathing:
clear the key when no maps are assembled, so the status aggregation can
combine just the active nodes without extra bookkeeping. The summary
stays well under the 32 KiB KV limit for typical setups (about 130
bytes per map, so roughly 250 maps fit); the full per-path detail stays
behind the per-node disks/multipath API.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
Changes in v2:
- timestamp the broadcast payload and skip rebroadcasting unchanged
  state, refreshing every 60s instead of every status cycle
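
The consumer side of the timestamp is not part of this patch. As a rough
illustration only, a reader of the node KV values might apply the staleness
rule as sketched below. The helper name classify_payload and the labels
'absent', 'stale' and 'current' are hypothetical, and treating an undecodable
value as stale is an assumption of this sketch, not something the series
defines.

```perl
use strict;
use warnings;

use JSON::PP;

# Mirrors the constants from this patch; a real consumer would use the ones
# exported by PVE::Multipath instead of redefining them.
my $KV_REFRESH_SECONDS = 60;
my $KV_STALE_SECONDS = 3 * $KV_REFRESH_SECONDS;

# Hypothetical helper: classify one node's raw KV value as seen at time $now.
#   'absent'  - key cleared, the node is not actively multipathing
#   'stale'   - payload timestamp older than $KV_STALE_SECONDS (or unreadable,
#               which is an assumption of this sketch)
#   'current' - payload can be taken as the node's present health
sub classify_payload {
    my ($raw, $now) = @_;

    return 'absent' if !defined($raw) || $raw eq '';

    my $payload = eval { JSON::PP->new->decode($raw) };
    return 'stale' if $@ || ref($payload) ne 'HASH' || !defined($payload->{time});

    return $now - $payload->{time} > $KV_STALE_SECONDS ? 'stale' : 'current';
}

my $now = time();
my $json = JSON::PP->new->canonical;
my $fresh = $json->encode({ maps => {}, time => $now - 30 });
my $old = $json->encode({ maps => {}, time => $now - 600 });

print classify_payload($fresh, $now), "\n"; # current
print classify_payload($old, $now), "\n"; # stale
print classify_payload(undef, $now), "\n"; # absent
```

With a 60s refresh and a 180s stale threshold, a reporter can miss two
consecutive refreshes before its last snapshot gets demoted.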
 src/PVE/Multipath.pm            | 86 +++++++++++++++++++++++++++++++++
 src/test/run_multipath_tests.pl | 50 +++++++++++++++++++
 2 files changed, 136 insertions(+)

diff --git a/src/PVE/Multipath.pm b/src/PVE/Multipath.pm
index 64118b3..ec87b08 100644
--- a/src/PVE/Multipath.pm
+++ b/src/PVE/Multipath.pm
@@ -392,4 +392,90 @@ sub reconfigure {
     run_command([$MULTIPATHD, 'reconfigure']);
 }
 
+# Pure: reduce the rich get_maps() output to the compact per-WWID dict broadcast under the cluster
+# KV key 'multipath'. Holds only what the cluster-wide health matrix needs; the full per-path detail
+# stays available behind the per-node disks/multipath API.
+sub summarize_maps_for_broadcast {
+    my ($maps) = @_;
+
+    my $out = {};
+    for my $map ($maps->@*) {
+        next if !defined($map->{wwid});
+        $out->{ $map->{wwid} } = {
+            state => $map->{health},
+            'paths-active' => $map->{'paths-active'} // 0,
+            'paths-total' => $map->{'paths-total'} // 0,
+            defined($map->{transport}) ? (transport => $map->{transport}) : (),
+            defined($map->{size}) ? (size => $map->{size}) : (),
+        };
+    }
+    return $out;
+}
+
+# Rebroadcast protocol for the node KV values: an unchanged value is refreshed every
+# $KV_REFRESH_SECONDS instead of on every status cycle (each KV write is a cluster-wide corosync
+# message), and consumers treat a payload timestamp older than $KV_STALE_SECONDS as coming from a
+# reporter that stopped updating, so a stalled status daemon's last snapshot does not read as
+# current health. Relies on the cluster-wide clock sync corosync needs anyway.
+our $KV_REFRESH_SECONDS = 60;
+our $KV_STALE_SECONDS = 3 * $KV_REFRESH_SECONDS;
+
+# Stamp $data (a hashref, or undef to clear) with the broadcast time and push it into the cluster
+# KV store under $key. Skips the write while the content is unchanged and fresh, or when the key is
+# already cleared; a write that dies (an oversized value, for example) is not recorded and is
+# retried on the next call. Transport hiccups get absorbed by pve-cluster itself, a lost refresh
+# repeats within $KV_REFRESH_SECONDS and consumers demote older values either way.
+my $last_kv = {};
+
+my sub update_node_kv {
+    my ($key, $data) = @_;
+
+    require PVE::Cluster;
+
+    my $canonical = JSON->new->canonical;
+    my $content = defined($data) ? $canonical->encode($data) : undef;
+    my $now = time();
+    if (my $last = $last_kv->{$key}) {
+        if (defined($content) && defined($last->{content})) {
+            return if $content eq $last->{content} && $now - $last->{time} < $KV_REFRESH_SECONDS;
+        } elsif (!defined($content) && !defined($last->{content})) {
+            return;
+        }
+    }
+
+    my $value = defined($content) ? $canonical->encode({ $data->%*, time => $now }) : undef;
+    eval { PVE::Cluster::broadcast_node_kv($key, $value) };
+    if (my $err = $@) {
+        warn "multipath: broadcasting '$key' failed - $err";
+        return;
+    }
+    $last_kv->{$key} = { content => $content, time => $now };
+}
+
+# Push a compact per-WWID health snapshot into the cluster KV store under the key 'multipath'. A
+# present value also means "this node is actively multipathing", so clear the key when no maps are
+# assembled and the status aggregation then only combines the active nodes. Never throws, so it is
+# safe to call from a status loop where multipath is not the primary concern.
+sub broadcast_health {
+    if (!is_running()) {
+        update_node_kv('multipath', undef);
+        return;
+    }
+
+    my $maps = eval { get_maps() };
+    if (my $err = $@) {
+        # keep the last value: it ages past $KV_STALE_SECONDS and consumers demote it, which beats
+        # clearing (that would read as "no maps assembled" and flag the LUNs as missing)
+        warn "multipath: collecting maps for broadcast failed - $err";
+        return;
+    }
+
+    my $summary = summarize_maps_for_broadcast($maps);
+    if (!%$summary) {
+        update_node_kv('multipath', undef);
+        return;
+    }
+    update_node_kv('multipath', { maps => $summary });
+}
+
 1;
diff --git a/src/test/run_multipath_tests.pl b/src/test/run_multipath_tests.pl
index 09b6061..a4ad57d 100755
--- a/src/test/run_multipath_tests.pl
+++ b/src/test/run_multipath_tests.pl
@@ -357,4 +357,54 @@ is(
     'the overrides writer trims trailing whitespace',
 );
 
+# --- broadcast summary (per-WWID condensation of get_maps for the cluster KV) ---
+my $summary = PVE::Multipath::summarize_maps_for_broadcast($maps);
+is_deeply(
+    [sort keys %$summary],
+    [sort map { $_->{wwid} } $maps->@*],
+    'every map with a WWID appears in the summary',
+);
+is($summary->{ $a->{wwid} }->{state}, 'optimal', 'optimal map summarized as optimal');
+is($summary->{ $a->{wwid} }->{'paths-active'}, 2, 'optimal map active path count carried');
+is($summary->{ $a->{wwid} }->{'paths-total'}, 2, 'optimal map total path count carried');
+is($summary->{ $b->{wwid} }->{state}, 'degraded', 'degraded map summarized as degraded');
+is($summary->{ $c->{wwid} }->{state}, 'failed', 'failed map summarized as failed');
+ok(
+    !exists $summary->{ $a->{wwid} }->{transport},
+    'transport omitted when not derived (get_maps fills it live)',
+);
+
+is_deeply(
+    PVE::Multipath::summarize_maps_for_broadcast([]),
+    {},
+    'empty maps list summarizes to empty hash (caller clears the KV)',
+);
+
+# transport/size propagate when the caller (get_maps) has set them
+my $enriched = [{
+    wwid => '3600x',
+    health => 'optimal',
+    'paths-active' => 2,
+    'paths-total' => 2,
+    transport => 'iscsi',
+    size => 34359738368,
+}];
+my $enr = PVE::Multipath::summarize_maps_for_broadcast($enriched);
+is($enr->{'3600x'}->{transport}, 'iscsi', 'transport carried into the summary');
+is($enr->{'3600x'}->{size}, 34359738368, 'size carried into the summary');
+
+# size budget: well under the 32 KiB pmxcfs KV limit even for many maps
+my $many = [
+    map { {
+        wwid => sprintf('3600140500000000000000000000%04x', $_),
+        health => 'optimal',
+        'paths-active' => 4,
+        'paths-total' => 4,
+        transport => 'iscsi',
+        size => 1099511627776,
+    } } 0 .. 99
+];
+my $big = JSON::encode_json(PVE::Multipath::summarize_maps_for_broadcast($many));
+ok(length($big) < 32 * 1024, "100-map summary (" . length($big) . " B) fits the KV size limit");
+
 done_testing();
--
2.47.3