From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH v2 storage 06/13] multipath: broadcast per-node map health to the cluster KV store
Date: Fri, 3 Jul 2026 14:46:06 +0200
Message-ID: <20260703124707.1172980-8-t.lamprecht@proxmox.com>
In-Reply-To: <20260703124707.1172980-2-t.lamprecht@proxmox.com>
References: <20260703124707.1172980-2-t.lamprecht@proxmox.com>
List-Id: Proxmox VE development discussion

Map health is inherently per-node: each node has its own paths to the
same LUN, so whether a LUN has full path redundancy can only be told
per node.

To make a cluster-wide view possible, reduce the local maps to a small
per-WWID summary and push it under the cluster KV key 'multipath' via
pmxcfs. A present value also signals that the node is actively
multipathing: clear the key when no maps are assembled, so the status
aggregation can combine just the active nodes without extra
bookkeeping.

The summary stays well under the 32 KiB KV limit for typical setups
(about 130 bytes per map); the full per-path detail stays behind the
per-node disks/multipath API.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---

Changes in v2:
- timestamp the broadcast payload and skip rebroadcasting unchanged
  state, refreshing every 60s instead of every status cycle

 src/PVE/Multipath.pm            | 86 +++++++++++++++++++++++++++++++++
 src/test/run_multipath_tests.pl | 50 +++++++++++++++++++
 2 files changed, 136 insertions(+)

diff --git a/src/PVE/Multipath.pm b/src/PVE/Multipath.pm
index 64118b3..ec87b08 100644
--- a/src/PVE/Multipath.pm
+++ b/src/PVE/Multipath.pm
@@ -392,4 +392,90 @@ sub reconfigure {
     run_command([$MULTIPATHD, 'reconfigure']);
 }
 
+# Pure: reduce the rich get_maps() output to the compact per-WWID dict broadcast under the cluster
+# KV key 'multipath'. Holds only what the cluster-wide health matrix needs; the full per-path detail
+# stays available behind the per-node disks/multipath API.
+sub summarize_maps_for_broadcast {
+    my ($maps) = @_;
+
+    my $out = {};
+    for my $map ($maps->@*) {
+        next if !defined($map->{wwid});
+        $out->{ $map->{wwid} } = {
+            state => $map->{health},
+            'paths-active' => $map->{'paths-active'} // 0,
+            'paths-total' => $map->{'paths-total'} // 0,
+            defined($map->{transport}) ? (transport => $map->{transport}) : (),
+            defined($map->{size}) ? (size => $map->{size}) : (),
+        };
+    }
+    return $out;
+}
+
+# Rebroadcast protocol for the node KV values: an unchanged value is refreshed every
+# $KV_REFRESH_SECONDS instead of on every status cycle (each KV write is a cluster-wide corosync
+# message), and consumers treat a payload timestamp older than $KV_STALE_SECONDS as coming from a
+# reporter that stopped updating, so a stalled status daemon's last snapshot does not read as
+# current health. Relies on the cluster-wide clock sync corosync needs anyway.
+our $KV_REFRESH_SECONDS = 60;
+our $KV_STALE_SECONDS = 3 * $KV_REFRESH_SECONDS;
+
+# Stamp $data (a hashref, or undef to clear) with the broadcast time and push it into the cluster
+# KV store under $key. Skips the write while the content is unchanged and fresh, or when the key is
+# already cleared; a write that dies (an oversized value, for example) is not recorded and is
+# retried on the next call. Transport hiccups get absorbed by pve-cluster itself, a lost refresh
+# repeats within $KV_REFRESH_SECONDS and consumers demote older values either way.
+my $last_kv = {};
+
+my sub update_node_kv {
+    my ($key, $data) = @_;
+
+    require PVE::Cluster;
+
+    my $canonical = JSON->new->canonical;
+    my $content = defined($data) ? $canonical->encode($data) : undef;
+    my $now = time();
+    if (my $last = $last_kv->{$key}) {
+        if (defined($content) && defined($last->{content})) {
+            return if $content eq $last->{content} && $now - $last->{time} < $KV_REFRESH_SECONDS;
+        } elsif (!defined($content) && !defined($last->{content})) {
+            return;
+        }
+    }
+
+    my $value = defined($content) ? $canonical->encode({ $data->%*, time => $now }) : undef;
+    eval { PVE::Cluster::broadcast_node_kv($key, $value) };
+    if (my $err = $@) {
+        warn "multipath: broadcasting '$key' failed - $err";
+        return;
+    }
+    $last_kv->{$key} = { content => $content, time => $now };
+}
+
+# Push a compact per-WWID health snapshot into the cluster KV store under the key 'multipath'. A
+# present value also means "this node is actively multipathing", so clear the key when no maps are
+# assembled and the status aggregation then only combines the active nodes. Never throws, so it is
+# safe to call from a status loop where multipath is not the primary concern.
+sub broadcast_health {
+    if (!is_running()) {
+        update_node_kv('multipath', undef);
+        return;
+    }
+
+    my $maps = eval { get_maps() };
+    if (my $err = $@) {
+        # keep the last value: it ages past $KV_STALE_SECONDS and consumers demote it, which beats
+        # clearing (that would read as "no maps assembled" and flag the LUNs as missing)
+        warn "multipath: collecting maps for broadcast failed - $err";
+        return;
+    }
+
+    my $summary = summarize_maps_for_broadcast($maps);
+    if (!%$summary) {
+        update_node_kv('multipath', undef);
+        return;
+    }
+    update_node_kv('multipath', { maps => $summary });
+}
+
 1;
diff --git a/src/test/run_multipath_tests.pl b/src/test/run_multipath_tests.pl
index 09b6061..a4ad57d 100755
--- a/src/test/run_multipath_tests.pl
+++ b/src/test/run_multipath_tests.pl
@@ -357,4 +357,54 @@ is(
     'the overrides writer trims trailing whitespace',
 );
 
+# --- broadcast summary (per-WWID condensation of get_maps for the cluster KV) ---
+my $summary = PVE::Multipath::summarize_maps_for_broadcast($maps);
+is_deeply(
+    [sort keys %$summary],
+    [sort map { $_->{wwid} } $maps->@*],
+    'every map with a WWID appears in the summary',
+);
+is($summary->{ $a->{wwid} }->{state}, 'optimal', 'optimal map summarized as optimal');
+is($summary->{ $a->{wwid} }->{'paths-active'}, 2, 'optimal map active path count carried');
+is($summary->{ $a->{wwid} }->{'paths-total'}, 2, 'optimal map total path count carried');
+is($summary->{ $b->{wwid} }->{state}, 'degraded', 'degraded map summarized as degraded');
+is($summary->{ $c->{wwid} }->{state}, 'failed', 'failed map summarized as failed');
+ok(
+    !exists $summary->{ $a->{wwid} }->{transport},
+    'transport omitted when not derived (get_maps fills it live)',
+);
+
+is_deeply(
+    PVE::Multipath::summarize_maps_for_broadcast([]),
+    {},
+    'empty maps list summarizes to empty hash (caller clears the KV)',
+);
+
+# transport/size propagate when the caller (get_maps) has set them
+my $enriched = [{
+    wwid => '3600x',
+    health => 'optimal',
+    'paths-active' => 2,
+    'paths-total' => 2,
+    transport => 'iscsi',
+    size => 34359738368,
+}];
+my $enr = PVE::Multipath::summarize_maps_for_broadcast($enriched);
+is($enr->{'3600x'}->{transport}, 'iscsi', 'transport carried into the summary');
+is($enr->{'3600x'}->{size}, 34359738368, 'size carried into the summary');
+
+# size budget: well under the 32 KiB pmxcfs KV limit even for many maps
+my $many = [
+    map { {
+        wwid => sprintf('3600140500000000000000000000%04x', $_),
+        health => 'optimal',
+        'paths-active' => 4,
+        'paths-total' => 4,
+        transport => 'iscsi',
+        size => 1099511627776,
+    } } 0 .. 99
+];
+my $big = JSON::encode_json(PVE::Multipath::summarize_maps_for_broadcast($many));
+ok(length($big) < 32 * 1024, "100-map summary (" . length($big) . " B) fits the KV size limit");
+
 done_testing();
-- 
2.47.3
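
For illustration only, the consumer side of the rebroadcast protocol could
look roughly like the sketch below. It assumes the per-node values of the
'multipath' key were already fetched into a hash keyed by node name (how
they are fetched is left out on purpose). classify_node_value() and
build_health_matrix() are hypothetical names that are not part of this
patch, the 'invalid' status is an addition of the sketch, and JSON::PP
stands in for the JSON module so the snippet runs on a plain Perl install.

```perl
use strict;
use warnings;

use JSON::PP;

# Same values as $KV_REFRESH_SECONDS / $KV_STALE_SECONDS in the patch.
my $KV_REFRESH_SECONDS = 60;
my $KV_STALE_SECONDS = 3 * $KV_REFRESH_SECONDS;

# Hypothetical helper: classify one node's raw KV value.
#   undef value          -> 'inactive' (key cleared, node is not multipathing)
#   undecodable value    -> 'invalid'  (assumption of this sketch)
#   timestamp too old    -> 'stale'    (reporter stopped updating)
#   otherwise            -> 'current'
sub classify_node_value {
    my ($raw, $now) = @_;

    return { status => 'inactive' } if !defined($raw);

    my $payload = eval { JSON::PP->new->decode($raw) };
    return { status => 'invalid' } if ref($payload) ne 'HASH';

    my $maps = ref($payload->{maps}) eq 'HASH' ? $payload->{maps} : {};
    my $time = $payload->{time};
    my $stale = !defined($time) || $now - $time > $KV_STALE_SECONDS;

    return { status => $stale ? 'stale' : 'current', maps => $maps };
}

# Hypothetical helper: build a WWID -> node -> state matrix from the nodes
# whose value is current; inactive, invalid and stale reporters are skipped.
sub build_health_matrix {
    my ($raw_by_node, $now) = @_;

    my $matrix = {};
    for my $node (sort keys %$raw_by_node) {
        my $res = classify_node_value($raw_by_node->{$node}, $now);
        next if $res->{status} ne 'current';
        for my $wwid (keys %{ $res->{maps} }) {
            $matrix->{$wwid}->{$node} = $res->{maps}->{$wwid}->{state};
        }
    }
    return $matrix;
}

# Usage with made-up data: node1 is fresh, node2 stopped updating, node3
# cleared its key.
my $json = JSON::PP->new->canonical;
my $now = 1_000_000;
my $raw_by_node = {
    node1 => $json->encode({ time => $now - 10, maps => { '3600x' => { state => 'optimal' } } }),
    node2 => $json->encode({ time => $now - 500, maps => { '3600x' => { state => 'failed' } } }),
    node3 => undef,
};
print $json->encode(build_health_matrix($raw_by_node, $now)), "\n";
# prints {"3600x":{"node1":"optimal"}}
```

Dropping stale reporters from the matrix is only one possible policy. A real
consumer might instead show them with a separate "unknown" state, so that a
stalled status daemon stays visible to the admin.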