From: Thomas Lamprecht
To: pve-devel@lists.proxmox.com
Subject: [PATCH v2 storage 07/13] api: multipath: add cluster-wide health status endpoint
Date: Fri, 3 Jul 2026 14:46:07 +0200
Message-ID: <20260703124707.1172980-9-t.lamprecht@proxmox.com>
In-Reply-To: <20260703124707.1172980-2-t.lamprecht@proxmox.com>
References: <20260703124707.1172980-2-t.lamprecht@proxmox.com>

A per-node view cannot tell whether a LUN is healthy across the whole
cluster. Add an endpoint that collects the per-node broadcasts and
combines them into a per-WWID by per-node matrix, rolled up to one
cluster-state per LUN.

The broadcasts are cross-checked against live membership, so a stale
value from an offline node reads as 'unknown' rather than as healthy.
The roll-up is taken over the nodes that are actively multipathing, so
a LUN that is optimal on three nodes but degraded on a fourth shows up
as degraded instead of hiding behind the healthy majority.

A node where a multipath storage is enabled but that broadcasts nothing
is surfaced as missing rather than vanishing from the matrix. Consuming
storages are labeled from the cluster storage config.

Signed-off-by: Thomas Lamprecht
---
Changes in v2:
- status returns { luns, nodes } with per-node config-apply errors
- broadcast and surface a node's apply failure (liveness-checked)
- mark a LUN 'missing' only on nodes expected to carry it, derived per
  LUN from the consuming storage chain's node restrictions
- demote broadcasts older than three refresh intervals to 'unknown'
- tolerate malformed values from a peer's broadcast instead of dying
- clamp unrecognized peer states; drop the unneeded protected flag

 src/PVE/API2/Multipath.pm       | 135 ++++++++++++++++++++++++
 src/PVE/Multipath.pm            | 132 ++++++++++++++++++++++++
 src/test/run_multipath_tests.pl | 176 ++++++++++++++++++++++++++++++++
 3 files changed, 443 insertions(+)

diff --git a/src/PVE/API2/Multipath.pm b/src/PVE/API2/Multipath.pm
index cb138f6..d43217a 100644
--- a/src/PVE/API2/Multipath.pm
+++ b/src/PVE/API2/Multipath.pm
@@ -3,6 +3,9 @@ package PVE::API2::Multipath;
 use strict;
 use warnings;
 
+use JSON qw(decode_json);
+
+use PVE::Cluster;
 use PVE::Exception qw(raise_param_exc);
 use PVE::JSONSchema qw(get_standard_option);
 use PVE::Tools qw(extract_param);
@@ -127,6 +130,138 @@ __PACKAGE__->register_method({
     },
 });
 
+__PACKAGE__->register_method({
+    name => 'status',
+    path => 'status',
+    method => 'GET',
+    description => "Cluster-wide multipath health: a per-WWID by per-node matrix"
+        . " rolled up over the nodes that are actively multipathing.",
+    permissions => {
+        check => ['perm', '/', ['Sys.Audit']],
+    },
+    parameters => {
+        additionalProperties => 0,
+        properties => {},
+    },
+    returns => {
+        type => 'object',
+        properties => {
+            luns => {
+                type => 'array',
+                description => "Per-WWID health: a per-node map-state matrix rolled up to one"
+                    . " cluster-state per LUN.",
+                items => {
+                    type => 'object',
+                    additionalProperties => 1,
+                    properties => {
+                        wwid => { type => 'string', description => 'The LUN WWID.' },
+                        alias => {
+                            type => 'string',
+                            description => 'The configured alias, if any.',
+                            optional => 1,
+                        },
+                        'used-by' => {
+                            type => 'string',
+                            description => 'The storage consuming this LUN, if any.',
+                            optional => 1,
+                        },
+                        size => {
+                            type => 'integer',
+                            description => 'LUN size in bytes, as reported by a node.',
+                            optional => 1,
+                        },
+                        'cluster-state' => {
+                            type => 'string',
+                            description => "Worst map state across the actively multipathing"
+                                . " nodes: 'optimal', 'degraded' (some paths down on a node),"
+                                . " 'missing' (an active node has not assembled it), 'failed'"
+                                . " (no active path), or 'unknown' (no active node reports it).",
+                            enum => ['optimal', 'degraded', 'missing', 'failed', 'unknown'],
+                        },
+                        nodes => {
+                            type => 'object',
+                            description => 'Per-node map state, keyed by node name.',
+                            additionalProperties => 1,
+                        },
+                    },
+                },
+            },
+            nodes => {
+                type => 'object',
+                description =>
+                    "Per-node config-apply status, keyed by node name; only nodes that"
+                    . " failed to apply the configuration appear.",
+                additionalProperties => 1,
+            },
+        },
+    },
+    code => sub {
+        my $cfg = PVE::Multipath::ClusterConfig::read_config();
+
+        my $now = time();
+        my $raw_kv = PVE::Cluster::get_node_kv('multipath');
+        my $node_kv = {};
+        my $stale = {};
+        for my $node (keys %$raw_kv) {
+            my $decoded = eval { decode_json($raw_kv->{$node}) };
+            next if !$decoded || ref($decoded) ne 'HASH' || ref($decoded->{maps}) ne 'HASH';
+            $node_kv->{$node} = $decoded->{maps};
+            # a reporter that stopped refreshing its broadcast gets demoted below: its last
+            # snapshot must not read as current health
+            $stale->{$node} = 1
+                if !defined($decoded->{time})
+                || $now - $decoded->{time} > $PVE::Multipath::KV_STALE_SECONDS;
+        }
+
+        my $expectations = PVE::Multipath::cluster_storage_expectations();
+        my $allow_wwids = PVE::Multipath::Config::wwid_list($cfg);
+
+        # per-LUN expected node set: the consuming storage chain when known, else wherever a
+        # multipath storage is enabled
+        my $expected =
+            { map { $_ => $expectations->{'wwid-nodes'}->{$_} // $expectations->{nodes} }
+                $allow_wwids->@* };
+
+        # resolve liveness for every node we might place in the matrix: those that broadcast and
+        # those a multipath storage expects (and that may be silent)
+        my $members = PVE::Cluster::get_members() // {};
+        my $online = {};
+        for my $node (keys %$node_kv, keys $expectations->{nodes}->%*) {
+            # standalone clusters carry no member info; treat the reporter as live
+            $online->{$node} =
+                (!%$members || ($members->{$node} && $members->{$node}->{online}))
+                && !$stale->{$node} ? 1 : 0;
+        }
+
+        my $luns = PVE::Multipath::aggregate_cluster_status(
+            $allow_wwids,
+            PVE::Multipath::Config::aliases($cfg),
+            $expectations->{consumers},
+            $node_kv,
+            $online,
+            $expected,
+        );
+
+        # surface nodes that could not apply the cluster config, so local drift is visible instead
+        # of silently diverging from the cluster-wide configuration
+        my $apply_kv = PVE::Cluster::get_node_kv('multipath-apply');
+        my $nodes = {};
+        for my $node (keys %$apply_kv) {
+            # an offline node cannot clear its own KV, so skip its stale apply error (same liveness
+            # rule as the health roll-up); a standalone cluster has no member info, so keep it
+            next if %$members && !($members->{$node} && $members->{$node}->{online});
+            my $decoded = eval { decode_json($apply_kv->{$node}) };
+            next if !$decoded || ref($decoded) ne 'HASH' || !$decoded->{error};
+            $nodes->{$node} = {
+                'apply-error' => $decoded->{error},
+                'apply-time' => $decoded->{time},
+            };
+        }
+
+        return { luns => $luns, nodes => $nodes };
+    },
+});
+
 __PACKAGE__->register_method({
     name => 'read_overrides',
     path => 'overrides',
diff --git a/src/PVE/Multipath.pm b/src/PVE/Multipath.pm
index ec87b08..67165c6 100644
--- a/src/PVE/Multipath.pm
+++ b/src/PVE/Multipath.pm
@@ -478,4 +478,136 @@ sub broadcast_health {
     update_node_kv('multipath', { maps => $summary });
 }
 
+# Broadcast whether this node could apply the cluster-wide multipath config, so the datacenter view
+# can flag a node whose local multipathd state drifted from the configured one instead of failing
+# silently. Pass the error to publish it, or nothing to clear it after a successful apply.
+sub broadcast_apply_status {
+    my ($error) = @_;
+    update_node_kv('multipath-apply', $error ? { error => "$error" } : undef);
+}
+
+# Severity ordering for rolling per-node states up into a cluster state; a higher number is worse.
+# 'unknown' is a stale or offline node and never drives the roll-up, so it sits below 'optimal'.
+my $STATE_RANK = {
+    unknown => -1,
+    optimal => 0,
+    degraded => 1,
+    missing => 2,
+    failed => 3,
+};
+
+# Clamp a state string from another node's broadcast to the known set, so a newer or buggy peer
+# cannot inject an unrankable state into the matrix.
+my sub known_state {
+    my ($state) = @_;
+    return defined($state) && exists($STATE_RANK->{$state}) ? $state : 'unknown';
+}
+
+# Pure: fold the per-node health summaries (already JSON-decoded) into a per-WWID cluster matrix.
+# Inputs:
+#   $allow_wwids arrayref, the cluster WWID allow-list
+#   $aliases     { wwid => name }
+#   $used_by     { wwid => storage-id } of consuming LVM storages
+#   $node_kv     { node => { wwid => summary } } as broadcast by broadcast_health()
+#   $online      { node => bool }; a node absent here counts as offline
+#   $expected    { wwid => { node => 1 } } nodes that are supposed to carry each allow-listed
+#                LUN, derived from storage_expectations()
+#
+# The cluster-state rolls up over the nodes that should carry each LUN: an expected node without
+# the map is 'missing', whether it broadcasts other maps or nothing at all (it lost every path and
+# cleared its broadcast). A node outside the LUN's expected set is never marked missing, so a SAN
+# zoned to only some nodes or a purely hand-managed map does not drag unrelated nodes red; a real
+# state such a node reports still counts. An allow-listed LUN without expectation info (no
+# multipath storage configured) falls back to expecting every actively multipathing node, the only
+# signal left then. Offline nodes show as 'unknown' and never drive the roll-up. WWIDs outside the
+# allow-list list only the nodes that actually report them.
+sub aggregate_cluster_status {
+    my ($allow_wwids, $aliases, $used_by, $node_kv, $online, $expected) = @_;
+
+    $allow_wwids //= [];
+    $aliases //= {};
+    $used_by //= {};
+    $node_kv //= {};
+    $online //= {};
+    $expected //= {};
+
+    my %allow = map { $_ => 1 } $allow_wwids->@*;
+
+    # report the allow-list plus any WWID a node actually sees
+    my %wwids = %allow;
+    for my $node (keys %$node_kv) {
+        $wwids{$_} = 1 for keys $node_kv->{$node}->%*;
+    }
+
+    my $active_nodes = { map { $_ => 1 } grep { $online->{$_} } keys %$node_kv };
+
+    my $res = [];
+    for my $wwid (sort keys %wwids) {
+        my $nodes = {};
+        my $worst = 'optimal';
+        my $have_active = 0;
+        my $size;
+
+        my $rank = sub {
+            my ($state) = @_;
+            $worst = $state if $STATE_RANK->{$state} > $STATE_RANK->{$worst};
+        };
+
+        my $exp = $allow{$wwid} ? $expected->{$wwid} : undef;
+        $exp = $active_nodes if $allow{$wwid} && !($exp && %$exp);
+
+        for my $node (sort keys %$node_kv) {
+            my $entry = $node_kv->{$node}->{$wwid};
+
+            if (!$online->{$node}) {
+                $nodes->{$node} = { state => 'unknown' } if $entry;
+                next;
+            }
+
+            if ($entry) {
+                my $state = known_state($entry->{state});
+                $have_active = 1 if $state ne 'unknown';
+                $nodes->{$node} = {
+                    state => $state,
+                    'paths-active' => $entry->{'paths-active'},
+                    'paths-total' => $entry->{'paths-total'},
+                    defined($entry->{transport}) ? (transport => $entry->{transport}) : (),
+                };
+                $size //= $entry->{size};
+                $rank->($state);
+            } elsif ($exp && $exp->{$node}) {
+                # expected to carry this LUN but has not assembled it
+                $have_active = 1;
+                $nodes->{$node} = { state => 'missing' };
+                $rank->('missing');
+            }
+        }
+
+        # Expected nodes with no broadcast at all are missing the map (online) or unreachable
+        # (offline); fold them in so a node that lost every path surfaces instead of vanishing.
+        for my $node (sort keys %{ $exp // {} }) {
+            next if exists $nodes->{$node};
+            if ($online->{$node}) {
+                $have_active = 1;
+                $nodes->{$node} = { state => 'missing' };
+                $rank->('missing');
+            } else {
+                $nodes->{$node} = { state => 'unknown' };
+            }
+        }
+
+        push $res->@*,
+            {
+                wwid => $wwid,
+                defined($aliases->{$wwid}) ? (alias => $aliases->{$wwid}) : (),
+                defined($used_by->{$wwid}) ? ('used-by' => $used_by->{$wwid}) : (),
+                defined($size) ? (size => $size) : (),
+                'cluster-state' => $have_active ? $worst : 'unknown',
+                nodes => $nodes,
+            };
+    }
+
+    return $res;
+}
+
 1;
diff --git a/src/test/run_multipath_tests.pl b/src/test/run_multipath_tests.pl
index a4ad57d..793c4b4 100755
--- a/src/test/run_multipath_tests.pl
+++ b/src/test/run_multipath_tests.pl
@@ -407,4 +407,180 @@ my $many = [
 my $big = JSON::encode_json(PVE::Multipath::summarize_maps_for_broadcast($many));
 ok(length($big) < 32 * 1024, "100-map summary (" . length($big) . " B) fits the KV size limit");
 
+# --- cluster status aggregation ---
+my $node_kv = {
+    nodeA => {
+        wA =>
+            { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2, transport => 'iscsi' },
+        wB => {
+            state => 'optimal',
+            'paths-active' => 2,
+            'paths-total' => 2,
+            transport => 'iscsi',
+            size => 42,
+        },
+    },
+    nodeB => {
+        wA =>
+            { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2, transport => 'iscsi' },
+        # nodeB is active but does not see wB
+    },
+    nodeC => {
+        # stale broadcast from an offline node
+        wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 },
+    },
+};
+my $agg = PVE::Multipath::aggregate_cluster_status(
+    ['wA', 'wB', 'wZ'], # allow-list incl. an unseen WWID
+    { wA => 'lun-a' }, # alias
+    { wB => 'mptank' }, # used-by
+    $node_kv,
+    { nodeA => 1, nodeB => 1, nodeC => 0 }, # nodeC offline
+);
+my %by_wwid = map { $_->{wwid} => $_ } $agg->@*;
+
+is_deeply([sort keys %by_wwid], ['wA', 'wB', 'wZ'], 'matrix covers allow-list and seen WWIDs');
+
+is($by_wwid{wA}->{alias}, 'lun-a', 'alias surfaced on the WWID row');
+is($by_wwid{wA}->{'cluster-state'}, 'degraded', 'degraded on one active node rolls up to degraded');
+is($by_wwid{wA}->{nodes}->{nodeA}->{state}, 'optimal', 'per-node optimal cell kept');
+is($by_wwid{wA}->{nodes}->{nodeB}->{state}, 'degraded', 'per-node degraded cell kept');
+is($by_wwid{wA}->{nodes}->{nodeC}->{state}, 'unknown',
+    'offline node with stale data shows unknown');
+
+is($by_wwid{wB}->{'used-by'}, 'mptank', 'consuming storage surfaced as used-by');
+is($by_wwid{wB}->{size}, 42, 'LUN size surfaced from a reporting node');
+is(
+    $by_wwid{wB}->{'cluster-state'},
+    'missing',
+    'active node not assembling the LUN rolls up to missing',
+);
+is(
+    $by_wwid{wB}->{nodes}->{nodeB}->{state},
+    'missing',
+    'missing marked on the active node lacking it',
+);
+
+is(
+    $by_wwid{wZ}->{'cluster-state'},
+    'missing',
+    'allow-listed WWID no active node assembled is missing everywhere',
+);
+is($by_wwid{wZ}->{nodes}->{nodeA}->{state}, 'missing', 'active node missing the allow-listed WWID');
+ok(!exists $by_wwid{wZ}->{nodes}->{nodeC}, 'offline node contributes no cell for an unseen WWID');
+
+# a WWID only an offline node ever reported, with no online active node, is unknown
+my $agg_off = PVE::Multipath::aggregate_cluster_status(
+    ['wA'],
+    {},
+    {},
+    { dead => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } } },
+    { dead => 0 },
+);
+is(
+    $agg_off->[0]->{'cluster-state'},
+    'unknown',
+    'no online active node leaves the cluster-state unknown',
+);
+is($agg_off->[0]->{nodes}->{dead}->{state}, 'unknown', 'stale offline node shown as unknown');
+
+# failure outranks degraded in the roll-up
+my $agg2 = PVE::Multipath::aggregate_cluster_status(
+    ['wA'],
+    {},
+    {},
+    {
+        n1 => { wA => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+        n2 => { wA => { state => 'failed', 'paths-active' => 0, 'paths-total' => 2 } },
+    },
+    { n1 => 1, n2 => 1 },
+);
+is($agg2->[0]->{'cluster-state'}, 'failed', 'failed outranks degraded in the cluster roll-up');
+
+# --- expected-node set: a node that lost all paths (silent) must not vanish ---
+# nodeS is expected (a multipath storage is enabled there) and online, but
+# broadcasts nothing - e.g. every path to the SAN is down so it cleared its KV.
+my $exp_kv = {
+    nodeA => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+};
+my $online = { nodeA => 1, nodeS => 1, nodeOff => 0 };
+my $expected = { wA => { nodeA => 1, nodeS => 1, nodeOff => 1 } };
+my $eagg = PVE::Multipath::aggregate_cluster_status(
+    ['wA'], {}, {}, $exp_kv, $online, $expected,
+);
+my $row = $eagg->[0];
+is($row->{nodes}->{nodeA}->{state}, 'optimal', 'reporting node keeps its real state');
+is(
+    $row->{nodes}->{nodeS}->{state},
+    'missing',
+    'expected online but silent node shows missing instead of vanishing',
+);
+is($row->{nodes}->{nodeOff}->{state}, 'unknown', 'expected offline node shows unknown');
+is($row->{'cluster-state'}, 'missing', 'a silent expected node drags the cluster-state to missing');
+
+# without $expected the silent node would have been invisible (regression guard
+# for the old behavior, proving the new param is what surfaces it)
+my $noexp = PVE::Multipath::aggregate_cluster_status(['wA'], {}, {}, $exp_kv, $online);
+ok(
+    !exists $noexp->[0]->{nodes}->{nodeS},
+    'without the expected set the silent node is absent (the gap the param closes)',
+);
+is($noexp->[0]->{'cluster-state'}, 'optimal', 'and the cluster-state would falsely read optimal');
+
+# expected augmentation applies only to allow-listed WWIDs, not to a LUN that a
+# node merely happens to report off-list
+my $offlist = PVE::Multipath::aggregate_cluster_status(
+    [],
+    {},
+    {},
+    { nodeA => { wX => { state => 'optimal', 'paths-active' => 1, 'paths-total' => 1 } } },
+    { nodeA => 1, nodeS => 1 },
+    {},
+);
+ok(
+    !exists $offlist->[0]->{nodes}->{nodeS},
+    'non-allow-listed WWID does not synthesize missing cells for expected nodes',
+);
+
+# a broadcasting node outside a LUN's expected set is never marked missing (a SAN zoned to only
+# some nodes), and a hand-made off-list map lists only the nodes that report it
+my $zoned = PVE::Multipath::aggregate_cluster_status(
+    ['wA'],
+    {},
+    {},
+    {
+        n1 => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+        n2 => { wHand => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+    },
+    { n1 => 1, n2 => 1 },
+    { wA => { n1 => 1 } },
+);
+my %zoned_rows = map { $_->{wwid} => $_ } $zoned->@*;
+ok(
+    !exists $zoned_rows{wA}->{nodes}->{n2},
+    'a multipathing node outside the expected set is not marked missing',
+);
+is($zoned_rows{wA}->{'cluster-state'}, 'optimal', 'the zoned LUN stays green on its own nodes');
+ok(
+    !exists $zoned_rows{wHand}->{nodes}->{n1},
+    'an off-list map does not drag other multipathing nodes into its row',
+);
+is(
+    $zoned_rows{wHand}->{'cluster-state'},
+    'degraded',
+    'an off-list map still surfaces its real state',
+);
+
+# an unrecognized state from a (newer or buggy) peer clamps to unknown and does not fake a report
+my $clamped = PVE::Multipath::aggregate_cluster_status(
+    ['wA'],
+    {},
+    {},
+    { n1 => { wA => { state => 'frobnicated', 'paths-active' => 1, 'paths-total' => 2 } } },
+    { n1 => 1 },
+    { wA => { n1 => 1 } },
+);
+is($clamped->[0]->{nodes}->{n1}->{state}, 'unknown', 'unrecognized peer state clamps to unknown');
+is($clamped->[0]->{'cluster-state'}, 'unknown', 'a clamped state does not count as active');
+
 done_testing();
-- 
2.47.3