From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH v2 storage 07/13] api: multipath: add cluster-wide health status endpoint
Date: Fri, 3 Jul 2026 14:46:07 +0200
Message-ID: <20260703124707.1172980-9-t.lamprecht@proxmox.com>
In-Reply-To: <20260703124707.1172980-2-t.lamprecht@proxmox.com>

A per-node view cannot tell whether a LUN is healthy across the whole
cluster. Add an endpoint that collects the per-node broadcasts and
combines them into a per-WWID by per-node matrix, rolled up to one
cluster-state per LUN.

The broadcasts are cross-checked against live membership, so a stale
value from an offline node reads as 'unknown' rather than as healthy.
The roll-up is taken over the nodes that are actively multipathing, so
a LUN that is optimal on three nodes but degraded on a fourth shows up
as degraded instead of hiding behind the healthy majority. A node where
a multipath storage is enabled but that broadcasts nothing is surfaced
as missing rather than vanishing from the matrix. Consuming storages
are labeled from the cluster storage config.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
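For illustration only, the roll-up described in the commit message can
be sketched standalone as follows. This is a simplified sketch under
stated assumptions, not the patch's implementation: roll_up() and the
sample node names are hypothetical, the ranking mirrors $STATE_RANK
from this patch, and the 'missing' synthesis that
aggregate_cluster_status() performs for expected nodes is left out.

```perl
use strict;
use warnings;

# Severity ranking mirroring $STATE_RANK from the patch; higher is worse.
my %rank = (unknown => -1, optimal => 0, degraded => 1, missing => 2, failed => 3);

# Hypothetical helper (not in the patch): roll per-node states up to one cluster state.
#   $states  { node => state }
#   $online  { node => bool }; a node absent here counts as offline
sub roll_up {
    my ($states, $online) = @_;
    my ($worst, $have_active) = ('optimal', 0);
    for my $node (sort keys %$states) {
        next if !$online->{$node};    # offline nodes never drive the roll-up
        my $state = $states->{$node};
        # clamp anything unrecognized to 'unknown', which does not count as a report
        $state = 'unknown' if !defined($state) || !exists $rank{$state};
        next if $state eq 'unknown';
        $have_active = 1;
        $worst = $state if $rank{$state} > $rank{$worst};
    }
    return $have_active ? $worst : 'unknown';
}

# Three healthy nodes do not hide the degraded fourth.
print roll_up(
    { n1 => 'optimal', n2 => 'optimal', n3 => 'optimal', n4 => 'degraded' },
    { n1 => 1, n2 => 1, n3 => 1, n4 => 1 },
), "\n";    # prints "degraded"
```

Taking the worst state rather than a majority vote is the point of the
design: a single unhealthy node is what an admin needs to see.
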
Changes in v2:
- status returns { luns, nodes } with per-node config-apply errors
- broadcast and surface a node's apply failure (liveness-checked)
- mark a LUN 'missing' only on nodes expected to carry it, derived
  per LUN from the consuming storage chain's node restrictions
- demote broadcasts older than three refresh intervals to 'unknown'
- tolerate malformed values from a peer's broadcast instead of dying
- clamp unrecognized peer states; drop the unneeded protected flag
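
As a rough sketch of how the liveness and staleness changes listed
above combine, assuming the member-info shape the endpoint reads
({ node => { online => bool } }): is_live() and the 30 second
threshold are hypothetical placeholders, the endpoint itself uses
$PVE::Multipath::KV_STALE_SECONDS and does this check inline.

```perl
use strict;
use warnings;

# Placeholder threshold for illustration only (assumption); the patch reads
# $PVE::Multipath::KV_STALE_SECONDS instead.
my $STALE_SECONDS = 30;

# Hypothetical helper (not in the patch): may this node's broadcast be read as
# current health?
#   $members         { node => { online => bool } }, empty on a standalone node
#   $broadcast_time  epoch seconds of the node's last broadcast, or undef
sub is_live {
    my ($node, $members, $broadcast_time, $now) = @_;
    # a broadcast without a timestamp, or one that is too old, is stale
    my $stale = !defined($broadcast_time) || $now - $broadcast_time > $STALE_SECONDS;
    # no member info at all means standalone: treat the reporter as live
    my $member_ok = !%$members || ($members->{$node} && $members->{$node}->{online});
    return ($member_ok && !$stale) ? 1 : 0;
}

print is_live('n1', { n1 => { online => 1 } }, 100, 120), "\n";    # prints 1
print is_live('n1', { n1 => { online => 1 } }, 100, 200), "\n";    # prints 0
```

A node that fails this check keeps its cells in the matrix but shows
as 'unknown', so its last snapshot cannot read as current health.
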
src/PVE/API2/Multipath.pm | 135 ++++++++++++++++++++++++
src/PVE/Multipath.pm | 132 ++++++++++++++++++++++++
src/test/run_multipath_tests.pl | 176 ++++++++++++++++++++++++++++++++
3 files changed, 443 insertions(+)
diff --git a/src/PVE/API2/Multipath.pm b/src/PVE/API2/Multipath.pm
index cb138f6..d43217a 100644
--- a/src/PVE/API2/Multipath.pm
+++ b/src/PVE/API2/Multipath.pm
@@ -3,6 +3,9 @@ package PVE::API2::Multipath;
use strict;
use warnings;
+use JSON qw(decode_json);
+
+use PVE::Cluster;
use PVE::Exception qw(raise_param_exc);
use PVE::JSONSchema qw(get_standard_option);
use PVE::Tools qw(extract_param);
@@ -127,6 +130,138 @@ __PACKAGE__->register_method({
},
});
+__PACKAGE__->register_method({
+ name => 'status',
+ path => 'status',
+ method => 'GET',
+ description => "Cluster-wide multipath health: a per-WWID by per-node matrix"
+ . " rolled up over the nodes that are actively multipathing.",
+ permissions => {
+ check => ['perm', '/', ['Sys.Audit']],
+ },
+ parameters => {
+ additionalProperties => 0,
+ properties => {},
+ },
+ returns => {
+ type => 'object',
+ properties => {
+ luns => {
+ type => 'array',
+ description => "Per-WWID health: a per-node map-state matrix rolled up to one"
+ . " cluster-state per LUN.",
+ items => {
+ type => 'object',
+ additionalProperties => 1,
+ properties => {
+ wwid => { type => 'string', description => 'The LUN WWID.' },
+ alias => {
+ type => 'string',
+ description => 'The configured alias, if any.',
+ optional => 1,
+ },
+ 'used-by' => {
+ type => 'string',
+ description => 'The storage consuming this LUN, if any.',
+ optional => 1,
+ },
+ size => {
+ type => 'integer',
+ description => 'LUN size in bytes, as reported by a node.',
+ optional => 1,
+ },
+ 'cluster-state' => {
+ type => 'string',
+ description => "Worst map state across the actively multipathing"
+ . " nodes: 'optimal', 'degraded' (some paths down on a node),"
+ . " 'missing' (an active node has not assembled it), 'failed'"
+ . " (no active path), or 'unknown' (no active node reports it).",
+ enum => ['optimal', 'degraded', 'missing', 'failed', 'unknown'],
+ },
+ nodes => {
+ type => 'object',
+ description => 'Per-node map state, keyed by node name.',
+ additionalProperties => 1,
+ },
+ },
+ },
+ },
+ nodes => {
+ type => 'object',
+ description =>
+ "Per-node config-apply status, keyed by node name; only nodes that"
+ . " failed to apply the configuration appear.",
+ additionalProperties => 1,
+ },
+ },
+ },
+ code => sub {
+ my $cfg = PVE::Multipath::ClusterConfig::read_config();
+
+ my $now = time();
+ my $raw_kv = PVE::Cluster::get_node_kv('multipath');
+ my $node_kv = {};
+ my $stale = {};
+ for my $node (keys %$raw_kv) {
+ my $decoded = eval { decode_json($raw_kv->{$node}) };
+ next if !$decoded || ref($decoded) ne 'HASH' || ref($decoded->{maps}) ne 'HASH';
+ $node_kv->{$node} = $decoded->{maps};
+ # a reporter that stopped refreshing its broadcast gets demoted below: its last
+ # snapshot must not read as current health
+ $stale->{$node} = 1
+ if !defined($decoded->{time})
+ || $now - $decoded->{time} > $PVE::Multipath::KV_STALE_SECONDS;
+ }
+
+ my $expectations = PVE::Multipath::cluster_storage_expectations();
+ my $allow_wwids = PVE::Multipath::Config::wwid_list($cfg);
+
+ # per-LUN expected node set: the consuming storage chain when known, else wherever a
+ # multipath storage is enabled
+ my $expected =
+ { map { $_ => $expectations->{'wwid-nodes'}->{$_} // $expectations->{nodes} }
+ $allow_wwids->@* };
+
+ # resolve liveness for every node we might place in the matrix: those that broadcast and
+ # those a multipath storage expects (and that may be silent)
+ my $members = PVE::Cluster::get_members() // {};
+ my $online = {};
+ for my $node (keys %$node_kv, keys $expectations->{nodes}->%*) {
+ # standalone clusters carry no member info; treat the reporter as live
+ $online->{$node} =
+ (!%$members || ($members->{$node} && $members->{$node}->{online}))
+ && !$stale->{$node} ? 1 : 0;
+ }
+
+ my $luns = PVE::Multipath::aggregate_cluster_status(
+ $allow_wwids,
+ PVE::Multipath::Config::aliases($cfg),
+ $expectations->{consumers},
+ $node_kv,
+ $online,
+ $expected,
+ );
+
+ # surface nodes that could not apply the cluster config, so local drift is visible instead
+ # of silently diverging from the cluster-wide configuration
+ my $apply_kv = PVE::Cluster::get_node_kv('multipath-apply');
+ my $nodes = {};
+ for my $node (keys %$apply_kv) {
+ # an offline node cannot clear its own KV, so skip its stale apply error (same liveness
+ # rule as the health roll-up); a standalone cluster has no member info, so keep it
+ next if %$members && !($members->{$node} && $members->{$node}->{online});
+ my $decoded = eval { decode_json($apply_kv->{$node}) };
+ next if !$decoded || ref($decoded) ne 'HASH' || !$decoded->{error};
+ $nodes->{$node} = {
+ 'apply-error' => $decoded->{error},
+ 'apply-time' => $decoded->{time},
+ };
+ }
+
+ return { luns => $luns, nodes => $nodes };
+ },
+});
+
__PACKAGE__->register_method({
name => 'read_overrides',
path => 'overrides',
diff --git a/src/PVE/Multipath.pm b/src/PVE/Multipath.pm
index ec87b08..67165c6 100644
--- a/src/PVE/Multipath.pm
+++ b/src/PVE/Multipath.pm
@@ -478,4 +478,136 @@ sub broadcast_health {
update_node_kv('multipath', { maps => $summary });
}
+# Broadcast whether this node could apply the cluster-wide multipath config, so the datacenter view
+# can flag a node whose local multipathd state drifted from the configured one instead of failing
+# silently. Pass the error to publish it, or nothing to clear it after a successful apply.
+sub broadcast_apply_status {
+ my ($error) = @_;
+ update_node_kv('multipath-apply', $error ? { error => "$error" } : undef);
+}
+
+# Severity ordering for rolling per-node states up into a cluster state; a higher number is worse.
+# 'unknown' is a stale or offline node and never drives the roll-up, so it sits below 'optimal'.
+my $STATE_RANK = {
+ unknown => -1,
+ optimal => 0,
+ degraded => 1,
+ missing => 2,
+ failed => 3,
+};
+
+# Clamp a state string from another node's broadcast to the known set, so a newer or buggy peer
+# cannot inject an unrankable state into the matrix.
+my sub known_state {
+ my ($state) = @_;
+ return defined($state) && exists($STATE_RANK->{$state}) ? $state : 'unknown';
+}
+
+# Pure: fold the per-node health summaries (already JSON-decoded) into a per-WWID cluster matrix.
+# Inputs:
+#   $allow_wwids  arrayref, the cluster WWID allow-list
+#   $aliases      { wwid => name }
+#   $used_by      { wwid => storage-id } of consuming LVM storages
+#   $node_kv      { node => { wwid => summary } } as broadcast by broadcast_health()
+#   $online       { node => bool }; a node absent here counts as offline
+#   $expected     { wwid => { node => 1 } } nodes that are supposed to carry each allow-listed
+#                 LUN, derived from cluster_storage_expectations()
+#
+# The cluster-state rolls up over the nodes that should carry each LUN: an expected node without
+# the map is 'missing', whether it broadcasts other maps or nothing at all (it lost every path and
+# cleared its broadcast). A node outside the LUN's expected set is never marked missing, so a SAN
+# zoned to only some nodes or a purely hand-managed map does not drag unrelated nodes red; a real
+# state such a node reports still counts. An allow-listed LUN without expectation info (no
+# multipath storage configured) falls back to expecting every actively multipathing node, the only
+# signal left then. Offline nodes show as 'unknown' and never drive the roll-up. WWIDs outside the
+# allow-list list only the nodes that actually report them.
+sub aggregate_cluster_status {
+ my ($allow_wwids, $aliases, $used_by, $node_kv, $online, $expected) = @_;
+
+ $allow_wwids //= [];
+ $aliases //= {};
+ $used_by //= {};
+ $node_kv //= {};
+ $online //= {};
+ $expected //= {};
+
+ my %allow = map { $_ => 1 } $allow_wwids->@*;
+
+ # report the allow-list plus any WWID a node actually sees
+ my %wwids = %allow;
+ for my $node (keys %$node_kv) {
+ $wwids{$_} = 1 for keys $node_kv->{$node}->%*;
+ }
+
+ my $active_nodes = { map { $_ => 1 } grep { $online->{$_} } keys %$node_kv };
+
+ my $res = [];
+ for my $wwid (sort keys %wwids) {
+ my $nodes = {};
+ my $worst = 'optimal';
+ my $have_active = 0;
+ my $size;
+
+ my $rank = sub {
+ my ($state) = @_;
+ $worst = $state if $STATE_RANK->{$state} > $STATE_RANK->{$worst};
+ };
+
+ my $exp = $allow{$wwid} ? $expected->{$wwid} : undef;
+ $exp = $active_nodes if $allow{$wwid} && !($exp && %$exp);
+
+ for my $node (sort keys %$node_kv) {
+ my $entry = $node_kv->{$node}->{$wwid};
+
+ if (!$online->{$node}) {
+ $nodes->{$node} = { state => 'unknown' } if $entry;
+ next;
+ }
+
+ if ($entry) {
+ my $state = known_state($entry->{state});
+ $have_active = 1 if $state ne 'unknown';
+ $nodes->{$node} = {
+ state => $state,
+ 'paths-active' => $entry->{'paths-active'},
+ 'paths-total' => $entry->{'paths-total'},
+ defined($entry->{transport}) ? (transport => $entry->{transport}) : (),
+ };
+ $size //= $entry->{size};
+ $rank->($state);
+ } elsif ($exp && $exp->{$node}) {
+ # expected to carry this LUN but has not assembled it
+ $have_active = 1;
+ $nodes->{$node} = { state => 'missing' };
+ $rank->('missing');
+ }
+ }
+
+ # Expected nodes with no broadcast at all are missing the map (online) or unreachable
+ # (offline); fold them in so a node that lost every path surfaces instead of vanishing.
+ for my $node (sort keys %{ $exp // {} }) {
+ next if exists $nodes->{$node};
+ if ($online->{$node}) {
+ $have_active = 1;
+ $nodes->{$node} = { state => 'missing' };
+ $rank->('missing');
+ } else {
+ $nodes->{$node} = { state => 'unknown' };
+ }
+ }
+
+ push $res->@*,
+ {
+ wwid => $wwid,
+ defined($aliases->{$wwid}) ? (alias => $aliases->{$wwid}) : (),
+ defined($used_by->{$wwid}) ? ('used-by' => $used_by->{$wwid}) : (),
+ defined($size) ? (size => $size) : (),
+ 'cluster-state' => $have_active ? $worst : 'unknown',
+ nodes => $nodes,
+ };
+ }
+
+ return $res;
+}
+
1;
diff --git a/src/test/run_multipath_tests.pl b/src/test/run_multipath_tests.pl
index a4ad57d..793c4b4 100755
--- a/src/test/run_multipath_tests.pl
+++ b/src/test/run_multipath_tests.pl
@@ -407,4 +407,180 @@ my $many = [
my $big = JSON::encode_json(PVE::Multipath::summarize_maps_for_broadcast($many));
ok(length($big) < 32 * 1024, "100-map summary (" . length($big) . " B) fits the KV size limit");
+# --- cluster status aggregation ---
+my $node_kv = {
+ nodeA => {
+ wA =>
+ { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2, transport => 'iscsi' },
+ wB => {
+ state => 'optimal',
+ 'paths-active' => 2,
+ 'paths-total' => 2,
+ transport => 'iscsi',
+ size => 42,
+ },
+ },
+ nodeB => {
+ wA =>
+ { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2, transport => 'iscsi' },
+ # nodeB is active but does not see wB
+ },
+ nodeC => {
+ # stale broadcast from an offline node
+ wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 },
+ },
+};
+my $agg = PVE::Multipath::aggregate_cluster_status(
+ ['wA', 'wB', 'wZ'], # allow-list incl. an unseen WWID
+ { wA => 'lun-a' }, # alias
+ { wB => 'mptank' }, # used-by
+ $node_kv,
+ { nodeA => 1, nodeB => 1, nodeC => 0 }, # nodeC offline
+);
+my %by_wwid = map { $_->{wwid} => $_ } $agg->@*;
+
+is_deeply([sort keys %by_wwid], ['wA', 'wB', 'wZ'], 'matrix covers allow-list and seen WWIDs');
+
+is($by_wwid{wA}->{alias}, 'lun-a', 'alias surfaced on the WWID row');
+is($by_wwid{wA}->{'cluster-state'}, 'degraded', 'degraded on one active node rolls up to degraded');
+is($by_wwid{wA}->{nodes}->{nodeA}->{state}, 'optimal', 'per-node optimal cell kept');
+is($by_wwid{wA}->{nodes}->{nodeB}->{state}, 'degraded', 'per-node degraded cell kept');
+is($by_wwid{wA}->{nodes}->{nodeC}->{state}, 'unknown',
+ 'offline node with stale data shows unknown');
+
+is($by_wwid{wB}->{'used-by'}, 'mptank', 'consuming storage surfaced as used-by');
+is($by_wwid{wB}->{size}, 42, 'LUN size surfaced from a reporting node');
+is(
+ $by_wwid{wB}->{'cluster-state'},
+ 'missing',
+ 'active node not assembling the LUN rolls up to missing',
+);
+is(
+ $by_wwid{wB}->{nodes}->{nodeB}->{state},
+ 'missing',
+ 'missing marked on the active node lacking it',
+);
+
+is(
+ $by_wwid{wZ}->{'cluster-state'},
+ 'missing',
+ 'allow-listed WWID no active node assembled is missing everywhere',
+);
+is($by_wwid{wZ}->{nodes}->{nodeA}->{state}, 'missing', 'active node missing the allow-listed WWID');
+ok(!exists $by_wwid{wZ}->{nodes}->{nodeC}, 'offline node contributes no cell for an unseen WWID');
+
+# a WWID only an offline node ever reported, with no online active node, is unknown
+my $agg_off = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ { dead => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } } },
+ { dead => 0 },
+);
+is(
+ $agg_off->[0]->{'cluster-state'},
+ 'unknown',
+ 'no online active node leaves the cluster-state unknown',
+);
+is($agg_off->[0]->{nodes}->{dead}->{state}, 'unknown', 'stale offline node shown as unknown');
+
+# failure outranks degraded in the roll-up
+my $agg2 = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ {
+ n1 => { wA => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+ n2 => { wA => { state => 'failed', 'paths-active' => 0, 'paths-total' => 2 } },
+ },
+ { n1 => 1, n2 => 1 },
+);
+is($agg2->[0]->{'cluster-state'}, 'failed', 'failed outranks degraded in the cluster roll-up');
+
+# --- expected-node set: a node that lost all paths (silent) must not vanish ---
+# nodeS is expected (a multipath storage is enabled there) and online, but
+# broadcasts nothing - e.g. every path to the SAN is down so it cleared its KV.
+my $exp_kv = {
+ nodeA => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+};
+my $online = { nodeA => 1, nodeS => 1, nodeOff => 0 };
+my $expected = { wA => { nodeA => 1, nodeS => 1, nodeOff => 1 } };
+my $eagg = PVE::Multipath::aggregate_cluster_status(
+ ['wA'], {}, {}, $exp_kv, $online, $expected,
+);
+my $row = $eagg->[0];
+is($row->{nodes}->{nodeA}->{state}, 'optimal', 'reporting node keeps its real state');
+is(
+ $row->{nodes}->{nodeS}->{state},
+ 'missing',
+ 'expected online but silent node shows missing instead of vanishing',
+);
+is($row->{nodes}->{nodeOff}->{state}, 'unknown', 'expected offline node shows unknown');
+is($row->{'cluster-state'}, 'missing', 'a silent expected node drags the cluster-state to missing');
+
+# without $expected the silent node would have been invisible (regression guard
+# for the old behavior, proving the new param is what surfaces it)
+my $noexp = PVE::Multipath::aggregate_cluster_status(['wA'], {}, {}, $exp_kv, $online);
+ok(
+ !exists $noexp->[0]->{nodes}->{nodeS},
+ 'without the expected set the silent node is absent (the gap the param closes)',
+);
+is($noexp->[0]->{'cluster-state'}, 'optimal', 'and the cluster-state would falsely read optimal');
+
+# expected augmentation applies only to allow-listed WWIDs, not to a LUN that a
+# node merely happens to report off-list
+my $offlist = PVE::Multipath::aggregate_cluster_status(
+ [],
+ {},
+ {},
+ { nodeA => { wX => { state => 'optimal', 'paths-active' => 1, 'paths-total' => 1 } } },
+ { nodeA => 1, nodeS => 1 },
+ {},
+);
+ok(
+ !exists $offlist->[0]->{nodes}->{nodeS},
+ 'non-allow-listed WWID does not synthesize missing cells for expected nodes',
+);
+
+# a broadcasting node outside a LUN's expected set is never marked missing (a SAN zoned to only
+# some nodes), and a hand-made off-list map lists only the nodes that report it
+my $zoned = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ {
+ n1 => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+ n2 => { wHand => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+ },
+ { n1 => 1, n2 => 1 },
+ { wA => { n1 => 1 } },
+);
+my %zoned_rows = map { $_->{wwid} => $_ } $zoned->@*;
+ok(
+ !exists $zoned_rows{wA}->{nodes}->{n2},
+ 'a multipathing node outside the expected set is not marked missing',
+);
+is($zoned_rows{wA}->{'cluster-state'}, 'optimal', 'the zoned LUN stays green on its own nodes');
+ok(
+ !exists $zoned_rows{wHand}->{nodes}->{n1},
+ 'an off-list map does not drag other multipathing nodes into its row',
+);
+is(
+ $zoned_rows{wHand}->{'cluster-state'},
+ 'degraded',
+ 'an off-list map still surfaces its real state',
+);
+
+# an unrecognized state from a (newer or buggy) peer clamps to unknown and does not fake a report
+my $clamped = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ { n1 => { wA => { state => 'frobnicated', 'paths-active' => 1, 'paths-total' => 2 } } },
+ { n1 => 1 },
+ { wA => { n1 => 1 } },
+);
+is($clamped->[0]->{nodes}->{n1}->{state}, 'unknown', 'unrecognized peer state clamps to unknown');
+is($clamped->[0]->{'cluster-state'}, 'unknown', 'a clamped state does not count as active');
+
done_testing();
--
2.47.3