From: Thomas Lamprecht <t.lamprecht@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH v2 storage 07/13] api: multipath: add cluster-wide health status endpoint
Date: Fri, 3 Jul 2026 14:46:07 +0200
Message-ID: <20260703124707.1172980-9-t.lamprecht@proxmox.com>
In-Reply-To: <20260703124707.1172980-2-t.lamprecht@proxmox.com>

A per-node view cannot tell whether a LUN is healthy across the whole
cluster. Add an endpoint that collects the per-node broadcasts and
combines them into a per-WWID by per-node matrix, rolled up to one
cluster-state per LUN.

The broadcasts are cross-checked against live membership, so a stale
value from an offline node reads as 'unknown' rather than as healthy.
The roll-up is taken over the nodes that are actively multipathing, so
a LUN that is optimal on three nodes but degraded on a fourth shows up
as degraded instead of hiding behind the healthy majority. A node where
a multipath storage is enabled but that broadcasts nothing is surfaced
as missing rather than vanishing from the matrix. Consuming storages
are labeled from the cluster storage config.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
---
Changes in v2:
- status returns { luns, nodes } with per-node config-apply errors
- broadcast and surface a node's apply failure (liveness-checked)
- mark a LUN 'missing' only on nodes expected to carry it, derived
  per LUN from the consuming storage chain's node restrictions
- demote broadcasts older than three refresh intervals to 'unknown'
- tolerate malformed values from a peer's broadcast instead of dying
- clamp unrecognized peer states; drop the unneeded protected flag
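
Reader's note: the roll-up rule from the commit message can be sketched
standalone as below. This is a minimal illustration, not the code from this
patch: the rank table mirrors $STATE_RANK in the diff, while rollup() and its
two arguments are hypothetical names, and the synthesis of 'missing' cells for
expected nodes is deliberately left out.

```perl
use strict;
use warnings;

# Severity ranks mirroring $STATE_RANK from the patch; a higher number is worse.
my %RANK = (unknown => -1, optimal => 0, degraded => 1, missing => 2, failed => 3);

# Hypothetical helper: fold one LUN's per-node states into a cluster state.
#   $states  { node => state } as reported by each node
#   $online  { node => bool }; a node absent here counts as offline
sub rollup {
    my ($states, $online) = @_;
    my ($worst, $have_active) = ('optimal', 0);
    for my $node (sort keys %$states) {
        next if !$online->{$node};    # offline or stale: never drives the roll-up
        my $state = $states->{$node};
        # clamp anything unrecognized to 'unknown', like known_state() in the patch
        $state = 'unknown' if !defined($state) || !exists($RANK{$state});
        next if $state eq 'unknown';
        $have_active = 1;
        $worst = $state if $RANK{$state} > $RANK{$worst};
    }
    return $have_active ? $worst : 'unknown';
}

# Three optimal nodes do not hide the degraded fourth.
print rollup(
    { n1 => 'optimal', n2 => 'optimal', n3 => 'optimal', n4 => 'degraded' },
    { n1 => 1, n2 => 1, n3 => 1, n4 => 1 },
), "\n";    # prints "degraded"
```

Taking the worst state rather than a majority vote is what keeps a single
degraded node visible in the cluster-state.
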
src/PVE/API2/Multipath.pm | 135 ++++++++++++++++++++++++
src/PVE/Multipath.pm | 132 ++++++++++++++++++++++++
src/test/run_multipath_tests.pl | 176 ++++++++++++++++++++++++++++++++
3 files changed, 443 insertions(+)
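
Reader's note: the liveness cross-check for a broadcasting node combines two
conditions, cluster membership and broadcast age. The sketch below restates
that rule under stated assumptions: node_is_live() is a hypothetical helper,
and $max_age stands in for $PVE::Multipath::KV_STALE_SECONDS, whose value is
not shown in this patch. It covers only nodes that did broadcast; silent
expected nodes are resolved from membership alone in the endpoint.

```perl
use strict;
use warnings;

# Hypothetical helper restating the endpoint's liveness rule for a reporter.
#   $members  { node => { online => bool } }; empty on a standalone node
#   $kv_time  timestamp of the node's last broadcast, may be undef
#   $max_age  staleness threshold in seconds (assumed parameter)
sub node_is_live {
    my ($members, $node, $kv_time, $now, $max_age) = @_;
    # a broadcast without a timestamp, or one that is too old, is stale
    return 0 if !defined($kv_time) || $now - $kv_time > $max_age;
    # standalone setups carry no member info; treat the reporter as live
    return 1 if !%$members;
    return $members->{$node} && $members->{$node}->{online} ? 1 : 0;
}

# An online member whose broadcast stopped refreshing is not live.
print node_is_live({ n1 => { online => 1 } }, 'n1', 100, 200, 30), "\n";    # prints "0"
```
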
diff --git a/src/PVE/API2/Multipath.pm b/src/PVE/API2/Multipath.pm
index cb138f6..d43217a 100644
--- a/src/PVE/API2/Multipath.pm
+++ b/src/PVE/API2/Multipath.pm
@@ -3,6 +3,9 @@ package PVE::API2::Multipath;
use strict;
use warnings;
+use JSON qw(decode_json);
+
+use PVE::Cluster;
use PVE::Exception qw(raise_param_exc);
use PVE::JSONSchema qw(get_standard_option);
use PVE::Tools qw(extract_param);
@@ -127,6 +130,138 @@ __PACKAGE__->register_method({
},
});
+__PACKAGE__->register_method({
+ name => 'status',
+ path => 'status',
+ method => 'GET',
+ description => "Cluster-wide multipath health: a per-WWID by per-node matrix"
+ . " rolled up over the nodes that are actively multipathing.",
+ permissions => {
+ check => ['perm', '/', ['Sys.Audit']],
+ },
+ parameters => {
+ additionalProperties => 0,
+ properties => {},
+ },
+ returns => {
+ type => 'object',
+ properties => {
+ luns => {
+ type => 'array',
+ description => "Per-WWID health: a per-node map-state matrix rolled up to one"
+ . " cluster-state per LUN.",
+ items => {
+ type => 'object',
+ additionalProperties => 1,
+ properties => {
+ wwid => { type => 'string', description => 'The LUN WWID.' },
+ alias => {
+ type => 'string',
+ description => 'The configured alias, if any.',
+ optional => 1,
+ },
+ 'used-by' => {
+ type => 'string',
+ description => 'The storage consuming this LUN, if any.',
+ optional => 1,
+ },
+ size => {
+ type => 'integer',
+ description => 'LUN size in bytes, as reported by a node.',
+ optional => 1,
+ },
+ 'cluster-state' => {
+ type => 'string',
+ description => "Worst map state across the actively multipathing"
+ . " nodes: 'optimal', 'degraded' (some paths down on a node),"
+ . " 'missing' (an active node has not assembled it), 'failed'"
+ . " (no active path), or 'unknown' (no active node reports it).",
+ enum => ['optimal', 'degraded', 'missing', 'failed', 'unknown'],
+ },
+ nodes => {
+ type => 'object',
+ description => 'Per-node map state, keyed by node name.',
+ additionalProperties => 1,
+ },
+ },
+ },
+ },
+ nodes => {
+ type => 'object',
+ description =>
+ "Per-node config-apply status, keyed by node name; only nodes that"
+ . " failed to apply the configuration appear.",
+ additionalProperties => 1,
+ },
+ },
+ },
+ code => sub {
+ my $cfg = PVE::Multipath::ClusterConfig::read_config();
+
+ my $now = time();
+ my $raw_kv = PVE::Cluster::get_node_kv('multipath');
+ my $node_kv = {};
+ my $stale = {};
+ for my $node (keys %$raw_kv) {
+ my $decoded = eval { decode_json($raw_kv->{$node}) };
+ next if !$decoded || ref($decoded) ne 'HASH' || ref($decoded->{maps}) ne 'HASH';
+ $node_kv->{$node} = $decoded->{maps};
+ # a reporter that stopped refreshing its broadcast gets demoted below: its last
+ # snapshot must not read as current health
+ $stale->{$node} = 1
+ if !defined($decoded->{time})
+ || $now - $decoded->{time} > $PVE::Multipath::KV_STALE_SECONDS;
+ }
+
+ my $expectations = PVE::Multipath::cluster_storage_expectations();
+ my $allow_wwids = PVE::Multipath::Config::wwid_list($cfg);
+
+ # per-LUN expected node set: the consuming storage chain when known, else wherever a
+ # multipath storage is enabled
+ my $expected =
+ { map { $_ => $expectations->{'wwid-nodes'}->{$_} // $expectations->{nodes} }
+ $allow_wwids->@* };
+
+ # resolve liveness for every node we might place in the matrix: those that broadcast and
+ # those a multipath storage expects (and that may be silent)
+ my $members = PVE::Cluster::get_members() // {};
+ my $online = {};
+ for my $node (keys %$node_kv, keys $expectations->{nodes}->%*) {
+ # standalone clusters carry no member info; treat the reporter as live
+ $online->{$node} =
+ (!%$members || ($members->{$node} && $members->{$node}->{online}))
+ && !$stale->{$node} ? 1 : 0;
+ }
+
+ my $luns = PVE::Multipath::aggregate_cluster_status(
+ $allow_wwids,
+ PVE::Multipath::Config::aliases($cfg),
+ $expectations->{consumers},
+ $node_kv,
+ $online,
+ $expected,
+ );
+
+ # surface nodes that could not apply the cluster config, so local drift is visible instead
+ # of silently diverging from the cluster-wide configuration
+ my $apply_kv = PVE::Cluster::get_node_kv('multipath-apply');
+ my $nodes = {};
+ for my $node (keys %$apply_kv) {
+ # an offline node cannot clear its own KV, so skip its stale apply error (same liveness
+ # rule as the health roll-up); a standalone cluster has no member info, so keep it
+ next if %$members && !($members->{$node} && $members->{$node}->{online});
+ my $decoded = eval { decode_json($apply_kv->{$node}) };
+ next if !$decoded || ref($decoded) ne 'HASH' || !$decoded->{error};
+ $nodes->{$node} = {
+ 'apply-error' => $decoded->{error},
+ 'apply-time' => $decoded->{time},
+ };
+ }
+
+ return { luns => $luns, nodes => $nodes };
+ },
+});
+
__PACKAGE__->register_method({
name => 'read_overrides',
path => 'overrides',
diff --git a/src/PVE/Multipath.pm b/src/PVE/Multipath.pm
index ec87b08..67165c6 100644
--- a/src/PVE/Multipath.pm
+++ b/src/PVE/Multipath.pm
@@ -478,4 +478,136 @@ sub broadcast_health {
update_node_kv('multipath', { maps => $summary });
}
+# Broadcast whether this node could apply the cluster-wide multipath config, so the datacenter view
+# can flag a node whose local multipathd state drifted from the configured one instead of failing
+# silently. Pass the error to publish it, or nothing to clear it after a successful apply.
+sub broadcast_apply_status {
+ my ($error) = @_;
+ update_node_kv('multipath-apply', $error ? { error => "$error" } : undef);
+}
+
+# Severity ordering for rolling per-node states up into a cluster state; a higher number is worse.
+# 'unknown' is a stale or offline node and never drives the roll-up, so it sits below 'optimal'.
+my $STATE_RANK = {
+ unknown => -1,
+ optimal => 0,
+ degraded => 1,
+ missing => 2,
+ failed => 3,
+};
+
+# Clamp a state string from another node's broadcast to the known set, so a newer or buggy peer
+# cannot inject an unrankable state into the matrix.
+my sub known_state {
+ my ($state) = @_;
+ return defined($state) && exists($STATE_RANK->{$state}) ? $state : 'unknown';
+}
+
+# Pure: fold the per-node health summaries (already JSON-decoded) into a per-WWID cluster matrix.
+# Inputs:
+# $allow_wwids arrayref, the cluster WWID allow-list
+# $aliases { wwid => name }
+# $used_by { wwid => storage-id } of consuming LVM storages
+# $node_kv { node => { wwid => summary } } as broadcast by broadcast_health()
+# $online { node => bool }; a node absent here counts as offline
+# $expected { wwid => { node => 1 } } nodes that are supposed to carry each allow-listed
+# LUN, derived from storage_expectations()
+#
+# The cluster-state rolls up over the nodes that should carry each LUN: an expected node without
+# the map is 'missing', whether it broadcasts other maps or nothing at all (it lost every path and
+# cleared its broadcast). A node outside the LUN's expected set is never marked missing, so a SAN
+# zoned to only some nodes or a purely hand-managed map does not drag unrelated nodes red; a real
+# state such a node reports still counts. An allow-listed LUN without expectation info (no
+# multipath storage configured) falls back to expecting every actively multipathing node, the only
+# signal left then. Offline nodes show as 'unknown' and never drive the roll-up. WWIDs outside the
+# allow-list list only the nodes that actually report them.
+sub aggregate_cluster_status {
+ my ($allow_wwids, $aliases, $used_by, $node_kv, $online, $expected) = @_;
+
+ $allow_wwids //= [];
+ $aliases //= {};
+ $used_by //= {};
+ $node_kv //= {};
+ $online //= {};
+ $expected //= {};
+
+ my %allow = map { $_ => 1 } $allow_wwids->@*;
+
+ # report the allow-list plus any WWID a node actually sees
+ my %wwids = %allow;
+ for my $node (keys %$node_kv) {
+ $wwids{$_} = 1 for keys $node_kv->{$node}->%*;
+ }
+
+ my $active_nodes = { map { $_ => 1 } grep { $online->{$_} } keys %$node_kv };
+
+ my $res = [];
+ for my $wwid (sort keys %wwids) {
+ my $nodes = {};
+ my $worst = 'optimal';
+ my $have_active = 0;
+ my $size;
+
+ my $rank = sub {
+ my ($state) = @_;
+ $worst = $state if $STATE_RANK->{$state} > $STATE_RANK->{$worst};
+ };
+
+ my $exp = $allow{$wwid} ? $expected->{$wwid} : undef;
+ $exp = $active_nodes if $allow{$wwid} && !($exp && %$exp);
+
+ for my $node (sort keys %$node_kv) {
+ my $entry = $node_kv->{$node}->{$wwid};
+
+ if (!$online->{$node}) {
+ $nodes->{$node} = { state => 'unknown' } if $entry;
+ next;
+ }
+
+ if ($entry) {
+ my $state = known_state($entry->{state});
+ $have_active = 1 if $state ne 'unknown';
+ $nodes->{$node} = {
+ state => $state,
+ 'paths-active' => $entry->{'paths-active'},
+ 'paths-total' => $entry->{'paths-total'},
+ defined($entry->{transport}) ? (transport => $entry->{transport}) : (),
+ };
+ $size //= $entry->{size};
+ $rank->($state);
+ } elsif ($exp && $exp->{$node}) {
+ # expected to carry this LUN but has not assembled it
+ $have_active = 1;
+ $nodes->{$node} = { state => 'missing' };
+ $rank->('missing');
+ }
+ }
+
+ # Expected nodes with no broadcast at all are missing the map (online) or unreachable
+ # (offline); fold them in so a node that lost every path surfaces instead of vanishing.
+ for my $node (sort keys %{ $exp // {} }) {
+ next if exists $nodes->{$node};
+ if ($online->{$node}) {
+ $have_active = 1;
+ $nodes->{$node} = { state => 'missing' };
+ $rank->('missing');
+ } else {
+ $nodes->{$node} = { state => 'unknown' };
+ }
+ }
+
+ push $res->@*,
+ {
+ wwid => $wwid,
+ defined($aliases->{$wwid}) ? (alias => $aliases->{$wwid}) : (),
+ defined($used_by->{$wwid}) ? ('used-by' => $used_by->{$wwid}) : (),
+ defined($size) ? (size => $size) : (),
+ 'cluster-state' => $have_active ? $worst : 'unknown',
+ nodes => $nodes,
+ };
+ }
+
+ return $res;
+}
+
1;
diff --git a/src/test/run_multipath_tests.pl b/src/test/run_multipath_tests.pl
index a4ad57d..793c4b4 100755
--- a/src/test/run_multipath_tests.pl
+++ b/src/test/run_multipath_tests.pl
@@ -407,4 +407,180 @@ my $many = [
my $big = JSON::encode_json(PVE::Multipath::summarize_maps_for_broadcast($many));
ok(length($big) < 32 * 1024, "100-map summary (" . length($big) . " B) fits the KV size limit");
+# --- cluster status aggregation ---
+my $node_kv = {
+ nodeA => {
+ wA =>
+ { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2, transport => 'iscsi' },
+ wB => {
+ state => 'optimal',
+ 'paths-active' => 2,
+ 'paths-total' => 2,
+ transport => 'iscsi',
+ size => 42,
+ },
+ },
+ nodeB => {
+ wA =>
+ { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2, transport => 'iscsi' },
+ # nodeB is active but does not see wB
+ },
+ nodeC => {
+ # stale broadcast from an offline node
+ wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 },
+ },
+};
+my $agg = PVE::Multipath::aggregate_cluster_status(
+ ['wA', 'wB', 'wZ'], # allow-list incl. an unseen WWID
+ { wA => 'lun-a' }, # alias
+ { wB => 'mptank' }, # used-by
+ $node_kv,
+ { nodeA => 1, nodeB => 1, nodeC => 0 }, # nodeC offline
+);
+my %by_wwid = map { $_->{wwid} => $_ } $agg->@*;
+
+is_deeply([sort keys %by_wwid], ['wA', 'wB', 'wZ'], 'matrix covers allow-list and seen WWIDs');
+
+is($by_wwid{wA}->{alias}, 'lun-a', 'alias surfaced on the WWID row');
+is($by_wwid{wA}->{'cluster-state'}, 'degraded', 'degraded on one active node rolls up to degraded');
+is($by_wwid{wA}->{nodes}->{nodeA}->{state}, 'optimal', 'per-node optimal cell kept');
+is($by_wwid{wA}->{nodes}->{nodeB}->{state}, 'degraded', 'per-node degraded cell kept');
+is($by_wwid{wA}->{nodes}->{nodeC}->{state}, 'unknown',
+ 'offline node with stale data shows unknown');
+
+is($by_wwid{wB}->{'used-by'}, 'mptank', 'consuming storage surfaced as used-by');
+is($by_wwid{wB}->{size}, 42, 'LUN size surfaced from a reporting node');
+is(
+ $by_wwid{wB}->{'cluster-state'},
+ 'missing',
+ 'active node not assembling the LUN rolls up to missing',
+);
+is(
+ $by_wwid{wB}->{nodes}->{nodeB}->{state},
+ 'missing',
+ 'missing marked on the active node lacking it',
+);
+
+is(
+ $by_wwid{wZ}->{'cluster-state'},
+ 'missing',
+ 'allow-listed WWID no active node assembled is missing everywhere',
+);
+is($by_wwid{wZ}->{nodes}->{nodeA}->{state}, 'missing', 'active node missing the allow-listed WWID');
+ok(!exists $by_wwid{wZ}->{nodes}->{nodeC}, 'offline node contributes no cell for an unseen WWID');
+
+# a WWID only an offline node ever reported, with no online active node, is unknown
+my $agg_off = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ { dead => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } } },
+ { dead => 0 },
+);
+is(
+ $agg_off->[0]->{'cluster-state'},
+ 'unknown',
+ 'no online active node leaves the cluster-state unknown',
+);
+is($agg_off->[0]->{nodes}->{dead}->{state}, 'unknown', 'stale offline node shown as unknown');
+
+# failure outranks degraded in the roll-up
+my $agg2 = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ {
+ n1 => { wA => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+ n2 => { wA => { state => 'failed', 'paths-active' => 0, 'paths-total' => 2 } },
+ },
+ { n1 => 1, n2 => 1 },
+);
+is($agg2->[0]->{'cluster-state'}, 'failed', 'failed outranks degraded in the cluster roll-up');
+
+# --- expected-node set: a node that lost all paths (silent) must not vanish ---
+# nodeS is expected (a multipath storage is enabled there) and online, but
+# broadcasts nothing - e.g. every path to the SAN is down so it cleared its KV.
+my $exp_kv = {
+ nodeA => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+};
+my $online = { nodeA => 1, nodeS => 1, nodeOff => 0 };
+my $expected = { wA => { nodeA => 1, nodeS => 1, nodeOff => 1 } };
+my $eagg = PVE::Multipath::aggregate_cluster_status(
+ ['wA'], {}, {}, $exp_kv, $online, $expected,
+);
+my $row = $eagg->[0];
+is($row->{nodes}->{nodeA}->{state}, 'optimal', 'reporting node keeps its real state');
+is(
+ $row->{nodes}->{nodeS}->{state},
+ 'missing',
+ 'expected online but silent node shows missing instead of vanishing',
+);
+is($row->{nodes}->{nodeOff}->{state}, 'unknown', 'expected offline node shows unknown');
+is($row->{'cluster-state'}, 'missing', 'a silent expected node drags the cluster-state to missing');
+
+# without $expected the silent node would have been invisible (regression guard
+# for the old behavior, proving the new param is what surfaces it)
+my $noexp = PVE::Multipath::aggregate_cluster_status(['wA'], {}, {}, $exp_kv, $online);
+ok(
+ !exists $noexp->[0]->{nodes}->{nodeS},
+ 'without the expected set the silent node is absent (the gap the param closes)',
+);
+is($noexp->[0]->{'cluster-state'}, 'optimal', 'and the cluster-state would falsely read optimal');
+
+# expected augmentation applies only to allow-listed WWIDs, not to a LUN that a
+# node merely happens to report off-list
+my $offlist = PVE::Multipath::aggregate_cluster_status(
+ [],
+ {},
+ {},
+ { nodeA => { wX => { state => 'optimal', 'paths-active' => 1, 'paths-total' => 1 } } },
+ { nodeA => 1, nodeS => 1 },
+ {},
+);
+ok(
+ !exists $offlist->[0]->{nodes}->{nodeS},
+ 'non-allow-listed WWID does not synthesize missing cells for expected nodes',
+);
+
+# a broadcasting node outside a LUN's expected set is never marked missing (a SAN zoned to only
+# some nodes), and a hand-made off-list map lists only the nodes that report it
+my $zoned = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ {
+ n1 => { wA => { state => 'optimal', 'paths-active' => 2, 'paths-total' => 2 } },
+ n2 => { wHand => { state => 'degraded', 'paths-active' => 1, 'paths-total' => 2 } },
+ },
+ { n1 => 1, n2 => 1 },
+ { wA => { n1 => 1 } },
+);
+my %zoned_rows = map { $_->{wwid} => $_ } $zoned->@*;
+ok(
+ !exists $zoned_rows{wA}->{nodes}->{n2},
+ 'a multipathing node outside the expected set is not marked missing',
+);
+is($zoned_rows{wA}->{'cluster-state'}, 'optimal', 'the zoned LUN stays green on its own nodes');
+ok(
+ !exists $zoned_rows{wHand}->{nodes}->{n1},
+ 'an off-list map does not drag other multipathing nodes into its row',
+);
+is(
+ $zoned_rows{wHand}->{'cluster-state'},
+ 'degraded',
+ 'an off-list map still surfaces its real state',
+);
+
+# an unrecognized state from a (newer or buggy) peer clamps to unknown and does not fake a report
+my $clamped = PVE::Multipath::aggregate_cluster_status(
+ ['wA'],
+ {},
+ {},
+ { n1 => { wA => { state => 'frobnicated', 'paths-active' => 1, 'paths-total' => 2 } } },
+ { n1 => 1 },
+ { wA => { n1 => 1 } },
+);
+is($clamped->[0]->{nodes}->{n1}->{state}, 'unknown', 'unrecognized peer state clamps to unknown');
+is($clamped->[0]->{'cluster-state'}, 'unknown', 'a clamped state does not count as active');
+
done_testing();
--
2.47.3
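
Reader's note: a caller of the new endpoint receives { luns, nodes } as
described in the return schema. A minimal sketch of consuming the 'luns' part
follows; unhealthy_luns() is a hypothetical helper and the sample data is
illustrative only, shaped like the rows the tests in this patch assert on.

```perl
use strict;
use warnings;

# Hypothetical consumer: one line per LUN whose cluster-state is not optimal,
# naming the nodes that are not optimal for it.
sub unhealthy_luns {
    my ($status) = @_;
    my @out;
    for my $lun (@{ $status->{luns} // [] }) {
        my $state = $lun->{'cluster-state'} // 'unknown';
        next if $state eq 'optimal';
        my $nodes = $lun->{nodes} // {};
        my @bad = grep { ($nodes->{$_}->{state} // 'unknown') ne 'optimal' } sort keys %$nodes;
        push @out, sprintf('%s: %s (%s)', $lun->{alias} // $lun->{wwid}, $state, join(', ', @bad));
    }
    return \@out;
}

# Illustrative data only, not output captured from a real cluster.
my $status = {
    luns => [
        {
            wwid => 'wA',
            alias => 'lun-a',
            'cluster-state' => 'degraded',
            nodes => {
                nodeA => { state => 'optimal' },
                nodeB => { state => 'degraded' },
                nodeC => { state => 'unknown' },
            },
        },
        { wwid => 'wB', 'cluster-state' => 'optimal', nodes => { nodeA => { state => 'optimal' } } },
    ],
    nodes => {},
};
print "$_\n" for @{ unhealthy_luns($status) };    # prints "lun-a: degraded (nodeB, nodeC)"
```
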
Thread overview: 14+ messages
2026-07-03 12:46 [PATCH v2 storage,cluster,manager 0/13] multipath: cluster-wide config, storage and health overview Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 01/13] multipath: add helper library and managed configuration Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 02/13] api: disks: add read-only multipath status endpoint Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 03/13] api: multipath: add cluster-wide configuration endpoints Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 04/13] multipath: add storage plugin for multipath LUNs Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 05/13] lvm: allow a multipath storage as the base device Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 storage 06/13] multipath: broadcast per-node map health to the cluster KV store Thomas Lamprecht
2026-07-03 12:46 ` Thomas Lamprecht [this message]
2026-07-03 12:46 ` [PATCH v2 cluster 08/13] pmxcfs: track cluster-wide multipath configuration Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 manager 09/13] pvestatd: apply the cluster-wide multipath config on each node Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 manager 10/13] api: cluster: mount the multipath configuration endpoint Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 manager 11/13] pvestatd: broadcast multipath map health to the cluster Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 manager 12/13] ui: dc: add multipath health matrix and config editor Thomas Lamprecht
2026-07-03 12:46 ` [PATCH v2 manager 13/13] ui: node: show multipath maps and their paths under Disks Thomas Lamprecht