* [pve-devel] [PATCH perl-rs v2] fix: sdn: fabrics: always add node-ip to all fabric interfaces
@ 2025-09-16 12:40 Gabriel Goller
2025-10-01 15:00 ` Stefan Hanreich
0 siblings, 1 reply; 3+ messages in thread
From: Gabriel Goller @ 2025-09-16 12:40 UTC (permalink / raw)
To: pve-devel
OpenFabric can handle completely unnumbered interfaces, i.e. interfaces
which don't have an ip address configured. This works well in many
setups, but can cause some intricate ARP issues in specific topologies.
The problem
===========
Consider a setup like this:
┌────────┐ ┌────────┐ ┌────────┐
│ Node1 ├─────┤ Node2 ├─────┤ Node3 │
│10.0.1.1│ │10.0.1.2│ │10.0.1.3│
└────────┘ └────────┘ └────────┘
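For reference, the routes OpenFabric installs on Node2 would look
roughly like this (illustrative `ip route` output; actual interface
names, metrics and nexthop ids will differ):

    10.0.1.1 dev ens19 proto openfabric metric 20 onlink src 10.0.1.2
    10.0.1.3 dev ens20 proto openfabric metric 20 onlink src 10.0.1.2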
When pinging from Node1 to Node2 and from Node2 to Node3 everything
works. A ping from Node1 to Node3 also works, but ONLY if there has been
a ping between Node2-Node3 and Node2-Node1 in the last 60 seconds. This
is because there is a subtle difference between pinging from Node2 to
Node3 vs an ARP lookup from Node2 to Node3 happening because Node2 wants
to forward a packet from Node1 to Node3.
We can see this difference quite easily by just tcpdumping all the ARP
packets exiting Node2.
Scenario 1: Ping from Node2 to Node3
------------------------------------
A simple lookup in the routing table tells us that to reach Node3 we
need to go through interface (e.g.) ens20 onlink (so just throw it onto
the link without any nexthop check) and use the src address of
10.0.1.2. So ping will create a simple ICMP packet with the destination
address 10.0.1.3 and the source address 10.0.1.2. To send the
packet, the mac address of the destination needs to be figured out, so
an ARP packet is sent [0]. The arp_announce sysctl is 0 by default, so we
simply check whether any local interface has the address 10.0.1.2 (the
source address of the packet we want to send out); if so, we choose it
as the source address for the ARP packet as well and send it out [1].
This will generate an ARP request like the following:
Request who-has 10.0.1.3 tell 10.0.1.2
We get a nice response:
Reply 10.0.1.3 is-at bc:24:11:91:a0:49
We then send the ping to the resolved mac address, which works and gives
us a result.
The second scenario does not work that well, because the ARP packet is
constructed incorrectly.
Scenario 2: Packet forward from Node2 to Node3
----------------------------------------------
In this scenario we ping from Node1 to Node3, which means that Node2
needs to forward our packet. When the packet arrives at Node2, we again
check the routing table, where we see the same entry as before, so
10.0.1.3 is available at (e.g.) ens20 onlink with the src address
10.0.1.2. This is fine, but we still need to do an ARP request to look
up the mac address of the neighbor attached at ens20. So we take
the packet we got from Node1, which has a source address of 10.0.1.1, and
call arp_solicit on it to construct an ARP request and find the neighbor's
mac address. arp_solicit takes the source address of the packet
(10.0.1.1) and looks it up locally. This check fails because
10.0.1.1 is not available locally (there is a direct route to it, but
it's not configured on any local interface (RTN_LOCAL)). arp_solicit
will thus [2] call inet_select_addr, which goes through all the ip
addresses on the current interface (there are none, because this
interface is unnumbered) and then iterate through all the other
interfaces on the host and select the first one with 'scope link'. This
ip will then be used as the source address for the outgoing ARP
packet. Now if we're lucky this is the dummy interface on our node and
we select the correct source address (10.0.1.2) -- but we could also be
unlucky, and it selects a completely unrelated address from another
interface, e.g. 172.16.0.26. If we're unlucky, arp_solicit will send out
the following ARP packet:
Request who-has 10.0.1.3 tell 172.16.0.26
We will get a correct response, but the response will end up on another
interface (because 172.16.0.26 is not on the same interface as
10.0.1.2). This means we will send out these ARP requests repeatedly and
never get an answer, so the ping from Node1 to Node3 fails with
"Destination host unreachable" errors.
The logic is implemented in the kernel's arp_solicit function, which is
shown here in a simplified form:
static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
{
	__be32 saddr = 0;
	struct net_device *dev = neigh->dev;
	struct in_device *in_dev = __in_dev_get_rcu(dev);
	__be32 target = *(__be32 *)neigh->primary_key;

	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
	default:
	case 0:	/* By default announce any local IP */
		if (skb && inet_addr_type_dev_table(dev_net(dev), dev,
						    ip_hdr(skb)->saddr) == RTN_LOCAL)
			saddr = ip_hdr(skb)->saddr;
		break;
	case 1:	/* Restrict announcements of saddr in same subnet */
		if (!skb)
			break;
		saddr = ip_hdr(skb)->saddr;
		if (inet_addr_type_dev_table(dev_net(dev), dev,
					     saddr) == RTN_LOCAL) {
			/* saddr should be known to target */
			if (inet_addr_onlink(in_dev, target, saddr))
				break;
		}
		saddr = 0;
		break;
	case 2:	/* Avoid secondary IPs, get a primary/preferred one */
		break;
	}

	if (!saddr)
		saddr = inet_select_addr(dev, target, RT_SCOPE_LINK);

	arp_send_dst(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
		     dst_hw, dev->dev_addr, NULL, dst);
}
How to fix this
===============
We could fix this by tweaking the arp_announce sysctl, but we would need
to be careful to only do this on outgoing interfaces, and the problem
is that we can't tell in advance which interfaces will only ever be
outgoing interfaces.
A much simpler solution is to set the ip address of the node
(in our case, for Node2, 10.0.1.2) on every interface which is in the
fabric. In that case inet_select_addr will select the correct address,
because it is set on the outgoing interface. This means all the ARP
requests will be sent out with the correct source address.
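With this change, the stanza generated for an unnumbered fabric
interface carries the node address. Roughly like this (illustrative;
the exact output depends on the render_interface helper in the patch):

    auto ens20
    iface ens20
    	address 10.0.1.2/32
    	ip-forward 1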
[0]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/arp.c#n333
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/arp.c#n352
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/arp.c#n373
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
---
Changelog:
v1:
* explained the problem in the commit message better, no functional
change.
pve-rs/src/bindings/sdn/fabrics.rs | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/pve-rs/src/bindings/sdn/fabrics.rs b/pve-rs/src/bindings/sdn/fabrics.rs
index 587b1d68c8fb..9d5fa6c53d70 100644
--- a/pve-rs/src/bindings/sdn/fabrics.rs
+++ b/pve-rs/src/bindings/sdn/fabrics.rs
@@ -544,12 +544,21 @@ pub mod pve_rs_sdn_fabrics {
write!(interfaces, "{interface}")?;
}
- // If not ip is configured, add auto and empty iface to bring interface up
+ // If no ip is configured, add auto and iface with node ip to bring interface up
+ // OpenFabric doesn't really need an ip on the interface, but the problem
+ // is that arp can't tell which source address to use in some cases, so
+ // it's better if we set the node address on all the fabric interfaces.
if let (None, None) = (interface.ip(), interface.ip6()) {
+ let cidr = Cidr::from(if let Some(ip) = node.ip() {
+ IpAddr::from(ip)
+ } else if let Some(ip) = node.ip6() {
+ IpAddr::from(ip)
+ } else {
+ anyhow::bail!("there has to be a ipv4 or ipv6 node address");
+ });
+ let interface = render_interface(interface.name(), cidr, false)?;
writeln!(interfaces)?;
- writeln!(interfaces, "auto {}", interface.name())?;
- writeln!(interfaces, "iface {}", interface.name())?;
- writeln!(interfaces, "\tip-forward 1")?;
+ write!(interfaces, "{interface}")?;
}
}
}
--
2.47.3
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
* Re: [pve-devel] [PATCH perl-rs v2] fix: sdn: fabrics: always add node-ip to all fabric interfaces
2025-09-16 12:40 [pve-devel] [PATCH perl-rs v2] fix: sdn: fabrics: always add node-ip to all fabric interfaces Gabriel Goller
@ 2025-10-01 15:00 ` Stefan Hanreich
2025-10-02 17:04 ` Gabriel Goller
0 siblings, 1 reply; 3+ messages in thread
From: Stefan Hanreich @ 2025-10-01 15:00 UTC (permalink / raw)
To: Proxmox VE development discussion, Gabriel Goller
Have some additional comments I wanted to post on the ML so they don't
get lost, but they're more food for thought for further improvements of
the whole SDN / Fabrics stack wrt VRFs.
Tested the patch series and looked at the code - LGTM!
Reviewed-by: Stefan Hanreich <s.hanreich@proxmox.com>
Tested-by: Stefan Hanreich <s.hanreich@proxmox.com>
On 9/16/25 2:41 PM, Gabriel Goller wrote:
> [snip]
>
> Scenario 2: Packet forward from Node2 to Node3
> ----------------------------------------------
>
> In this scenario we ping from Node1 to Node3, which means that Node2
> needs to forward our packet. When the packet arrives at Node2, we again
> check the routing table, where we see the same entry as before, so
> 10.0.1.3 is available at (e.g.) ens20 onlink with the src address
> 10.0.1.2. This is fine, but we still need to do an ARP request to lookup
> the mac address of the neighbor which is attached at ens20. So we take
> the packet we get from Node1, which has a source address of 10.0.1.1, and
> call arp_solicit on it to make an ARP request and find the neighbors mac
> address. arp_solicit will take the source address of the packet
> (10.0.1.1) and lookup to search it locally. This check fails because
> 10.0.1.1 is not available locally (there is a direct route to it, but
> it's not configured on any local interface (RTN_LOCAL)). arp_solicit
> will thus [2] call inet_select_addr, which goes through all the ip
> addresses on the current interface (there are none, because this
> interface is unnumbered) and then iterate through all the other
> interfaces on the host and select the first one with 'scope link'. This
> ip will then be selected as the source address for the outgoing ARP
> package. Now if we're lucky this is the dummy interface on our node and
> we select the correct source address (10.0.1.2) -- but we could also be
> unlucky and it selects a completely random address from another
> interface e.g. 172.16.0.26. If we're unlucky arp_solicit will send out
> the following ARP packet:
>
> Request who-has 10.0.1.3 tell 172.16.0.26
>
> We will get a correct response but the response will end up on another
> interface (because 172.16.0.26 is not on the same interface as
> 10.0.1.2). This means we will send out these ARP requests repeatedly and
> never get an answer, so the ping from Node1 to Node3 will get
> "Destination host unreachable errors".
Interesting that the src IP address directive from the route is not
considered at all:
172.16.123.2 nhid 162 via 172.16.123.2 dev ens22 [..] src 172.16.123.1
Didn't dig further into it; there's a good chance there's a good reason
for it that I just didn't see immediately.
What I also found interesting while jumping down this rabbit hole is the
following comment / code section in the inet_select_addr function [1]:
/* For VRFs, the VRF device takes the place of the loopback device,
* with addresses on it being preferred. Note in such cases the
* loopback device will be among the devices that fail the master_idx
* equality check in the loop below.
*/
So, in that case (iiuc) one could side-step the problem by
compartmentalizing the fabric inside its own VRF (further reinforcing my
belief in implementing VRF support sooner rather than later, to avoid
issues like this when running everything in one routing table,
particularly with multiple fabrics).
fabricd has no VRF support atm though (one could potentially run it via
ip vrf exec, but that seems hacky) - OSPF and BGP do.
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/devinet.c?h=v6.17#n1359
> [snip]
* Re: [pve-devel] [PATCH perl-rs v2] fix: sdn: fabrics: always add node-ip to all fabric interfaces
2025-10-01 15:00 ` Stefan Hanreich
@ 2025-10-02 17:04 ` Gabriel Goller
0 siblings, 0 replies; 3+ messages in thread
From: Gabriel Goller @ 2025-10-02 17:04 UTC (permalink / raw)
To: Stefan Hanreich; +Cc: Proxmox VE development discussion
On 01.10.2025 17:00, Stefan Hanreich wrote:
>Have some additional comments I wanted to post on the ML so they don't
>get lost, but they're more food for though for further improvements of
>the whole SDN / Fabrics stack wrt VRFs.
>
>Tested the patch series and looked at the code - LGTM!
>
>Reviewed-by: Stefan Hanreich <s.hanreich@proxmox.com>
>Tested-by: Stefan Hanreich <s.hanreich@proxmox.com>
Thanks for the review!
>On 9/16/25 2:41 PM, Gabriel Goller wrote:
>> [snip]
>> In this scenario we ping from Node1 to Node3, which means that Node2
>> needs to forward our packet. When the packet arrives at Node2, we again
>> check the routing table, where we see the same entry as before, so
>> 10.0.1.3 is available at (e.g.) ens20 onlink with the src address
>> 10.0.1.2. This is fine, but we still need to do an ARP request to lookup
>> the mac address of the neighbor which is attached at ens20. So we take
>> the packet we get from Node1, which has a source address of 10.0.1.1, and
>> call arp_solicit on it to make an ARP request and find the neighbors mac
>> address. arp_solicit will take the source address of the packet
>> (10.0.1.1) and lookup to search it locally. This check fails because
>> 10.0.1.1 is not available locally (there is a direct route to it, but
>> it's not configured on any local interface (RTN_LOCAL)). arp_solicit
>> will thus [2] call inet_select_addr, which goes through all the ip
>> addresses on the current interface (there are none, because this
>> interface is unnumbered) and then iterate through all the other
>> interfaces on the host and select the first one with 'scope link'. This
>> ip will then be selected as the source address for the outgoing ARP
>> package. Now if we're lucky this is the dummy interface on our node and
>> we select the correct source address (10.0.1.2) -- but we could also be
>> unlucky and it selects a completely random address from another
>> interface e.g. 172.16.0.26. If we're unlucky arp_solicit will send out
>> the following ARP packet:
>>
>> Request who-has 10.0.1.3 tell 172.16.0.26
>>
>> We will get a correct response but the response will end up on another
>> interface (because 172.16.0.26 is not on the same interface as
>> 10.0.1.2). This means we will send out these ARP requests repeatedly and
>> never get an answer, so the ping from Node1 to Node3 will get
>> "Destination host unreachable errors".
>
>Interesting, that the src IP address directive from the route is not
>considered at all:
>
>172.16.123.2 nhid 162 via 172.16.123.2 dev ens22 [..] src 172.16.123.1
>
>Didn't dig further into it, there's a good chance there's a good reason
>for that I just didn't see immediately.
Hmm yeah, I'll look into this. Currently the whole neighbor system is
quite generic and limited; I'll see if there is any way we can add
"hints" to the arp_solicit function when it's called from ip_forward.
>What I also found interesting while jumping down this rabbit hole is the
>following comment / code section in the inet_select_addr function [1]:
>
>/* For VRFs, the VRF device takes the place of the loopback device,
> * with addresses on it being preferred. Note in such cases the
> * loopback device will be among the devices that fail the master_idx
> * equality check in the loop below.
> */
>
>So, in that case (iiuc) one could side-step that problem by
>compartmentalizing the fabric inside its own VRF (further reinforcing my
>belief in implementing VRF support sooner than later to avoid issues
>like this when running everything in one routing table, particularly
>multiple fabrics).
>
>fabricd has no VRF support atm though (could potentially run it via ip
>vrf exec, but that seems hacky) - OSPF and BGP do.
I agree, in theory this would be very nice, but hacking VRF support into
lots of stuff that doesn't have it might be tricky. Another example is
WireGuard, where VRF support is also somewhat awkward.
I'll look into OpenFabric VRF support though; it shouldn't be too
hard to implement, as IS-IS already supports it.
>[1]
>https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/devinet.c?h=v6.17#n1359
>
>> For more information check out the simplified version of the arp_solicit
>> function below:
>> [snip]