public inbox for pve-devel@lists.proxmox.com
* [RFC cluster/docs/ifupdown2/manager/network/proxmox{-ebpf,-ve-rs,-perl-rs} 00/16] sdn: add microsegmentation support
@ 2026-06-09 13:25 Hannes Laimer
From: Hannes Laimer @ 2026-06-09 13:25 UTC (permalink / raw)
  To: pve-devel

This adds support for microsegmentation using eBPF programs attached to
interfaces, mostly directly to the tap/veth interfaces of the guests.

# Overview
Each guest interface can be assigned to a group, and policies/rules can then be
defined between these groups. A policy is a mapping
 `(src,dst) -> allow`
Each interface can be assigned to only a *single* group. Groups, however, can be
part of other groups, which should cover most use cases I came up with. In case
multiple policies match (possible because groups/sub-groups can have overlapping
policies), the most specific policy wins: the one that names the destination group
closest in the tree, and if two name it equally closely, the one that names the
source closest, so:

 - web      (db,web)->allow
   - wiki                    # from `web`: (db,web)->allow
     * CT100,net0
   - wol    (db,wol)->deny   # closer than the (db,web)->allow inherited from web, so deny wins
     * CT101,net0            # => packets from db to CT101,net0 are denied

 - db       (web,db)->allow
   * CT200,net1              # packets from `web` reaching CT200,net1 will be allowed

Because each group has a single parent, the destination-then-source ordering always
picks exactly one rule, so there is no tie to resolve (a malformed config with
duplicate rules falls back to deny).
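
A minimal sketch of the destination-then-source ordering, in Python for brevity
(the agent itself is Rust). The group names follow the example above; the
`PARENT`/`RULES` tables, the function names, the assumption that a rule also
applies to descendants of its source group, and the deny default when nothing
matches are all illustrative and not taken from the series:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory model of the example tree: group -> parent.
PARENT: Dict[str, Optional[str]] = {
    "web": None,
    "wiki": "web",
    "wol": "web",
    "db": None,
}

# (src, dst) -> allow
RULES: Dict[Tuple[str, str], bool] = {
    ("db", "web"): True,   # (db,web)->allow
    ("db", "wol"): False,  # (db,wol)->deny
    ("web", "db"): True,   # (web,db)->allow
}


def chain(group: str) -> List[str]:
    """Return the group followed by its ancestors, closest first."""
    out: List[str] = []
    cur: Optional[str] = group
    while cur is not None:
        out.append(cur)
        cur = PARENT[cur]
    return out


def resolve(src: str, dst: str) -> bool:
    """Closest destination wins; among equally close ones, closest source wins."""
    for d in chain(dst):        # outer loop: destination, closest first
        for s in chain(src):    # inner loop: source, closest first
            if (s, d) in RULES:
                return RULES[(s, d)]
    return False                # assumption: no matching rule means deny
```

With this model, `resolve("db", "wiki")` finds the inherited `(db,web)` rule,
while `resolve("db", "wol")` stops at the closer `(db,wol)` rule first.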

If an interface has a group assigned, it needs an explicit (0,dst) rule to
accept un-tagged packets.

Enforcement happens on the receiving side, because that's the only place where we
know for sure both
 - where the packet is from
 - and its destination (us)

# config and API
The config is a single section-config, `sdn/microseg.cfg`, owned by Rust (the types
live in `proxmox-ve-config`, with a `PVE::RS::SDN::Microseg` binding) the same way the
fabric config is. It holds four object types, keyed by section id:
 - group, a numeric `mark` (auto-assigned if omitted) and an optional `parent`
 - rule, `(src,dst) -> allow`, an absent `src` matches un-tagged traffic
 - assignment, binds a guest NIC (`vmid` plus `iface` index) to a group
 - bridge, marks a bridge-facing interface as an SRv6 carrier (no UI yet)
The API exposes per-type endpoints under
`/cluster/sdn/microseg/{group,rule,assignment,bridge}`, guarded by SDN.Audit for reads
and SDN.Allocate for writes. Rule and assignment ids are derived from their contents
(`p{src_mark}-{dst_mark}`, `vm{vmid}i{iface}`), groups and bridges take a chosen id. On
commit the config is rendered into `.running-config`, which is what the agent reads.
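
The derived ids and the mark auto-assignment could look roughly like the sketch
below. The two id formats are the ones given above; rendering an absent `src`
as mark 0 in the rule id and picking the lowest unused mark are assumptions made
for illustration, as are all function names:

```python
from typing import Iterable, Optional

UNTAGGED_MARK = 0   # mark 0 is reserved for un-tagged traffic
MAX_MARK = 0xFFFF   # only the lower 16 bits of the mark are used


def rule_id(src_mark: Optional[int], dst_mark: int) -> str:
    """Build a rule id of the form p{src_mark}-{dst_mark}."""
    # Assumption: a rule without src is rendered with mark 0.
    src = UNTAGGED_MARK if src_mark is None else src_mark
    return f"p{src}-{dst_mark}"


def assignment_id(vmid: int, iface: int) -> str:
    """Build an assignment id of the form vm{vmid}i{iface}."""
    return f"vm{vmid}i{iface}"


def next_free_mark(used: Iterable[int]) -> int:
    """Assumption: auto-assignment picks the lowest mark not yet in use."""
    taken = set(used)
    for mark in range(1, MAX_MARK + 1):
        if mark not in taken:
            return mark
    raise ValueError("no free mark left")
```

Deriving the ids from the contents means the same rule or assignment cannot be
stored twice under different section ids.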

# skb->mark
Every packet in the kernel lives in a `sk_buff` struct, which has a 32-bit
`mark` field. This field can be read and written while the packet passes
through the kernel. Nothing else in our SDN stack uses it currently, so right
now there is nothing written by a different part of our stack that we could
overwrite. But this is something to keep in mind in case we should end up using
it again in the future. For now microsegmentation only uses the lower 16 bits.
We use them to assign packets to groups, which gives us up to 65535 distinct
groups (mark 0 is reserved for unstamped traffic). The `mark`, however, only
lives as long as the packet does not leave the kernel, and given the nature of
networking, chances are it'll leave it eventually. To transport the group
identity between hosts we have to attach it to the actual network packet that
hits the wire. For this we currently support both
 - VXLAN-GBP, the 16-bit GBP field on the VXLAN encapsulation
 - and SRv6, the 16-bit Tag field on the SRH
Conveniently, the kernel already handles putting the `mark` into the VXLAN-GBP
encapsulation, and also taking it out of there (found out way too late :P). For
SRv6 we need a small eBPF program that handles this.
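
To illustrate the bit layout, here is a small sketch of keeping the group in the
lower 16 bits of the mark and of how that value sits in a VXLAN-GBP header. The
header layout follows the VXLAN Group Policy draft (G and I flags in the first
byte, 16-bit Group Policy ID, 24-bit VNI); the helper names are made up, and
preserving the upper 16 bits of the mark is an assumption about the intended
behaviour, not something taken from the patches:

```python
import struct

GROUP_MASK = 0x0000FFFF  # lower 16 bits of skb->mark carry the group

VXLAN_FLAG_GBP = 0x80    # "G" bit: Group Policy ID is present
VXLAN_FLAG_VNI = 0x08    # "I" bit: VNI is valid


def set_group(mark: int, group: int) -> int:
    """Store group in the lower 16 bits, leaving the upper 16 bits alone."""
    if not 0 <= group <= GROUP_MASK:
        raise ValueError("group does not fit into 16 bits")
    return (mark & ~GROUP_MASK & 0xFFFFFFFF) | group


def get_group(mark: int) -> int:
    """Read the group back out of the mark."""
    return mark & GROUP_MASK


def pack_vxlan_gbp(vni: int, group: int) -> bytes:
    """Build the 8-byte VXLAN header with the GBP extension."""
    word0 = (VXLAN_FLAG_GBP | VXLAN_FLAG_VNI) << 24 | group
    word1 = vni << 8  # the low byte is reserved
    return struct.pack("!II", word0, word1)


def unpack_vxlan_gbp(hdr: bytes) -> tuple:
    """Return (vni, group) from a VXLAN-GBP header."""
    word0, word1 = struct.unpack("!II", hdr[:8])
    if not (word0 >> 24) & VXLAN_FLAG_GBP:
        raise ValueError("GBP flag not set")
    return word1 >> 8, word0 & 0xFFFF
```

Both the mark half and the GBP field are 16 bits wide, so the group survives the
round trip without any translation.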

# eBPF
Both tagging (setting `mark`) and enforcement happen in an eBPF program
directly attached to the guest's interface. The programs themselves read and
write to/from two maps:
 - tap_to_group, so we know what to set `mark` to
 - rules, map of (src,dst)->action, we know src from the `mark` and `dst` is
   the group we are in

Specifically, on ingress we set the mark, and on egress we enforce.
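
As a userspace model of the two hooks (the real programs are C/BPF attached via
tc), the logic can be sketched like this. The map names follow the list above;
keying `tap_to_group` by interface index, letting interfaces without a group
pass everything, and dropping when no rule matches are assumptions for the sake
of the example:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

ALLOW, DENY = 1, 0


@dataclass
class Packet:
    mark: int = 0  # stands in for skb->mark; 0 means un-tagged


class Hooks:
    def __init__(self, tap_to_group: Dict[int, int],
                 rules: Dict[Tuple[int, int], int]) -> None:
        self.tap_to_group = tap_to_group  # ifindex -> group mark
        self.rules = rules                # (src, dst) -> action

    def ingress(self, ifindex: int, pkt: Packet) -> None:
        """Tag: stamp the group of the interface into the lower 16 bits."""
        group = self.tap_to_group.get(ifindex)
        if group is not None:
            pkt.mark = (pkt.mark & 0xFFFF0000) | group

    def egress(self, ifindex: int, pkt: Packet) -> bool:
        """Enforce: return True if the packet may be delivered."""
        dst = self.tap_to_group.get(ifindex)
        if dst is None:
            return True  # assumption: interfaces without a group are not filtered
        src = pkt.mark & 0xFFFF
        # Assumption: a missing entry means drop. Un-tagged traffic arrives
        # with src == 0 and therefore needs an explicit (0, dst) rule.
        return self.rules.get((src, dst), DENY) == ALLOW
```

In this model a packet entering on interface 10 (group 1) and leaving on
interface 11 (group 2) is only delivered if `rules` contains `(1, 2) -> ALLOW`.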

# Implementation

We have an `agent` that reads the running SDN config and applies that state to
the kernel. The binary doing this is stateless; it runs on SDN apply, on tap_plug
of guest interfaces, and on boot after pve-sdn-commit but before guests
start. It keeps track of what is currently loaded in the kernel with
two files under `/run/proxmox-ebpf`: one keeping track of the BPF program that
was compiled into the binary, and one keeping track of changes to the
data structures the BPF program accesses. The distinction is useful because if
only the program changed, we can atomically swap the currently attached program
for the updated one. If the data structures (the maps) changed, there is
no other way than to tear down all the current state, repopulate the maps
and re-attach the programs. Swapping out the maps first would have old programs
interact with a new data schema, and swapping the programs first would lead
to new programs accessing an old data schema. Either way, not good, so we wipe
the state in that case and rebuild it. So, in case we change the structure of
the data our BPF programs access, there will be a few ms during which the
configured segmentation is not enforced. But that is the only scenario where
that can happen.

## aya, aya-ebpf
Aya is a Rust lib that makes working with BPF easier. It has both a
userspace part (`aya`) and a part (`aya-ebpf`) that compiles to BPF. I chose to
only use the userspace part, and use C for the BPF programs themselves. The reasons were:
 - we don't have `aya-ebpf` packaged
 - `aya-ebpf` *requires* a nightly toolchain to compile
 - the BPF programs are very small; in case this should change at some point we
   can reconsider. Also, it doesn't matter what produces the .o file, so this
   could be swapped in really easily if we decided to

So we compile the BPF program written in C to a .o targeting BPF using clang
and include it when compiling the agent binary.

## agent
The agent is not a long-running daemon; the triggers mentioned before should cover
all situations where we'd have to touch the kernel state. When started, the
first thing it does is compare both the program and the schema version it was
compiled with against what the last invocation left in the kernel. If the
program version differs, it swaps out the program currently loaded in the kernel
and pins the new one. If the schema version differs, it does that as
well, but additionally tears down all state and rebuilds it.
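
The start-up check boils down to a small decision function. The sketch below is
only a model: the `Action` names, the version tuples, and treating a missing
record (for example on the first run after boot) as a full rebuild are
assumptions, and it says nothing about how map contents get synced from the
running config afterwards:

```python
from enum import Enum
from typing import Optional, Tuple

Versions = Tuple[str, str]  # (program_version, schema_version)


class Action(Enum):
    KEEP = "keep"                  # attached programs and maps stay as they are
    SWAP_PROGRAM = "swap-program"  # atomically replace the attached programs
    REBUILD = "rebuild"            # tear down everything, then re-create


def decide(compiled: Versions, recorded: Optional[Versions]) -> Action:
    """Compare what the binary was built with against the last recorded state."""
    if recorded is None:
        # Assumption: nothing recorded under /run means nothing usable is loaded.
        return Action.REBUILD
    program, schema = compiled
    old_program, old_schema = recorded
    if schema != old_schema:
        # Old programs must never see new maps (or the other way around).
        return Action.REBUILD
    if program != old_program:
        return Action.SWAP_PROGRAM
    return Action.KEEP
```

The schema check comes first on purpose: a schema change always implies
replacing the programs too, so it subsumes the program-only case.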

The agent also populates the `rules` map, so it creates all the inherited rules
and resolves multiple matches to the closest one as described before.
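
One way to precompute such a flat map is to write rules from least to most
specific and let later writes overwrite earlier ones. This is a sketch of that
idea and not necessarily how the agent does it; the helper names are made up,
group names stand in for marks, rules are assumed to apply to descendants of
both their source and destination group, and un-tagged sources are left out to
keep it short:

```python
from typing import Dict, List, Optional, Tuple

Parent = Dict[str, Optional[str]]
Rules = Dict[Tuple[str, str], bool]


def depth(parent: Parent, group: str) -> int:
    """Number of ancestors above the group."""
    d = 0
    while parent[group] is not None:
        group = parent[group]
        d += 1
    return d


def subtree(parent: Parent, root: str) -> List[str]:
    """The root itself plus all of its descendants."""
    out: List[str] = []
    for g in parent:
        cur: Optional[str] = g
        while cur is not None:
            if cur == root:
                out.append(g)
                break
            cur = parent[cur]
    return out


def flatten(parent: Parent, rules: Rules) -> Rules:
    """Expand inherited rules into one (src, dst) -> allow entry per pair."""
    flat: Rules = {}
    # All candidate rules for one pair sit on the ancestor chains of that
    # pair, so a deeper group is always the closer one. Sorting by
    # (dst depth, src depth) makes the most specific rule the last writer.
    ordered = sorted(rules, key=lambda r: (depth(parent, r[1]), depth(parent, r[0])))
    for src, dst in ordered:
        for s in subtree(parent, src):
            for d in subtree(parent, dst):
                flat[(s, d)] = rules[(src, dst)]
    return flat
```

For the tree from the overview this yields six entries, with `("db", "wol")`
ending up as deny and `("db", "wiki")` as allow.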

## SRv6
This also adds support for SRv6 as a transport. We don't support that yet in
our SDN stack, but it could be useful for basic routing between simple zones on
multiple nodes. A bit like EVPN-lite since it's only L3. I did most of my
testing with SRv6 as transport between kernels, before I found out the kernel
does it for VXLAN automatically. I didn't wire up any UI for it yet, but I
decided to include it in this RFC for now.
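
For reference, the Tag sits in bytes 6-7 of the fixed SRH part as defined in
RFC 8754. The sketch below reads and writes it on a raw header in userspace;
the function names are made up, and the actual BPF program in the series may
well go about this differently:

```python
import struct

SRH_ROUTING_TYPE = 4  # Routing Type value of the Segment Routing Header
SRH_FIXED_LEN = 8     # bytes before the segment list


def srh_get_tag(srh: bytes) -> int:
    """Return the 16-bit Tag field of an SRH."""
    if len(srh) < SRH_FIXED_LEN:
        raise ValueError("truncated SRH")
    # Next Header, Hdr Ext Len, Routing Type, Segments Left,
    # Last Entry, Flags, Tag
    fields = struct.unpack("!BBBBBBH", srh[:SRH_FIXED_LEN])
    if fields[2] != SRH_ROUTING_TYPE:
        raise ValueError("not a segment routing header")
    return fields[6]


def srh_set_tag(srh: bytes, tag: int) -> bytes:
    """Return a copy of the SRH with the Tag field replaced."""
    if len(srh) < SRH_FIXED_LEN:
        raise ValueError("truncated SRH")
    if not 0 <= tag <= 0xFFFF:
        raise ValueError("tag does not fit into 16 bits")
    return srh[:6] + struct.pack("!H", tag) + srh[SRH_FIXED_LEN:]
```

Since the Tag is 16 bits wide, it can carry the lower half of the `mark`
unchanged.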

# enabling VXLAN-GBP
The kernel only moves the mark into the GBP field if the vxlan device was
created with the GBP flag set, and it's create-only: the kernel won't
toggle it on a running device. ifupdown2 had no attribute for this, so
this series includes a small ifupdown2 patch adding `vxlan-gbp`, which
threads the flag through to the netlink/iproute2 create path.

# testing
I have put pre-built packages on sani; these include ifupdown2 with the patch.

# building
1. proxmox-ve-rs (install)
2. proxmox-ebpf
3. pve-cluster (install)
4. proxmox-perl-rs (install)
5. pve-network
6. pve-manager

# notes
I did not send the first commit for `proxmox-ebpf` since it contains a
`vmlinux.h` which is rather large but is needed for compiling. The commit is in
my staff repo.

# feedback wanted
 - the general design, specifically with the agent being one-shot and triggered
   rather than a running service
 - I put the VXLAN-GBP flag into the zones that run on vxlan for now, so it has
   to be enabled on the zone, and will affect all vxlan interfaces that are
   part of it. The only real alternatives are enabling it always, or trying to
   infer if a nic is assigned a vnet that is in a zone that does vxlan and
   cross-referencing that with potentially configured microseg
   groups/assignments. This seemed brittle, and since the needed flag is
   additionally create-only, a simple flag on the zone appeared to be the better
   approach
 - and the ui, which very likely needs some polishing, and I'm very open to
   alternative approaches


proxmox-ebpf:

Hannes Laimer (3):
  agent: add userspace coordinator and stateless policy subsystem
  bpf: add bridge subsystem
  debian: add packaging and boot-time oneshot unit

 Makefile                    |  66 +++++++
 debian/changelog            |   5 +
 debian/control              |  34 ++++
 debian/copyright            |  18 ++
 debian/proxmox-ebpf.install |   1 +
 debian/proxmox-ebpf.postrm  |  11 ++
 debian/proxmox-ebpf.prerm   |  12 ++
 debian/proxmox-ebpf.service |  15 ++
 debian/rules                |  33 ++++
 debian/source/format        |   1 +
 include/mark.h              |  30 +++
 src/agent.rs                | 105 ++++++++++
 src/bridge/bpf/srv6.bpf.c   |  76 +++++++
 src/bridge/mod.rs           |  75 +++++++
 src/main.rs                 |  69 +++++++
 src/policy/bpf/tap.bpf.c    |  66 +++++++
 src/policy/bpf/types.h      |  23 +++
 src/policy/mod.rs           | 268 +++++++++++++++++++++++++
 src/policy/types.rs         |  45 +++++
 src/running_config.rs       |  38 ++++
 src/state.rs                | 303 ++++++++++++++++++++++++++++
 src/subsystem.rs            | 384 ++++++++++++++++++++++++++++++++++++
 src/tc.rs                   | 152 ++++++++++++++
 23 files changed, 1830 insertions(+)
 create mode 100644 Makefile
 create mode 100644 debian/changelog
 create mode 100644 debian/control
 create mode 100644 debian/copyright
 create mode 100644 debian/proxmox-ebpf.install
 create mode 100755 debian/proxmox-ebpf.postrm
 create mode 100755 debian/proxmox-ebpf.prerm
 create mode 100644 debian/proxmox-ebpf.service
 create mode 100755 debian/rules
 create mode 100644 debian/source/format
 create mode 100644 include/mark.h
 create mode 100644 src/agent.rs
 create mode 100644 src/bridge/bpf/srv6.bpf.c
 create mode 100644 src/bridge/mod.rs
 create mode 100644 src/main.rs
 create mode 100644 src/policy/bpf/tap.bpf.c
 create mode 100644 src/policy/bpf/types.h
 create mode 100644 src/policy/mod.rs
 create mode 100644 src/policy/types.rs
 create mode 100644 src/running_config.rs
 create mode 100644 src/state.rs
 create mode 100644 src/subsystem.rs
 create mode 100644 src/tc.rs


proxmox-ve-rs:

Hannes Laimer (1):
  ve-config: sdn: add microseg config types

 proxmox-ve-config/src/sdn/config.rs   |   9 +-
 proxmox-ve-config/src/sdn/microseg.rs | 847 ++++++++++++++++++++++++++
 proxmox-ve-config/src/sdn/mod.rs      |   1 +
 3 files changed, 856 insertions(+), 1 deletion(-)
 create mode 100644 proxmox-ve-config/src/sdn/microseg.rs


proxmox-perl-rs:

Hannes Laimer (1):
  sdn: add microseg config binding

 pve-rs/Makefile                     |   1 +
 pve-rs/src/bindings/sdn/microseg.rs | 172 ++++++++++++++++++++++++++++
 pve-rs/src/bindings/sdn/mod.rs      |   1 +
 3 files changed, 174 insertions(+)
 create mode 100644 pve-rs/src/bindings/sdn/microseg.rs


pve-cluster:

Hannes Laimer (1):
  cfs: add 'sdn/microseg.cfg' to observed files

 src/PVE/Cluster.pm | 1 +
 1 file changed, 1 insertion(+)


pve-network:

Hannes Laimer (4):
  sdn: microseg: add config and API
  sdn: zones: trigger microseg apply on tap_plug
  sdn: zones: add vxlan-gbp option to vxlan and evpn zones
  evpn: disable vxlan-learning on create if GBP is enabled

 src/PVE/API2/Network/SDN.pm                   |  12 +
 src/PVE/API2/Network/SDN/Makefile             |   2 +
 src/PVE/API2/Network/SDN/Microseg.pm          | 126 +++++++
 .../API2/Network/SDN/Microseg/Assignment.pm   | 163 +++++++++
 src/PVE/API2/Network/SDN/Microseg/Bridge.pm   | 171 ++++++++++
 src/PVE/API2/Network/SDN/Microseg/Group.pm    | 171 ++++++++++
 src/PVE/API2/Network/SDN/Microseg/Makefile    |   8 +
 src/PVE/API2/Network/SDN/Microseg/Rule.pm     | 163 +++++++++
 src/PVE/Network/SDN.pm                        |   5 +
 src/PVE/Network/SDN/Makefile                  |   1 +
 src/PVE/Network/SDN/Microseg.pm               | 316 ++++++++++++++++++
 src/PVE/Network/SDN/Zones.pm                  |   6 +
 src/PVE/Network/SDN/Zones/EvpnPlugin.pm       |  11 +
 src/PVE/Network/SDN/Zones/VxlanPlugin.pm      |   9 +
 14 files changed, 1164 insertions(+)
 create mode 100644 src/PVE/API2/Network/SDN/Microseg.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Assignment.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Bridge.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Group.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Makefile
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Rule.pm
 create mode 100644 src/PVE/Network/SDN/Microseg.pm


pve-manager:

Hannes Laimer (3):
  ui: sdn: add microsegmentation
  network: apply microseg state on reload
  ui: sdn: zones: add vxlan-gbp checkbox to vxlan and evpn

 PVE/API2/Network.pm                           |   4 +
 www/manager6/Makefile                         |   9 +
 www/manager6/Utils.js                         |  23 +
 www/manager6/dc/Config.js                     |   8 +
 www/manager6/form/MicrosegGroupSelector.js    |  64 +++
 www/manager6/form/MicrosegGuestNicSelector.js | 107 +++++
 www/manager6/form/MicrosegGuestSelector.js    |  83 ++++
 www/manager6/sdn/MicrosegView.js              | 408 ++++++++++++++++++
 www/manager6/sdn/microseg/AssignmentEdit.js   |  63 +++
 www/manager6/sdn/microseg/Base.js             |  88 ++++
 www/manager6/sdn/microseg/GroupEdit.js        |  61 +++
 www/manager6/sdn/microseg/PolicyView.js       | 221 ++++++++++
 www/manager6/sdn/microseg/RuleEdit.js         |  49 +++
 www/manager6/sdn/zones/EvpnEdit.js            |   8 +
 www/manager6/sdn/zones/VxlanEdit.js           |  11 +
 15 files changed, 1207 insertions(+)
 create mode 100644 www/manager6/form/MicrosegGroupSelector.js
 create mode 100644 www/manager6/form/MicrosegGuestNicSelector.js
 create mode 100644 www/manager6/form/MicrosegGuestSelector.js
 create mode 100644 www/manager6/sdn/MicrosegView.js
 create mode 100644 www/manager6/sdn/microseg/AssignmentEdit.js
 create mode 100644 www/manager6/sdn/microseg/Base.js
 create mode 100644 www/manager6/sdn/microseg/GroupEdit.js
 create mode 100644 www/manager6/sdn/microseg/PolicyView.js
 create mode 100644 www/manager6/sdn/microseg/RuleEdit.js


pve-docs:

Hannes Laimer (2):
  sdn: add microsegmentation section
  sdn: add VXLAN-GBP flag to evpn/vxlan zone sections

 pvesdn.adoc | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)


ifupdown2:

Hannes Laimer (1):
  d/patches: add support for VXLAN-GBP flag

 ...addons-vxlan-add-vxlan-gbp-attribute.patch | 228 ++++++++++++++++++
 debian/patches/series                         |   1 +
 2 files changed, 229 insertions(+)
 create mode 100644 debian/patches/pve/0016-addons-vxlan-add-vxlan-gbp-attribute.patch


Summary over all repositories:
  62 files changed, 5548 insertions(+), 1 deletions(-)

-- 
Generated by murpp 0.11.0



