From: Hannes Laimer
To: pve-devel@lists.proxmox.com
Subject: [RFC cluster/docs/ifupdown2/manager/network/proxmox{-ebpf,-ve-rs,-perl-rs} 00/16] sdn: add microsegmentation support
Date: Tue, 9 Jun 2026 15:25:06 +0200
Message-ID: <20260609132522.235917-1-h.laimer@proxmox.com>

This adds support for microsegmentation using eBPF programs attached to interfaces.
Mostly directly to the tap/veth interfaces of the guests.

# Overview

Each guest interface can be assigned to a group, and policies/rules can then be
defined between these groups. A policy is a mapping `(src,dst) -> allow`.

Each interface can be assigned to only a *single* group. Groups, however, can be
part of other groups, which should cover most use cases I came up with. In case
multiple policies match (possibly because groups/sub-groups have policies that
overlap), the most specific policy wins: the one that names the destination
group closest in the tree, and if two name it equally closely, the one that
names the source closest. So:

    - web                (db,web)->allow
      - wiki             # from `web`: (db,web)->allow
        * CT100,net0
      - wol              (db,wol)->deny
                         # closer than the (db,web)->allow inherited from web,
                         # so deny wins
        * CT101,net0     # => packets from db to CT101,net0 are denied
    - db                 (web,db)->allow
      * CT200,net1       # packets from `web` reaching CT200,net1 will be allowed

Because each group has a single parent, the destination-then-source ordering
always picks exactly one rule, so there is no tie to resolve (a malformed config
with duplicate rules falls back to deny).

If an interface has a group assigned, it needs an explicit (0,dst) rule to
accept un-tagged packets.

Enforcement happens on the receiving side, because that is where we know both
for sure:
 - where the packet is from
 - and its destination (us)

# config and API

The config is a single section-config, `sdn/microseg.cfg`, owned by Rust (the
types live in `proxmox-ve-config`, with a `PVE::RS::SDN::Microseg` binding), the
same way the fabric config is.
It holds four object types, keyed by section id:
 - group, a numeric `mark` (auto-assigned if omitted) and an optional `parent`
 - rule, `(src,dst) -> allow`, an absent `src` matches un-tagged traffic
 - assignment, binds a guest NIC (`vmid` plus `iface` index) to a group
 - bridge, marks a bridge-facing interface as an SRv6 carrier (no UI yet)

The API exposes per-type endpoints under
`/cluster/sdn/microseg/{group,rule,assignment,bridge}`, guarded by SDN.Audit for
reads and SDN.Allocate for writes. Rule and assignment ids are derived from
their contents (`p{src_mark}-{dst_mark}`, `vm{vmid}i{iface}`), while groups and
bridges take a chosen id. On commit the config is rendered into
`.running-config`, which is what the agent reads.

# skb->mark

Every packet in the kernel lives in a `sk_buff` struct, and this struct has a
32-bit field `mark`. This field can be read and written while the packet passes
through the kernel. We don't currently use it anywhere else in our SDN stack, so
right now there is nothing written by a different part of our stack that we
could overwrite. But this is something to keep in mind in case we end up using
the mark elsewhere in the future. For now microsegmentation only uses the lower
16 bits.

We use this to assign packets to groups. With these 16 bits we support up to
65535 distinct groups (mark 0 is reserved for unstamped traffic).

This `mark`, however, only lives as long as the packet does not leave the
kernel, and given the nature of networking, chances are it will leave it
eventually. To transport the group identity between hosts we have to attach it
to the actual network packet that hits the wire. For this we currently support
both
 - VXLAN-GBP, the 16-bit GBP field on the VXLAN encapsulation
 - and SRv6, the 16-bit Tag field on the SRH

Conveniently, the kernel already handles putting the `mark` into the VXLAN-GBP
encapsulation, and also taking it out of there (found out way too late :P). For
SRv6 we need a small eBPF program that handles this.
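
As a rough illustration of the mark layout described above (not code from the
series): the group id sits in the lower 16 bits, and 0 means un-tagged. The
names `GROUP_MASK`, `group_of` and `stamp` are made up for this sketch, and
leaving the upper 16 bits untouched when stamping is an assumption, in line with
the "keep in mind" note about other potential users of the mark.

```rust
/// Lower 16 bits of `skb->mark` carry the group id (hypothetical constant name).
const GROUP_MASK: u32 = 0x0000_ffff;

/// Extract the group from a mark. Group 0 is reserved for un-tagged traffic,
/// so it is reported as `None`.
fn group_of(mark: u32) -> Option<u16> {
    match (mark & GROUP_MASK) as u16 {
        0 => None,
        group => Some(group),
    }
}

/// Write a group into a mark. Assumption: the upper 16 bits are left alone so
/// that another user of the mark would not get clobbered.
fn stamp(mark: u32, group: u16) -> u32 {
    (mark & !GROUP_MASK) | u32::from(group)
}
```

Since both the VXLAN-GBP field and the SRH Tag field are 16 bits wide, a value
returned by `group_of` fits into either transport without truncation.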

# eBPF

Both tagging (setting `mark`) and enforcement happen in an eBPF program directly
attached to the guest's interface. The programs themselves read from and write
to two maps:
 - tap_to_group, so we know what to set `mark` to
 - rules, a map of (src,dst)->action; we know src from the `mark`, and `dst` is
   the group we are in

Specifically, on ingress we set the mark, and on egress we enforce (seen from
the host side of the interface: ingress is traffic coming from the guest, egress
is traffic going to it).

# Implementation

We have an `agent` that reads the running sdn config and applies that state to
the kernel. The binary doing this is stateless. It runs on SDN apply, on
tap_plug of guest interfaces, and on boot after pve-sdn-commit but before guests
start. It keeps track of what is currently loaded in the kernel with two files
under `/run/proxmox-ebpf`: one keeping track of the bpf program that was
compiled into the binary, and one keeping track of changes to the data
structures the bpf program accesses.

The distinction is useful because for a program-only change we can atomically
swap the currently attached program with the updated one. In case the data
structure (the maps) changed, there is no other way than to tear down all the
current state, repopulate the maps and re-attach the programs. Swapping out the
maps first would have old programs interact with a new data schema, and swapping
the programs first would lead to new programs accessing an old data schema.
Either way, not good, so we wipe the state in that case and re-build it.

So, in case we change the structure of the data our bpf programs access, there
will be a short window (a few ms) where the configured segmentation is not
enforced. But that is the only scenario where that can happen.

## aya, aya-ebpf

Aya is a Rust lib that makes working with BPF easier. It has both a userspace
part (`aya`) and a part (`aya-ebpf`) that compiles to bpf. I chose to only use
the userspace part, and use C for the BPF programs themselves.
The reasons were:
 - we don't have `aya-ebpf` packaged
 - `aya-ebpf` *requires* a nightly toolchain to compile
 - the bpf programs are very small; in case this should change at some point we
   can reconsider. Also, it doesn't matter what produces the .o file, so
   `aya-ebpf` could be swapped in really easily if we decided to

So we compile the bpf program written in C to a .o targeting bpf using clang and
include it when compiling the agent binary.

## agent

The agent is not a long-running daemon; the triggers mentioned before should
cover all situations where we'd have to touch the kernel state. When started,
the first thing it does is check both the program and the schema version, so
what it was compiled with and what the last invocation did to the kernel. If
the program version differs, it swaps out the program currently loaded in the
kernel and pins the new one. If the schema version differs, it does that as
well, but additionally tears down all state and rebuilds it.

The agent also populates the `rules` map, so it creates all the inherited rules
and resolves multiple matches to the closest one as described before.

## SRv6

This also adds support for SRv6 as a transport. We don't support that yet in our
SDN stack, but it could be useful for basic routing between simple zones on
multiple nodes. A bit like EVPN-lite, since it's only L3. I did most of my
testing with SRv6 as transport between hosts, before I found out the kernel does
it for VXLAN automatically. I didn't wire up any UI for it yet, but I decided to
include it in this RFC for now.

# enabling VXLAN-GBP

The kernel only moves the mark into the GBP field if the vxlan device was
created with the GBP flag set, and it's create-only: the kernel won't toggle it
on a running device. ifupdown2 had no attribute for this, so this series
includes a small ifupdown2 patch adding `vxlan-gbp`, which threads the flag
through to the netlink/iproute2 create path.
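
Coming back to the `rules` map the agent populates: the "closest destination
first, then closest source" resolution from the overview could be sketched
roughly like this. This is only an illustration under assumptions, not the code
from the series: the `Policy` type, the numeric marks and the deny default for
pairs without any matching rule are made up here.

```rust
use std::collections::HashMap;

type Mark = u16;

// Hypothetical marks for the groups from the overview example.
const WEB: Mark = 1;
const WIKI: Mark = 2;
const WOL: Mark = 3;
const DB: Mark = 4;

struct Policy {
    /// child group -> parent group (each group has at most one parent)
    parent: HashMap<Mark, Mark>,
    /// (src, dst) -> allow, src 0 stands for un-tagged traffic
    rules: HashMap<(Mark, Mark), bool>,
}

impl Policy {
    /// Yields `group`, then its parent, grandparent, and so on.
    /// Assumes the parent relation has no cycles.
    fn chain(&self, group: Mark) -> impl Iterator<Item = Mark> + '_ {
        std::iter::successors(Some(group), move |g| self.parent.get(g).copied())
    }

    /// Closest destination wins; among rules naming the same destination,
    /// the closest source wins. No matching rule means deny (assumption).
    fn allowed(&self, src: Mark, dst: Mark) -> bool {
        for d in self.chain(dst) {
            for s in self.chain(src) {
                if let Some(&allow) = self.rules.get(&(s, d)) {
                    return allow;
                }
            }
        }
        false
    }
}

/// The example tree from the overview: wiki and wol are sub-groups of web.
fn example() -> Policy {
    Policy {
        parent: HashMap::from([(WIKI, WEB), (WOL, WEB)]),
        rules: HashMap::from([((DB, WEB), true), ((DB, WOL), false), ((WEB, DB), true)]),
    }
}
```

With `example()`, `allowed(DB, WIKI)` follows the inherited (db,web)->allow,
while `allowed(DB, WOL)` hits the closer (db,wol)->deny. The entries written to
the `rules` map can be thought of as the result of such a lookup for each
(src,dst) pair, so the BPF side only needs a single map lookup.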

# testing

I have put pre-built packages on sani; these include ifupdown2 with the patch.

# building

 1. proxmox-ve-rs (install)
 2. proxmox-ebpf
 3. pve-cluster (install)
 4. proxmox-perl-rs (install)
 5. pve-network
 6. pve-manager

# notes

I did not send the first commit for `proxmox-ebpf` since it contains a
`vmlinux.h`, which is rather large but is needed for compiling. The commit is in
my staff repo.

# feedback wanted

 - the general design, specifically with the agent being one-shot and triggered
   rather than a running service
 - I put the VXLAN-GBP flag into the zones that run on vxlan for now, so it has
   to be enabled on the zone, and will affect all vxlan interfaces that are part
   of it. The only real alternatives are enabling it always, or trying to infer
   whether a NIC is assigned a vnet that is in a zone that does vxlan and
   cross-referencing that with potentially configured microseg
   groups/assignments. This seemed brittle, and since the needed flag is
   additionally create-only, a simple flag on the zone appeared to be the better
   approach
 - and the UI, which very likely needs some polishing; I'm very open to
   alternative approaches


proxmox-ebpf:

Hannes Laimer (3):
  agent: add userspace coordinator and stateless policy subsystem
  bpf: add bridge subsystem
  debian: add packaging and boot-time oneshot unit

 Makefile                    |  66 +++++++
 debian/changelog            |   5 +
 debian/control              |  34 ++++
 debian/copyright            |  18 ++
 debian/proxmox-ebpf.install |   1 +
 debian/proxmox-ebpf.postrm  |  11 ++
 debian/proxmox-ebpf.prerm   |  12 ++
 debian/proxmox-ebpf.service |  15 ++
 debian/rules                |  33 ++++
 debian/source/format        |   1 +
 include/mark.h              |  30 +++
 src/agent.rs                | 105 ++++++++++
 src/bridge/bpf/srv6.bpf.c   |  76 +++++++
 src/bridge/mod.rs           |  75 +++++++
 src/main.rs                 |  69 +++++++
 src/policy/bpf/tap.bpf.c    |  66 +++++++
 src/policy/bpf/types.h      |  23 +++
 src/policy/mod.rs           | 268 +++++++++++++++++++++++++
 src/policy/types.rs         |  45 +++++
 src/running_config.rs       |  38 ++++
 src/state.rs                | 303 ++++++++++++++++++++++++++++
 src/subsystem.rs            | 384 ++++++++++++++++++++++++++++++++++++
 src/tc.rs                   | 152 ++++++++++++++
 23 files changed, 1830 insertions(+)
 create mode 100644 Makefile
 create mode 100644 debian/changelog
 create mode 100644 debian/control
 create mode 100644 debian/copyright
 create mode 100644 debian/proxmox-ebpf.install
 create mode 100755 debian/proxmox-ebpf.postrm
 create mode 100755 debian/proxmox-ebpf.prerm
 create mode 100644 debian/proxmox-ebpf.service
 create mode 100755 debian/rules
 create mode 100644 debian/source/format
 create mode 100644 include/mark.h
 create mode 100644 src/agent.rs
 create mode 100644 src/bridge/bpf/srv6.bpf.c
 create mode 100644 src/bridge/mod.rs
 create mode 100644 src/main.rs
 create mode 100644 src/policy/bpf/tap.bpf.c
 create mode 100644 src/policy/bpf/types.h
 create mode 100644 src/policy/mod.rs
 create mode 100644 src/policy/types.rs
 create mode 100644 src/running_config.rs
 create mode 100644 src/state.rs
 create mode 100644 src/subsystem.rs
 create mode 100644 src/tc.rs


proxmox-ve-rs:

Hannes Laimer (1):
  ve-config: sdn: add microseg config types

 proxmox-ve-config/src/sdn/config.rs   |   9 +-
 proxmox-ve-config/src/sdn/microseg.rs | 847 ++++++++++++++++++++++++++
 proxmox-ve-config/src/sdn/mod.rs      |   1 +
 3 files changed, 856 insertions(+), 1 deletion(-)
 create mode 100644 proxmox-ve-config/src/sdn/microseg.rs


proxmox-perl-rs:

Hannes Laimer (1):
  sdn: add microseg config binding

 pve-rs/Makefile                     |   1 +
 pve-rs/src/bindings/sdn/microseg.rs | 172 ++++++++++++++++++++++++++++
 pve-rs/src/bindings/sdn/mod.rs      |   1 +
 3 files changed, 174 insertions(+)
 create mode 100644 pve-rs/src/bindings/sdn/microseg.rs


pve-cluster:

Hannes Laimer (1):
  cfs: add 'sdn/microseg.cfg' to observed files

 src/PVE/Cluster.pm | 1 +
 1 file changed, 1 insertion(+)


pve-network:

Hannes Laimer (4):
  sdn: microseg: add config and API
  sdn: zones: trigger microseg apply on tap_plug
  sdn: zones: add vxlan-gbp option to vxlan and evpn zones
  evpn: disable vxlan-learning on create if GBP is enabled

 src/PVE/API2/Network/SDN.pm                   |  12 +
 src/PVE/API2/Network/SDN/Makefile             |   2 +
 src/PVE/API2/Network/SDN/Microseg.pm          | 126 +++++++
 .../API2/Network/SDN/Microseg/Assignment.pm   | 163 +++++++++
 src/PVE/API2/Network/SDN/Microseg/Bridge.pm   | 171 ++++++++++
 src/PVE/API2/Network/SDN/Microseg/Group.pm    | 171 ++++++++++
 src/PVE/API2/Network/SDN/Microseg/Makefile    |   8 +
 src/PVE/API2/Network/SDN/Microseg/Rule.pm     | 163 +++++++++
 src/PVE/Network/SDN.pm                        |   5 +
 src/PVE/Network/SDN/Makefile                  |   1 +
 src/PVE/Network/SDN/Microseg.pm               | 316 ++++++++++++++++++
 src/PVE/Network/SDN/Zones.pm                  |   6 +
 src/PVE/Network/SDN/Zones/EvpnPlugin.pm       |  11 +
 src/PVE/Network/SDN/Zones/VxlanPlugin.pm      |   9 +
 14 files changed, 1164 insertions(+)
 create mode 100644 src/PVE/API2/Network/SDN/Microseg.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Assignment.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Bridge.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Group.pm
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Makefile
 create mode 100644 src/PVE/API2/Network/SDN/Microseg/Rule.pm
 create mode 100644 src/PVE/Network/SDN/Microseg.pm


pve-manager:

Hannes Laimer (3):
  ui: sdn: add microsegmentation
  network: apply microseg state on reload
  ui: sdn: zones: add vxlan-gbp checkbox to vxlan and evpn

 PVE/API2/Network.pm                           |   4 +
 www/manager6/Makefile                         |   9 +
 www/manager6/Utils.js                         |  23 +
 www/manager6/dc/Config.js                     |   8 +
 www/manager6/form/MicrosegGroupSelector.js    |  64 +++
 www/manager6/form/MicrosegGuestNicSelector.js | 107 +++++
 www/manager6/form/MicrosegGuestSelector.js    |  83 ++++
 www/manager6/sdn/MicrosegView.js              | 408 ++++++++++++++++++
 www/manager6/sdn/microseg/AssignmentEdit.js   |  63 +++
 www/manager6/sdn/microseg/Base.js             |  88 ++++
 www/manager6/sdn/microseg/GroupEdit.js        |  61 +++
 www/manager6/sdn/microseg/PolicyView.js       | 221 ++++++++++
 www/manager6/sdn/microseg/RuleEdit.js         |  49 +++
 www/manager6/sdn/zones/EvpnEdit.js            |   8 +
 www/manager6/sdn/zones/VxlanEdit.js           |  11 +
 15 files changed, 1207 insertions(+)
 create mode 100644 www/manager6/form/MicrosegGroupSelector.js
 create mode 100644 www/manager6/form/MicrosegGuestNicSelector.js
 create mode 100644 www/manager6/form/MicrosegGuestSelector.js
 create mode 100644 www/manager6/sdn/MicrosegView.js
 create mode 100644 www/manager6/sdn/microseg/AssignmentEdit.js
 create mode 100644 www/manager6/sdn/microseg/Base.js
 create mode 100644 www/manager6/sdn/microseg/GroupEdit.js
 create mode 100644 www/manager6/sdn/microseg/PolicyView.js
 create mode 100644 www/manager6/sdn/microseg/RuleEdit.js


pve-docs:

Hannes Laimer (2):
  sdn: add microsegmentation section
  sdn: add VXLAN-GBP flag to evpn/vxlan zone sections

 pvesdn.adoc | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)


ifupdown2:

Hannes Laimer (1):
  d/patches: add support for VXLAN-GBP flag

 ...addons-vxlan-add-vxlan-gbp-attribute.patch | 228 ++++++++++++++++++
 debian/patches/series                         |   1 +
 2 files changed, 229 insertions(+)
 create mode 100644 debian/patches/pve/0016-addons-vxlan-add-vxlan-gbp-attribute.patch


Summary over all repositories:
  62 files changed, 5548 insertions(+), 1 deletions(-)

-- 
Generated by murpp 0.11.0