[PATCH proxmox-ebpf 01/16] agent: add userspace coordinator and stateless policy subsystem

public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed

From: Hannes Laimer <h.laimer@proxmox.com>
To: pve-devel@lists.proxmox.com
Subject: [PATCH proxmox-ebpf 01/16] agent: add userspace coordinator and stateless policy subsystem
Date: Tue,  9 Jun 2026 15:25:07 +0200	[thread overview]
Message-ID: <20260609132522.235917-2-h.laimer@proxmox.com> (raw)
In-Reply-To: <20260609132522.235917-1-h.laimer@proxmox.com>

The userspace agent reads the desired state from the SDN running-config and
runs one-shot per event
 - at boot: full pass
 - at sdn apply: full pass
 - at tap_plug: only additive, for the interface in question

It updates the pinned kernel state (the BPF maps and the attached
programs) to match. A full pass covers the whole host. It writes the
per-interface group mappings and the policy rules, and attaches or
detaches programs so the right interfaces are covered. The tap_plug path
is narrow. It sets the group mapping for the one new interface and
attaches its program, leaving the policy rules and other interfaces
untouched.

The programs are loaded and verified once, then pinned in bpffs and
reused on later runs. A run reloads only when the embedded BPF object or
its map schema changed, so the common path attaches links and syncs maps
against what is already in the kernel without going through the verifier.

The agent manages multiple BPF programs that do not share state, here
called `subsystems`. The first, `policy`, filters guest traffic by group
with a stateless BPF tap program backed by pinned maps.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
---
 include/mark.h           |  30 +++
 src/agent.rs             |  85 +++++++++
 src/main.rs              |  68 +++++++
 src/policy/bpf/tap.bpf.c |  66 +++++++
 src/policy/bpf/types.h   |  23 +++
 src/policy/mod.rs        | 268 +++++++++++++++++++++++++++
 src/policy/types.rs      |  45 +++++
 src/running_config.rs    |  38 ++++
 src/state.rs             | 286 +++++++++++++++++++++++++++++
 src/subsystem.rs         | 383 +++++++++++++++++++++++++++++++++++++++
 src/tc.rs                | 151 +++++++++++++++
 11 files changed, 1443 insertions(+)
 create mode 100644 include/mark.h
 create mode 100644 src/agent.rs
 create mode 100644 src/main.rs
 create mode 100644 src/policy/bpf/tap.bpf.c
 create mode 100644 src/policy/bpf/types.h
 create mode 100644 src/policy/mod.rs
 create mode 100644 src/policy/types.rs
 create mode 100644 src/running_config.rs
 create mode 100644 src/state.rs
 create mode 100644 src/subsystem.rs
 create mode 100644 src/tc.rs

diff --git a/include/mark.h b/include/mark.h
new file mode 100644
index 0000000..e3c0577
--- /dev/null
+++ b/include/mark.h
@@ -0,0 +1,30 @@
+#ifndef PROXMOX_EBPF_MARK_H
+#define PROXMOX_EBPF_MARK_H
+
+// Include after vmlinux.h and <bpf/bpf_helpers.h>, which provide struct
+// __sk_buff, the __uNN types, __always_inline, and barrier_var().
+//
+// skb->mark is shared with the rest of the host (firewall fwmark, routing,
+// ...). Microseg owns only the low 16 bits, which hold the group id and are
+// also what travels on the wire. The helpers below touch only that range, so
+// any bits other subsystems set are preserved.
+#define MICROSEG_MARK_MASK 0xffffu
+
+// Replace microseg's bits in skb->mark with the group, leaving the rest untouched.
+static __always_inline void microseg_mark_set(struct __sk_buff *skb, __u16 group) {
+    __u32 mark = skb->mark;
+    mark = (mark & ~MICROSEG_MARK_MASK) | ((__u32)group & MICROSEG_MARK_MASK);
+    // skb->mark is BPF context, which the verifier only allows at its full
+    // 32-bit width. Without the barrier clang narrows the write to a 16-bit
+    // store of the low half, a valid little-endian read-modify-write that the
+    // verifier then rejects as an invalid context access.
+    barrier_var(mark);
+    skb->mark = mark;
+}
+
+// The group currently stamped in skb->mark.
+static __always_inline __u16 microseg_mark_get(struct __sk_buff *skb) {
+    return (__u16)(skb->mark & MICROSEG_MARK_MASK);
+}
+
+#endif
diff --git a/src/agent.rs b/src/agent.rs
new file mode 100644
index 0000000..ae8eb66
--- /dev/null
+++ b/src/agent.rs
@@ -0,0 +1,85 @@
+//! Agent orchestrator. Builds the per-host [`DesiredState`] from the SDN running-config and applies
+//! it through the [`policy`](crate::policy) subsystem. A full pass ([`apply`](Agent::apply)) covers
+//! the whole host, the tap_plug path
+//! ([`apply_guest_iface_policy`](Agent::apply_guest_iface_policy)) programs a single interface.
+//!
+//! The binary runs one-shot per event (boot, an SDN apply, a tap_plug), not as a resident daemon.
+//! Every invocation first makes sure the programs are loaded (installing them on the first run or a
+//! version change), then applies. Maps and links are pinned in bpffs, so they persist across
+//! invocations and re-running never interrupts traffic.
+
+use anyhow::Context;
+
+use crate::{policy::PolicySubsystem, running_config, state::DesiredState};
+
+pub struct Agent {
+    policy: PolicySubsystem,
+}
+
+impl Agent {
+    pub fn new() -> Self {
+        Self {
+            policy: PolicySubsystem::new(),
+        }
+    }
+
+    /// Full pass over the host.
+    pub fn apply(&mut self) -> anyhow::Result<()> {
+        let Some(state) = build_state()? else {
+            return Ok(());
+        };
+        log::debug!(
+            "applying: {} groups, {} rules, {} assignments",
+            state.groups.len(),
+            state.rules.len(),
+            state.assignments.len(),
+        );
+        if log::log_enabled!(log::Level::Trace) {
+            for (id, g) in &state.groups {
+                log::trace!("group {id} = '{}'", g.name);
+            }
+        }
+        if let Err(e) = self.policy.apply(&state) {
+            log::error!("policy: apply: {e:#}");
+        }
+        Ok(())
+    }
+
+    /// Fast path for a guest NIC that just appeared (a tap_plug). Programs just that interface.
+    pub fn apply_guest_iface_policy(&mut self, iface: &str) -> anyhow::Result<()> {
+        let state = match build_state() {
+            Ok(Some(state)) => state,
+            Ok(None) => return Ok(()),
+            Err(e) => {
+                log::error!("apply {iface}: load running config: {e:#}");
+                return Ok(());
+            }
+        };
+        self.policy.apply_guest_iface(&state, iface)
+    }
+
+    /// Detach everything and drop all pinned/run state, for package removal.
+    pub fn clear(&self) -> anyhow::Result<()> {
+        self.policy.clear()?;
+        Ok(())
+    }
+}
+
+fn build_state() -> anyhow::Result<Option<DesiredState>> {
+    let microseg = running_config::load_microseg()?;
+    let hostname = read_hostname()?;
+    match DesiredState::build(&microseg, &hostname) {
+        Ok(s) => Ok(Some(s)),
+        Err(e) => {
+            log::error!("desired state: {e:#}");
+            Ok(None)
+        }
+    }
+}
+
+fn read_hostname() -> anyhow::Result<String> {
+    let hostname = nix::unistd::gethostname().context("gethostname")?;
+    hostname
+        .into_string()
+        .map_err(|os| anyhow::anyhow!("non-utf8 hostname: {os:?}"))
+}
diff --git a/src/main.rs b/src/main.rs
new file mode 100644
index 0000000..6b3c16c
--- /dev/null
+++ b/src/main.rs
@@ -0,0 +1,68 @@
+mod agent;
+mod policy;
+mod running_config;
+mod state;
+mod subsystem;
+mod tc;
+
+use agent::Agent;
+use anyhow::{anyhow, bail};
+use proxmox_log::{LevelFilter, Logger};
+
+const HELP: &str = "\
+Usage: proxmox-ebpf <command> [<interface>]
+
+Commands:
+  apply [<iface>]   apply the SDN running-config to BPF state. With <iface>,
+                    only that interface.
+  clear             detach all programs and drop pinned state (used on removal)
+  help              show this help
+";
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Command {
+    Apply,
+    Clear,
+    Help,
+}
+
+impl std::str::FromStr for Command {
+    type Err = anyhow::Error;
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        Ok(match s {
+            "apply" => Command::Apply,
+            "clear" => Command::Clear,
+            "help" => Command::Help,
+            other => bail!("{other:?} is not a valid command"),
+        })
+    }
+}
+
+fn init_logger() -> anyhow::Result<()> {
+    Logger::from_env("PROXMOX_EBPF_LOG", LevelFilter::INFO)
+        .stderr_pve()
+        .init()?;
+    Ok(())
+}
+
+fn main() -> anyhow::Result<()> {
+    let mut args = pico_args::Arguments::from_env();
+    let cmd: Command = args
+        .subcommand()?
+        .ok_or_else(|| anyhow!("no command specified\n\n{HELP}"))?
+        .parse()?;
+
+    init_logger()?;
+
+    match cmd {
+        Command::Help => {
+            println!("{HELP}");
+            Ok(())
+        }
+        Command::Apply => match args.opt_free_from_str::<String>()? {
+            None => Agent::new().apply(),
+            Some(iface) => Agent::new().apply_guest_iface_policy(&iface),
+        },
+        Command::Clear => Agent::new().clear(),
+    }
+}
diff --git a/src/policy/bpf/tap.bpf.c b/src/policy/bpf/tap.bpf.c
new file mode 100644
index 0000000..b9a1b51
--- /dev/null
+++ b/src/policy/bpf/tap.bpf.c
@@ -0,0 +1,66 @@
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include "mark.h"
+#include "types.h"
+#include "bpf_debug.h"
+
+char LICENSE[] SEC("license") = "GPL";
+
+#define TC_ACT_OK 0
+#define TC_ACT_SHOT 2
+
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, __u32); // ifindex
+    __type(value, struct guest_group);
+    __uint(max_entries, 65560); // max # of tap interfaces, ~6 MB
+    __uint(pinning, LIBBPF_PIN_BY_NAME);
+} tap_to_group SEC(".maps");
+
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __type(key, struct rule_key);
+    __type(value, struct rule_value);
+    __uint(max_entries, 1048576);
+    __uint(pinning, LIBBPF_PIN_BY_NAME);
+} rules SEC(".maps");
+
+// Place group id in lower 16 bits of sk(i)b(idi)->mark.
+SEC("classifier")
+int tc_tap_ingress(struct __sk_buff *skb) {
+    __u32 ifidx = skb->ifindex;
+    struct guest_group *g = bpf_map_lookup_elem(&tap_to_group, &ifidx);
+    if (!g) return TC_ACT_OK;
+    DBG("mark: src_grp=%u ifidx=%u", g->group, ifidx);
+    microseg_mark_set(skb, g->group);
+    return TC_ACT_OK;
+}
+
+
+// Enforce on egress of the destination tap, where both ends are known, dst
+// from this tap and src from skb->mark.
+//
+//  - look up (src_group, dst_group) in `rules`, if no entry default-deny
+//  - for unstamped packets (src_group=0) to be allowed a explicity
+//    (0,dst_group)->1 is required 
+SEC("classifier")
+int tc_tap_egress(struct __sk_buff *skb) {
+    __u32 ifidx = skb->ifindex;
+    __u16 src_group = microseg_mark_get(skb);
+
+    struct guest_group *g = bpf_map_lookup_elem(&tap_to_group, &ifidx);
+    if (!g) return TC_ACT_OK;
+
+    struct rule_key k = { .src_group = src_group, .dst_group = g->group };
+    struct rule_value *r = bpf_map_lookup_elem(&rules, &k);
+    if (!r) {
+        DBG("deny (no rule): src_grp=%u dst_grp=%u", src_group, g->group);
+        return TC_ACT_SHOT;
+    }
+    if (!r->allow) {
+        DBG("deny (explicit): src_grp=%u dst_grp=%u", src_group, g->group);
+        return TC_ACT_SHOT;
+    }
+    return TC_ACT_OK;
+}
+
diff --git a/src/policy/bpf/types.h b/src/policy/bpf/types.h
new file mode 100644
index 0000000..67a86a9
--- /dev/null
+++ b/src/policy/bpf/types.h
@@ -0,0 +1,23 @@
+#ifndef PROXMOX_EBPF_POLICY_TYPES_H
+#define PROXMOX_EBPF_POLICY_TYPES_H
+
+// Keep in sync with ../types.rs.
+// __u8/__u16/__u32 are provided by vmlinux.h
+
+struct guest_group {
+    __u16 group;
+    __u16 _pad;
+};
+
+struct rule_key {
+    __u16 src_group;
+    __u16 dst_group;
+};
+
+struct rule_value {
+    __u8 allow;
+    __u8 _pad[3];
+};
+
+#endif
+
diff --git a/src/policy/mod.rs b/src/policy/mod.rs
new file mode 100644
index 0000000..9b3eae8
--- /dev/null
+++ b/src/policy/mod.rs
@@ -0,0 +1,268 @@
+//! Tap-side enforcement. Attaches BPF programs to per-VM tap/veth interfaces that (a) stamp
+//! `skb->mark` with the source group on ingress, and (b) drop packets on egress if no `(src_group,
+//! dst_group) -> allow` rule exists for the pair. Driven by the [`DesiredState`] derived from
+//! `/etc/pve/sdn/.running-config`, resolved per host via `veth{vmid}i{iface}` /
+//! `tap{vmid}i{iface}`.
+//!
+//! Two maps hold the state, `tap_to_group` (ifindex -> group) and `rules` ((src, dst) -> allow).
+
+mod types;
+
+use std::collections::{HashMap, HashSet};
+
+use anyhow::Context;
+use aya::include_bytes_aligned;
+
+use self::types::*;
+use crate::state::{DesiredState, ResolvedAssignment, ResolvedRule};
+use crate::subsystem::TcPrograms;
+use crate::tc::Direction;
+
+const NAME: &str = "policy";
+
+const POLICY_OBJ: &[u8] = include_bytes_aligned!(concat!(env!("OUT_DIR"), "/tap.bpf.o"));
+const POLICY_FINGERPRINT: u64 = TcPrograms::obj_fingerprint(POLICY_OBJ);
+
+// BUMP THIS when a semantically-incompatible change to a map definition in tap.bpf.c is made
+const SCHEMA_VERSION: u32 = 1;
+
+fn program_name(dir: Direction) -> &'static str {
+    match dir {
+        Direction::Ingress => "tc_tap_ingress",
+        Direction::Egress => "tc_tap_egress",
+    }
+}
+
+pub struct PolicySubsystem {
+    programs: TcPrograms,
+}
+
+impl PolicySubsystem {
+    pub fn new() -> Self {
+        Self {
+            programs: TcPrograms::new(
+                NAME,
+                POLICY_OBJ,
+                POLICY_FINGERPRINT,
+                program_name,
+                Some(SCHEMA_VERSION),
+            ),
+        }
+    }
+
+    /// Full pass over the host (boot, after an SDN apply). Holds the apply lock exclusively, so its
+    /// enumerate/detach/attach cannot interleave with a concurrent single-interface apply.
+    pub fn apply(&mut self, state: &DesiredState) -> anyhow::Result<()> {
+        let _lock = self.programs.lock_exclusive()?;
+        self.programs.ensure_loaded()?;
+        self.apply_full(state)
+    }
+
+    /// Detach all tap programs and drop pinned/run state, for package removal.
+    pub fn clear(&self) -> anyhow::Result<()> {
+        self.programs.clear()
+    }
+
+    /// Program a single guest NIC that just appeared (a tap_plug).
+    ///
+    /// An unassigned NIC has nothing to enforce and returns `Ok`. For an assigned NIC a failure to
+    /// install enforcement propagates, so the agent exits non-zero and the plug does not bring the
+    /// NIC up unenforced.
+    pub fn apply_guest_iface(&mut self, state: &DesiredState, iface: &str) -> anyhow::Result<()> {
+        let Some((vmid, idx)) = parse_tap_name(iface) else {
+            log::warn!("apply_guest_iface: '{iface}' is not a guest NIC name, skipping");
+            return Ok(());
+        };
+
+        let Some(group) = state
+            .assignments
+            .iter()
+            .find(|a| a.vmid == vmid && a.iface == idx)
+            .map(|a| a.group)
+        else {
+            log::debug!(
+                "apply_guest_iface: {iface} (vmid={vmid} iface={idx}) is unassigned, nothing to attach"
+            );
+            return Ok(());
+        };
+
+        // If the programs are already loaded, take the lock shared so concurrent guest plugs don't
+        // serialize. A needed (re)install rebuilds shared state, so take it exclusively and do a
+        // full apply, which covers this NIC too.
+        if self.programs.is_current() {
+            let _lock = self.programs.lock_shared()?;
+            self.apply_one(iface, group)
+        } else {
+            let _lock = self.programs.lock_exclusive()?;
+            self.programs.ensure_loaded()?;
+            self.apply_full(state)
+        }
+        .with_context(|| format!("enforcement for assigned NIC {iface}"))
+    }
+
+    fn apply_full(&mut self, state: &DesiredState) -> anyhow::Result<()> {
+        log::trace!("policy full apply");
+        let assignments = resolve_local_assignments(&state.assignments);
+        self.sync_tap_to_group(&assignments)?;
+        let desired: HashSet<u32> = assignments.keys().copied().collect();
+        self.programs.reconcile(&desired)?;
+        self.sync_rules(&state.rules)?;
+        Ok(())
+    }
+
+    /// Additive single-interface programming. Sets this interface's `tap_to_group` entry and
+    /// attaches its links, propagating failure so the plug does not bring the NIC up unenforced.
+    /// The global `rules` map and the other interfaces are not touched.
+    fn apply_one(&mut self, iface: &str, group: u16) -> anyhow::Result<()> {
+        let Ok(ifindex) = nix::net::if_::if_nametoindex(iface) else {
+            log::info!("apply_guest_iface: {iface} is gone, nothing to do");
+            return Ok(());
+        };
+
+        {
+            let mut tap_to_group = self.programs.hash_map::<u32, GuestGroup>("tap_to_group")?;
+            tap_to_group.insert(ifindex, GuestGroup { group, _pad: 0 }, 0)?;
+        }
+        log::info!("apply_guest_iface: {iface} (ifindex {ifindex}) -> group {group}");
+
+        self.programs.attach_iface(ifindex)?;
+        Ok(())
+    }
+
+    fn sync_tap_to_group(&mut self, desired: &HashMap<u32, u16>) -> anyhow::Result<()> {
+        let mut tap_to_group = self.programs.hash_map::<u32, GuestGroup>("tap_to_group")?;
+        let live: HashMap<u32, u16> = tap_to_group
+            .iter()
+            .filter_map(|r| r.ok().map(|(k, v)| (k, v.group)))
+            .collect();
+        let mut written = 0usize;
+        for (&ifidx, &group) in desired {
+            if live.get(&ifidx) != Some(&group) {
+                tap_to_group.insert(ifidx, GuestGroup { group, _pad: 0 }, 0)?;
+                written += 1;
+            }
+        }
+        let mut removed = 0usize;
+        for &ifidx in live.keys().filter(|i| !desired.contains_key(i)) {
+            log::debug!("drop: group mapping for {ifidx}");
+            let _ = tap_to_group.remove(&ifidx);
+            removed += 1;
+        }
+        if written + removed > 0 {
+            log::info!("tap_to_group: {written} written, {removed} removed");
+        }
+        Ok(())
+    }
+
+    fn sync_rules(&mut self, rules: &[ResolvedRule]) -> anyhow::Result<()> {
+        let mut rules_map = self.programs.hash_map::<RuleKey, RuleValue>("rules")?;
+        let live: HashMap<RuleKey, u8> = rules_map
+            .iter()
+            .filter_map(|r| r.ok().map(|(k, v)| (k, v.allow)))
+            .collect();
+        let mut desired_keys = HashSet::new();
+        let mut written = 0usize;
+        for r in rules {
+            let k = RuleKey {
+                src_group: r.src_group,
+                dst_group: r.dst_group,
+            };
+            desired_keys.insert(k);
+            let want_allow = r.allow as u8;
+            if live.get(&k) != Some(&want_allow) {
+                rules_map.insert(
+                    k,
+                    RuleValue {
+                        allow: want_allow,
+                        _pad: [0; 3],
+                    },
+                    0,
+                )?;
+                written += 1;
+                log::trace!(
+                    "rule: write src={} dst={} allow={}",
+                    k.src_group,
+                    k.dst_group,
+                    r.allow
+                );
+            }
+        }
+        let stale: Vec<_> = live
+            .keys()
+            .filter(|k| !desired_keys.contains(k))
+            .copied()
+            .collect();
+        for k in &stale {
+            let _ = rules_map.remove(k);
+        }
+        if written + stale.len() > 0 {
+            log::info!("rules: {written} written, {} removed", stale.len());
+        } else {
+            log::debug!("rules: {} entries, no changes", desired_keys.len());
+        }
+        Ok(())
+    }
+}
+
+/// Parse a guest NIC name (`tap<vmid>i<idx>` or `veth<vmid>i<idx>`) into its (vmid, iface) pair.
+fn parse_tap_name(name: &str) -> Option<(u32, u32)> {
+    let rest = name
+        .strip_prefix("tap")
+        .or_else(|| name.strip_prefix("veth"))?;
+    let (vmid, idx) = rest.split_once('i')?;
+    Some((vmid.parse().ok()?, idx.parse().ok()?))
+}
+
+fn resolve_local_assignments(assignments: &[ResolvedAssignment]) -> HashMap<u32, u16> {
+    let by_name: HashMap<String, u32> = match nix::net::if_::if_nameindex() {
+        Ok(ifaces) => ifaces
+            .iter()
+            .filter_map(|i| i.name().to_str().ok().map(|n| (n.to_owned(), i.index())))
+            .collect(),
+        Err(e) => {
+            log::error!("if_nameindex: {e:#}");
+            return HashMap::new();
+        }
+    };
+
+    let mut out = HashMap::new();
+    for a in assignments {
+        let veth = format!("veth{}i{}", a.vmid, a.iface);
+        let tap = format!("tap{}i{}", a.vmid, a.iface);
+        if let Some(&ifidx) = by_name.get(&veth).or_else(|| by_name.get(&tap)) {
+            log::debug!(
+                "resolved vmid={} iface={} -> ifindex {ifidx} group {}",
+                a.vmid,
+                a.iface,
+                a.group,
+            );
+            out.insert(ifidx, a.group);
+        } else {
+            log::trace!("skipping vmid={} iface={}: no local tap", a.vmid, a.iface);
+        }
+    }
+    log::debug!("{} managed taps on this host", out.len());
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn parse_tap_name_handles_qemu_and_lxc() {
+        assert_eq!(parse_tap_name("tap101i0"), Some((101, 0)));
+        assert_eq!(parse_tap_name("veth200i3"), Some((200, 3)));
+        assert_eq!(parse_tap_name("tap1000i12"), Some((1000, 12)));
+    }
+
+    #[test]
+    fn parse_tap_name_rejects_other_interfaces() {
+        assert_eq!(parse_tap_name("vmbr0"), None);
+        assert_eq!(parse_tap_name("fwln101i0"), None);
+        assert_eq!(parse_tap_name("veth200i-3"), None);
+        assert_eq!(parse_tap_name("eth0"), None);
+        assert_eq!(parse_tap_name("tap101"), None);
+        assert_eq!(parse_tap_name("tapfooibar"), None);
+    }
+}
diff --git a/src/policy/types.rs b/src/policy/types.rs
new file mode 100644
index 0000000..174aecc
--- /dev/null
+++ b/src/policy/types.rs
@@ -0,0 +1,45 @@
+//! Keep in sync with `bpf/types.h`.
+
+#[repr(C)]
+#[derive(Copy, Clone)]
+pub struct GuestGroup {
+    pub group: u16,
+    pub _pad: u16,
+}
+
+#[repr(C)]
+#[derive(Copy, Clone, Hash, PartialEq, Eq)]
+pub struct RuleKey {
+    pub src_group: u16,
+    pub dst_group: u16,
+}
+
+#[repr(C)]
+#[derive(Copy, Clone)]
+pub struct RuleValue {
+    pub allow: u8,
+    pub _pad: [u8; 3],
+}
+
+unsafe impl aya::Pod for GuestGroup {}
+unsafe impl aya::Pod for RuleKey {}
+unsafe impl aya::Pod for RuleValue {}
+
+#[cfg(test)]
+mod layout {
+    use super::*;
+    use core::mem::{offset_of, size_of};
+
+    #[test]
+    fn sizes() {
+        assert_eq!(size_of::<GuestGroup>(), 4);
+        assert_eq!(size_of::<RuleKey>(), 4);
+        assert_eq!(size_of::<RuleValue>(), 4);
+    }
+
+    #[test]
+    fn offsets() {
+        assert_eq!(offset_of!(RuleKey, dst_group), 2);
+        assert_eq!(offset_of!(GuestGroup, _pad), 2);
+    }
+}
diff --git a/src/running_config.rs b/src/running_config.rs
new file mode 100644
index 0000000..a2003de
--- /dev/null
+++ b/src/running_config.rs
@@ -0,0 +1,38 @@
+//! Loader for `/etc/pve/sdn/.running-config`, the JSON snapshot perl writes on every SDN commit.
+//! The deserialize types live in [`proxmox_ve_config::sdn`]; this module is just the agent-side
+//! I/O wrapper around them.
+
+use std::path::Path;
+
+use anyhow::Context;
+use proxmox_ve_config::sdn::config::RunningConfig;
+use proxmox_ve_config::sdn::microseg::MicrosegRunningConfig;
+
+pub const PATH: &str = "/etc/pve/sdn/.running-config";
+
+/// Read and parse the full SDN running config.
+///
+/// Returns `Ok(None)` if the file does not exist yet. This is the legitimate state of a node where
+/// SDN has never been committed. Any other I/O or parse error propagates.
+pub fn load() -> anyhow::Result<Option<RunningConfig>> {
+    load_from(Path::new(PATH))
+}
+
+/// Read and project just the microseg block, returning `default()` if the file or the `microseg`
+/// key is absent. Most agent code only cares about microseg, so this is the convenient entry
+/// point.
+pub fn load_microseg() -> anyhow::Result<MicrosegRunningConfig> {
+    Ok(load()?
+        .and_then(|c| c.microseg().cloned())
+        .unwrap_or_default())
+}
+
+fn load_from(path: &Path) -> anyhow::Result<Option<RunningConfig>> {
+    let raw = match std::fs::read_to_string(path) {
+        Ok(s) => s,
+        Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Ok(None),
+        Err(e) => return Err(e).with_context(|| format!("read {}", path.display())),
+    };
+    let cfg = serde_json::from_str(&raw).with_context(|| format!("parse {}", path.display()))?;
+    Ok(Some(cfg))
+}
diff --git a/src/state.rs b/src/state.rs
new file mode 100644
index 0000000..879562b
--- /dev/null
+++ b/src/state.rs
@@ -0,0 +1,286 @@
+//! Resolved desired state derived from the SDN running config plus the local hostname.
+//!
+//! [`MicrosegRunningConfig`] is admin-facing, groups are referenced by name, rules and assignments
+//! name them. This module does the one-time name -> id resolution so subsystems see a straight-line
+//! view of "what should be applied on this host".
+
+use std::collections::{HashMap, HashSet};
+
+use anyhow::{Context, bail};
+
+use proxmox_ve_config::sdn::microseg::MicrosegRunningConfig;
+
+#[derive(Debug, Default)]
+pub struct DesiredState {
+    pub groups: HashMap<u16, GroupInfo>,
+    pub rules: Vec<ResolvedRule>,
+    pub assignments: Vec<ResolvedAssignment>,
+}
+
+#[derive(Debug)]
+pub struct GroupInfo {
+    pub name: String,
+}
+
+#[derive(Debug)]
+pub struct ResolvedRule {
+    /// 0 if the source was unstamped (rule.src was None)
+    pub src_group: u16,
+    pub dst_group: u16,
+    pub allow: bool,
+}
+
+#[derive(Debug)]
+pub struct ResolvedAssignment {
+    pub vmid: u32,
+    pub iface: u32,
+    pub group: u16,
+}
+
+impl DesiredState {
+    pub fn build(cfg: &MicrosegRunningConfig, _this_node: &str) -> anyhow::Result<Self> {
+        let mut name_to_id: HashMap<&str, u16> = HashMap::new();
+        let mut parent_of: HashMap<&str, &str> = HashMap::new();
+        let mut groups: HashMap<u16, GroupInfo> = HashMap::new();
+        for (name, g) in cfg.groups() {
+            if name_to_id.insert(name, g.mark()).is_some() {
+                bail!("duplicate group name '{name}'");
+            }
+            if let Some(parent) = g.parent() {
+                parent_of.insert(name, parent);
+            }
+            if groups
+                .insert(
+                    g.mark(),
+                    GroupInfo {
+                        name: name.to_string(),
+                    },
+                )
+                .is_some()
+            {
+                bail!("duplicate group mark {} (name '{name}')", g.mark());
+            }
+        }
+
+        let resolve = |name: &str, ctx: &str| -> anyhow::Result<u16> {
+            name_to_id
+                .get(name)
+                .copied()
+                .with_context(|| format!("{ctx} references unknown group '{name}'"))
+        };
+
+        // Flatten the hierarchy into concrete (src, dst) pairs here so the data plane is a plain
+        // lookup -- a rule on a group covers every group nested under it. On overlap the most
+        // specific wins: destination closest in the tree, then source. Valid config never ties
+        // (single parent, no duplicates), so equal-distance can't really happen, still kept a
+        // branch for it (fallback is AND over the ties).
+        let descendants = group_descendants(&name_to_id, &parent_of);
+        let unstamped = [(0u16, 0u16)];
+        let mut verdicts: HashMap<(u16, u16), ((u16, u16), bool)> = HashMap::new();
+        for (id, r) in cfg.rules() {
+            let src: &[(u16, u16)] = match r.src() {
+                None => &unstamped[..],
+                Some(s) => descendants
+                    .get(s)
+                    .with_context(|| format!("rule '{id}' src references unknown group '{s}'"))?
+                    .as_slice(),
+            };
+            let dst = descendants.get(r.dst()).with_context(|| {
+                format!("rule '{id}' dst references unknown group '{}'", r.dst())
+            })?;
+            let allow = r.allow();
+            for &(src_group, src_dist) in src {
+                for &(dst_group, dst_dist) in dst {
+                    let specificity = (dst_dist, src_dist);
+                    verdicts
+                        .entry((src_group, dst_group))
+                        .and_modify(|best| {
+                            if specificity < best.0 {
+                                *best = (specificity, allow);
+                            } else if specificity == best.0 {
+                                best.1 &= allow;
+                            }
+                        })
+                        .or_insert((specificity, allow));
+                }
+            }
+        }
+        let rules = verdicts
+            .into_iter()
+            .map(|((src_group, dst_group), (_, allow))| ResolvedRule {
+                src_group,
+                dst_group,
+                allow,
+            })
+            .collect();
+
+        let mut assignments = Vec::new();
+        for (id, a) in cfg.assignments() {
+            let group = resolve(a.group(), &format!("assignment '{id}' group"))?;
+            assignments.push(ResolvedAssignment {
+                vmid: a.vmid(),
+                iface: a.iface(),
+                group,
+            });
+        }
+
+        Ok(Self {
+            groups,
+            rules,
+            assignments,
+        })
+    }
+}
+
+/// Map each group name to every group nested under it as `(mark, distance)` pairs, where
+/// distance is the number of steps from that descendant up to this group (0 for the group itself).
+/// A rule on a group expands across these, and the most specific rule wins per pair. Guards against
+/// cycles and unknown parents so a corrupt running-config can't hang the agent -- defense in depth,
+/// since the write path already rejects both.
+fn group_descendants(
+    name_to_id: &HashMap<&str, u16>,
+    parent_of: &HashMap<&str, &str>,
+) -> HashMap<String, Vec<(u16, u16)>> {
+    let mut out: HashMap<String, Vec<(u16, u16)>> = HashMap::new();
+    for (&name, &id) in name_to_id {
+        let mut cur = name;
+        let mut dist = 0u16;
+        let mut seen: HashSet<&str> = HashSet::new();
+        loop {
+            if !seen.insert(cur) {
+                break;
+            }
+            out.entry(cur.to_string()).or_default().push((id, dist));
+            match parent_of.get(cur) {
+                Some(&parent) if name_to_id.contains_key(parent) => {
+                    cur = parent;
+                    dist += 1;
+                }
+                _ => break,
+            }
+        }
+    }
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn cfg(json: &str) -> MicrosegRunningConfig {
+        serde_json::from_str(json).expect("parse running config")
+    }
+
+    #[test]
+    fn nested_group_inherits_parent_rule() {
+        // web is nested under app, so a rule to app also covers web with no rule of its own.
+        let state = DesiredState::build(
+            &cfg(r#"{"ids":{
+                "app":{"type":"group","id":"app","mark":10},
+                "web":{"type":"group","id":"web","mark":11,"parent":"app"},
+                "mon":{"type":"group","id":"mon","mark":20},
+                "r1":{"type":"rule","id":"r1","src":"mon","dst":"app","allow":1}
+            }}"#),
+            "node1",
+        )
+        .expect("build desired state");
+
+        let verdict = |src: u16, dst: u16| {
+            state
+                .rules
+                .iter()
+                .find(|r| r.src_group == src && r.dst_group == dst)
+                .map(|r| r.allow)
+        };
+
+        assert_eq!(verdict(20, 10), Some(true)); // mon to app
+        assert_eq!(verdict(20, 11), Some(true)); // mon to web, inherited from app
+        assert_eq!(state.rules.len(), 2);
+    }
+
+    #[test]
+    fn nearest_rule_in_the_tree_wins() {
+        // A more specific rule on a child overrides the parent rule in either direction, a deny
+        // under a broad allow and an allow under a broad deny.
+        let state = DesiredState::build(
+            &cfg(r#"{"ids":{
+                "app":{"type":"group","id":"app","mark":10},
+                "web":{"type":"group","id":"web","mark":11,"parent":"app"},
+                "db":{"type":"group","id":"db","mark":20},
+                "dbprimary":{"type":"group","id":"dbprimary","mark":21,"parent":"db"},
+                "ext":{"type":"group","id":"ext","mark":30},
+                "r1":{"type":"rule","id":"r1","src":"ext","dst":"app","allow":1},
+                "r2":{"type":"rule","id":"r2","src":"ext","dst":"web","allow":0},
+                "r3":{"type":"rule","id":"r3","src":"ext","dst":"db","allow":0},
+                "r4":{"type":"rule","id":"r4","src":"ext","dst":"dbprimary","allow":1}
+            }}"#),
+            "node1",
+        )
+        .expect("build desired state");
+
+        let verdict = |src: u16, dst: u16| {
+            state
+                .rules
+                .iter()
+                .find(|r| r.src_group == src && r.dst_group == dst)
+                .map(|r| r.allow)
+        };
+
+        assert_eq!(verdict(30, 10), Some(true)); // ext to app, broad allow
+        assert_eq!(verdict(30, 11), Some(false)); // ext to web, specific deny overrides
+        assert_eq!(verdict(30, 20), Some(false)); // ext to db, broad deny
+        assert_eq!(verdict(30, 21), Some(true)); // ext to dbprimary, specific allow overrides
+    }
+
+    #[test]
+    fn destination_specificity_is_primary() {
+        // web is under app, dbprimary under db. For web -> dbprimary two rules overlap: web -> db
+        // names the source exactly but the destination only via its parent, app -> dbprimary names
+        // the destination exactly. The destination is the primary axis, so app -> dbprimary wins
+        // and its action (allow) stands.
+        let state = DesiredState::build(
+            &cfg(r#"{"ids":{
+                "app":{"type":"group","id":"app","mark":10},
+                "web":{"type":"group","id":"web","mark":11,"parent":"app"},
+                "db":{"type":"group","id":"db","mark":20},
+                "dbprimary":{"type":"group","id":"dbprimary","mark":21,"parent":"db"},
+                "r1":{"type":"rule","id":"r1","src":"web","dst":"db","allow":0},
+                "r2":{"type":"rule","id":"r2","src":"app","dst":"dbprimary","allow":1}
+            }}"#),
+            "node1",
+        )
+        .expect("build desired state");
+
+        let allow = state
+            .rules
+            .iter()
+            .find(|r| r.src_group == 11 && r.dst_group == 21)
+            .map(|r| r.allow);
+        assert_eq!(allow, Some(true));
+    }
+
+    #[test]
+    fn source_specificity_breaks_destination_ties() {
+        // app -> svc and web -> svc both name the destination exactly, so the destination distance
+        // ties and the more specific source decides: web -> svc names the source exactly and wins
+        // over app -> svc, which only names the source's parent.
+        let state = DesiredState::build(
+            &cfg(r#"{"ids":{
+                "app":{"type":"group","id":"app","mark":10},
+                "web":{"type":"group","id":"web","mark":11,"parent":"app"},
+                "svc":{"type":"group","id":"svc","mark":30},
+                "r1":{"type":"rule","id":"r1","src":"app","dst":"svc","allow":0},
+                "r2":{"type":"rule","id":"r2","src":"web","dst":"svc","allow":1}
+            }}"#),
+            "node1",
+        )
+        .expect("build desired state");
+
+        let allow = state
+            .rules
+            .iter()
+            .find(|r| r.src_group == 11 && r.dst_group == 30)
+            .map(|r| r.allow);
+        assert_eq!(allow, Some(true));
+    }
+}
diff --git a/src/subsystem.rs b/src/subsystem.rs
new file mode 100644
index 0000000..7b944b4
--- /dev/null
+++ b/src/subsystem.rs
@@ -0,0 +1,383 @@
+//! The shared subsystem mechanism. An ingress and egress tc program with their maps and links,
+//! pinned under `/sys/fs/bpf/proxmox-ebpf/<name>/`. The loaded BPF stays in the kernel between
+//! invocations, so [`TcPrograms::ensure_loaded`] loads and verifies only on the first run and on a
+//! version change. Everything else attaches links and syncs maps against what is already there.
+//! The [`policy`](crate::policy) subsystem owns a [`TcPrograms`].
+
+use std::{collections::HashSet, fs::File, io::ErrorKind, path::PathBuf};
+
+use anyhow::Context;
+use aya::{EbpfLoader, programs::SchedClassifier};
+use nix::fcntl::{Flock, FlockArg};
+
+use crate::tc::{self, DIRECTIONS, Direction};
+
+pub const VERIFY_ROOT: &str = "/sys/fs/bpf/proxmox-ebpf-test";
+
+const PIN_ROOT: &str = "/sys/fs/bpf/proxmox-ebpf";
+const RUN_ROOT: &str = "/run/proxmox-ebpf";
+
+fn pin_root_for(name: &str) -> PathBuf {
+    PathBuf::from(PIN_ROOT).join(name)
+}
+
+// Small persisted state under /run/proxmox-ebpf/<name>/<key>, used to decide on each run whether
+// to tear down (schema changed) and whether to refresh existing links (the object changed).
+fn read_state(name: &str, key: &str) -> Option<u64> {
+    std::fs::read_to_string(PathBuf::from(RUN_ROOT).join(name).join(key))
+        .ok()
+        .and_then(|s| s.trim().parse().ok())
+}
+
+fn write_state(name: &str, key: &str, value: u64) -> anyhow::Result<()> {
+    let path = PathBuf::from(RUN_ROOT).join(name).join(key);
+    if let Some(parent) = path.parent() {
+        std::fs::create_dir_all(parent)?;
+    }
+    std::fs::write(path, format!("{value}\n"))?;
+    Ok(())
+}
+
+/// The pinned tc programs for one subsystem.
+///
+/// Holds only the immutable description. The loaded BPF lives in the kernel, pinned, and is
+/// reached back through those pins, so normal operation runs no verifier.
+pub struct TcPrograms {
+    name: &'static str,
+    obj: &'static [u8],
+    fingerprint: u64,
+    prog_name: fn(Direction) -> &'static str,
+    schema_version: Option<u32>,
+}
+
+impl TcPrograms {
+    pub fn new(
+        name: &'static str,
+        obj: &'static [u8],
+        fingerprint: u64,
+        prog_name: fn(Direction) -> &'static str,
+        schema_version: Option<u32>,
+    ) -> Self {
+        Self {
+            name,
+            obj,
+            fingerprint,
+            prog_name,
+            schema_version,
+        }
+    }
+
+    fn pin_root(&self) -> PathBuf {
+        pin_root_for(self.name)
+    }
+    fn links_dir(&self) -> PathBuf {
+        self.pin_root().join("links")
+    }
+    fn link_pin_path(&self, ifindex: u32, dir: Direction) -> PathBuf {
+        self.links_dir().join(tc::pin_filename(ifindex, dir))
+    }
+    fn prog_dir(&self) -> PathBuf {
+        self.pin_root().join("prog")
+    }
+    fn prog_pin_path(&self, dir: Direction) -> PathBuf {
+        self.prog_dir().join(dir.as_str())
+    }
+
+    /// Open a pinned BPF hash map by name, for the owning subsystem to sync. Valid once
+    /// [`ensure_loaded`](Self::ensure_loaded) has run, which is every caller's first step.
+    pub fn hash_map<K: aya::Pod, V: aya::Pod>(
+        &self,
+        name: &str,
+    ) -> anyhow::Result<aya::maps::HashMap<aya::maps::MapData, K, V>> {
+        let map = aya::maps::MapData::from_pin(self.pin_root().join(name))
+            .with_context(|| format!("{}: open pinned map {name}", self.name))?;
+        Ok(aya::maps::HashMap::try_from(aya::maps::Map::HashMap(map))?)
+    }
+
+    /// Open one direction's pinned program. A plain `BPF_OBJ_GET`, no verifier, since the program
+    /// was checked once when [`ensure_loaded`](Self::ensure_loaded) installed it.
+    fn program(&self, dir: Direction) -> anyhow::Result<SchedClassifier> {
+        SchedClassifier::from_pin(self.prog_pin_path(dir))
+            .with_context(|| format!("{}: open pinned program {}", self.name, dir.as_str()))
+    }
+
+    /// FNV-1a over the embedded object. A `const fn`, so each subsystem folds it into a `const` at
+    /// compile time and the per-invocation path never rehashes a constant.
+    pub const fn obj_fingerprint(obj: &[u8]) -> u64 {
+        let mut hash = 0xcbf29ce484222325u64;
+        let mut i = 0;
+        while i < obj.len() {
+            hash ^= obj[i] as u64;
+            hash = hash.wrapping_mul(0x100000001b3);
+            i += 1;
+        }
+        hash
+    }
+
+    /// The per-subsystem apply lock under `/run`, one file taken in two modes. A full apply (and
+    /// any install/teardown) takes it [exclusively](Self::lock_exclusive) so its enumerate, detach
+    /// and attach run as one unit. An additive single-interface apply takes it
+    /// [shared](Self::lock_shared) so guest plugs run concurrently and only block while a full
+    /// apply holds it. The kernel drops the lock if the process dies.
+    fn lock(&self, arg: FlockArg) -> anyhow::Result<Flock<File>> {
+        let dir = PathBuf::from(RUN_ROOT).join(self.name);
+        std::fs::create_dir_all(&dir)?;
+        let file = File::create(dir.join("lock"))?;
+        Flock::lock(file, arg).map_err(|(_, e)| anyhow::Error::new(e))
+    }
+
+    /// Take the apply lock in shared mode, for an additive single-interface apply.
+    pub fn lock_shared(&self) -> anyhow::Result<Flock<File>> {
+        self.lock(FlockArg::LockShared)
+    }
+
+    /// Take the apply lock exclusively, for a full apply or an install/teardown.
+    pub fn lock_exclusive(&self) -> anyhow::Result<Flock<File>> {
+        self.lock(FlockArg::LockExclusive)
+    }
+
+    /// Make sure the programs are loaded and pinned. The load runs only when there is no current
+    /// pin, when the schema version changed (a rebuild), or when the object changed (a refresh).
+    /// Otherwise the programs and maps pinned by an earlier run are reused untouched. Returns
+    /// whether a (re)install happened, so the caller follows with a full apply.
+    ///
+    /// Takes no lock of its own: the caller holds the apply lock exclusively whenever an install
+    /// may be needed (it gates on [`is_current`](Self::is_current) first), so the install path
+    /// here is already serialized. A subsystem with no maps passes no schema version and so never
+    /// verifies or tears down.
+    pub fn ensure_loaded(&self) -> anyhow::Result<bool> {
+        if self.is_current() {
+            return Ok(false);
+        }
+
+        // rebuild when the pinned programs are of an unknown or different schema: a different
+        // recorded schema_version, or none recorded while programs are still pinned (the /run
+        // markers were lost but the bpffs pins survived, e.g. across a service restart) -- the
+        // loaded map layout is then unknown, so tear down rather than bind new code to old maps.
+        let schema_changed = self.programs_pinned()
+            && match self.schema_version {
+                Some(v) => read_state(self.name, "schema_version") != Some(v as u64),
+                None => false,
+            };
+        let code_changed = read_state(self.name, "fingerprint") != Some(self.fingerprint);
+
+        if schema_changed {
+            log::warn!("{}: schema changed or unknown, rebuilding", self.name);
+            // verify the new code loads against throw-away state before tearing the old down, so a
+            // verifier rejection can't leave us with the old state wiped and nothing to replace it
+            self.verify().with_context(|| {
+                format!(
+                    "{}: new BPF code does not load against fresh state",
+                    self.name
+                )
+            })?;
+            self.tear_down().context("tear_down")?;
+        }
+
+        self.load_and_pin().context("load_and_pin")?;
+
+        // refresh links pinned by a previous run onto the new code. no-op on a fresh install or
+        // right after a teardown, where there are no links yet
+        if code_changed {
+            self.swap_existing_links();
+        }
+
+        if let Some(v) = self.schema_version
+            && let Err(e) = write_state(self.name, "schema_version", v as u64)
+        {
+            log::warn!("{}: failed to persist schema_version: {e:#}", self.name);
+        }
+        if let Err(e) = write_state(self.name, "fingerprint", self.fingerprint) {
+            log::warn!("{}: failed to persist fingerprint: {e:#}", self.name);
+        }
+        Ok(true)
+    }
+
+    /// True when the pinned programs already match this build, the `/run` fingerprint and schema
+    /// version agree with the embedded object and both programs are pinned. Lock-free; callers use
+    /// it to decide whether a run needs the exclusive lock (an install) or only the shared one.
+    pub fn is_current(&self) -> bool {
+        let schema_changed = match self.schema_version {
+            Some(v) => read_state(self.name, "schema_version").is_some_and(|s| s != v as u64),
+            None => false,
+        };
+        let code_changed = read_state(self.name, "fingerprint") != Some(self.fingerprint);
+        !schema_changed && !code_changed && self.programs_pinned()
+    }
+
+    fn programs_pinned(&self) -> bool {
+        DIRECTIONS
+            .iter()
+            .all(|&dir| self.prog_pin_path(dir).exists())
+    }
+
+    fn verify(&self) -> anyhow::Result<()> {
+        tc::verify(&[self.obj], &DIRECTIONS.map(self.prog_name))
+    }
+
+    fn tear_down(&self) -> anyhow::Result<()> {
+        match std::fs::remove_dir_all(self.pin_root()) {
+            Ok(()) => {}
+            Err(e) if e.kind() == ErrorKind::NotFound => {}
+            Err(e) => return Err(e.into()),
+        }
+        std::fs::create_dir_all(self.links_dir())?;
+        Ok(())
+    }
+
+    /// Load and verify the object, then pin its programs. The only place the verifier runs. The
+    /// `bpf` handle is dropped at the end, the pinned programs and maps stay resident in the
+    /// kernel.
+    fn load_and_pin(&self) -> anyhow::Result<()> {
+        std::fs::create_dir_all(self.links_dir())?;
+        std::fs::create_dir_all(self.prog_dir())?;
+        let mut bpf = EbpfLoader::new()
+            .map_pin_path(self.pin_root())
+            .load(self.obj)?;
+        for dir in DIRECTIONS {
+            let prog: &mut SchedClassifier =
+                bpf.program_mut((self.prog_name)(dir)).unwrap().try_into()?;
+            prog.load()?;
+            let path = self.prog_pin_path(dir);
+            // refreshed program is a new kernel object but the old pin file still exists, so
+            // remove it before re-pinning or pin() hits EEXIST
+            match std::fs::remove_file(&path) {
+                Ok(()) => {}
+                Err(e) if e.kind() == ErrorKind::NotFound => {}
+                Err(e) => return Err(e).context("remove stale program pin"),
+            }
+            prog.pin(&path)?;
+        }
+        Ok(())
+    }
+
+    fn live_links(&self) -> anyhow::Result<Vec<(u32, Direction)>> {
+        tc::read_pinned_links(&self.links_dir())
+    }
+
+    fn swap_existing_links(&self) {
+        let live = match self.live_links() {
+            Ok(l) => l,
+            Err(e) => {
+                log::error!("{}: read pinned links: {e:#}", self.name);
+                return;
+            }
+        };
+        for (ifindex, dir) in live {
+            if let Err(e) = self.swap_pinned_link(ifindex, dir) {
+                log::error!("{}: swap link {ifindex}-{}: {e:#}", self.name, dir.as_str());
+            }
+        }
+    }
+
+    /// Make the attached set match `desired`. Detach interfaces no longer wanted, attach the ones
+    /// missing. Refreshing existing links onto new code is done by
+    /// [`ensure_loaded`](Self::ensure_loaded). Runs under the caller's exclusive apply lock, so the
+    /// live set it samples cannot change under it.
+    ///
+    /// Returns an error if any interface failed to attach (after attempting all of them): a NIC
+    /// left without its program would pass traffic unenforced, so that surfaces as a failed apply
+    /// rather than a silent fail-open. A failed detach only leaves over-enforcement and is logged.
+    pub fn reconcile(&self, desired: &HashSet<u32>) -> anyhow::Result<()> {
+        let live: HashSet<(u32, Direction)> = self.live_links()?.into_iter().collect();
+
+        for &(ifidx, dir) in &live {
+            if desired.contains(&ifidx) {
+                continue;
+            }
+            log::debug!("{}: detach {ifidx}-{}", self.name, dir.as_str());
+            if let Err(e) = tc::detach_pinned_link(&self.link_pin_path(ifidx, dir)) {
+                log::error!("{}: detach {ifidx}-{}: {e:#}", self.name, dir.as_str());
+            } else {
+                log::info!("{}: detached {ifidx}-{}", self.name, dir.as_str());
+            }
+        }
+
+        let mut failed = 0usize;
+        for &ifidx in desired {
+            for dir in DIRECTIONS {
+                if live.contains(&(ifidx, dir)) {
+                    continue;
+                }
+                if let Err(e) = self.attach_and_pin(ifidx, dir) {
+                    log::error!("{}: attach {ifidx}-{}: {e:#}", self.name, dir.as_str());
+                    failed += 1;
+                }
+            }
+        }
+        if failed > 0 {
+            anyhow::bail!("{}: {failed} interface(s) failed to attach", self.name);
+        }
+        Ok(())
+    }
+
+    /// Attach the programs to a single interface, additively, without touching others.
+    ///
+    /// If a pin already exists, swap the program in place with no traffic gap. The swap only
+    /// succeeds while the link is still on the live netdev at this ifindex; a recycled ifindex
+    /// leaves a defunct pin, so the swap fails and we reclaim it and attach fresh. A failed attach
+    /// propagates so the caller can refuse to bring the NIC up unenforced.
+    pub fn attach_iface(&self, ifindex: u32) -> anyhow::Result<()> {
+        for dir in DIRECTIONS {
+            let path = self.link_pin_path(ifindex, dir);
+            if path.exists() {
+                match self.swap_pinned_link(ifindex, dir) {
+                    Ok(()) => continue,
+                    Err(e) => {
+                        log::debug!(
+                            "{}: {ifindex}-{} swap failed ({e:#}), rebuilding",
+                            self.name,
+                            dir.as_str()
+                        );
+                        if let Err(e) = tc::detach_pinned_link(&path) {
+                            log::warn!(
+                                "{}: reclaim {ifindex}-{}: unpin stale link: {e:#}",
+                                self.name,
+                                dir.as_str()
+                            );
+                            let _ = std::fs::remove_file(&path);
+                        }
+                    }
+                }
+            }
+            self.attach_and_pin(ifindex, dir)
+                .with_context(|| format!("{}: attach {ifindex}-{}", self.name, dir.as_str()))?;
+        }
+        Ok(())
+    }
+
+    fn swap_pinned_link(&self, ifindex: u32, dir: Direction) -> anyhow::Result<()> {
+        let path = self.link_pin_path(ifindex, dir);
+        let mut prog = self.program(dir)?;
+        tc::swap_pinned_link(&mut prog, &path)?;
+        Ok(())
+    }
+
+    fn attach_and_pin(&self, ifindex: u32, dir: Direction) -> anyhow::Result<()> {
+        let path = self.link_pin_path(ifindex, dir);
+        let mut prog = self.program(dir)?;
+        tc::attach_and_pin(&mut prog, ifindex, dir, &path)?;
+        Ok(())
+    }
+
+    /// Detach every pinned link and drop all pinned and `/run` state for this subsystem, under the
+    /// exclusive lock. For package removal: leaving links attached would keep interfaces enforcing
+    /// the last-applied policy with no agent left to update them. Best-effort past the lock -- logs
+    /// and continues so one failure does not strand the rest.
+    pub fn clear(&self) -> anyhow::Result<()> {
+        let _lock = self.lock_exclusive()?;
+        for (ifindex, dir) in self.live_links().unwrap_or_default() {
+            if let Err(e) = tc::detach_pinned_link(&self.link_pin_path(ifindex, dir)) {
+                log::warn!("{}: detach {ifindex}-{}: {e:#}", self.name, dir.as_str());
+            }
+        }
+        for path in [self.pin_root(), PathBuf::from(RUN_ROOT).join(self.name)] {
+            if let Err(e) = std::fs::remove_dir_all(&path)
+                && e.kind() != ErrorKind::NotFound
+            {
+                log::warn!("{}: remove {}: {e:#}", self.name, path.display());
+            }
+        }
+        Ok(())
+    }
+}
diff --git a/src/tc.rs b/src/tc.rs
new file mode 100644
index 0000000..d89304d
--- /dev/null
+++ b/src/tc.rs
@@ -0,0 +1,151 @@
+//! TC link plumbing for the [`policy`](crate::policy) subsystem. A `Direction` enum,
+//! attach/swap/detach free functions, and a uniform pin-filename layout `{ifindex}-{direction}`.
+
+use std::{io::ErrorKind, path::Path, str::FromStr};
+
+use anyhow::Context;
+use aya::{
+    EbpfLoader,
+    programs::{
+        SchedClassifier, TcAttachType,
+        links::{FdLink, PinnedLink},
+        tc,
+    },
+};
+
+use crate::subsystem::VERIFY_ROOT;
+
+#[derive(Copy, Clone, Hash, PartialEq, Eq, Debug)]
+pub enum Direction {
+    Ingress,
+    Egress,
+}
+
+pub const DIRECTIONS: [Direction; 2] = [Direction::Ingress, Direction::Egress];
+
+impl Direction {
+    pub fn as_str(self) -> &'static str {
+        match self {
+            Self::Ingress => "ingress",
+            Self::Egress => "egress",
+        }
+    }
+    pub fn aya_type(self) -> TcAttachType {
+        match self {
+            Self::Ingress => TcAttachType::Ingress,
+            Self::Egress => TcAttachType::Egress,
+        }
+    }
+}
+
+impl FromStr for Direction {
+    type Err = ();
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        match s {
+            "ingress" => Ok(Self::Ingress),
+            "egress" => Ok(Self::Egress),
+            _ => Err(()),
+        }
+    }
+}
+
+/// RAII handle for the shared verify root. Constructor wipes any stale contents and creates the
+/// dir. `Drop` removes it. Cleanup happens even if the verify body panics or returns Err.
+struct VerifyRoot;
+
+impl VerifyRoot {
+    fn new() -> anyhow::Result<Self> {
+        let _ = std::fs::remove_dir_all(VERIFY_ROOT);
+        std::fs::create_dir_all(VERIFY_ROOT)
+            .with_context(|| format!("create verify root {VERIFY_ROOT}"))?;
+        Ok(Self)
+    }
+    fn path(&self) -> &Path {
+        Path::new(VERIFY_ROOT)
+    }
+}
+
+impl Drop for VerifyRoot {
+    fn drop(&mut self) {
+        if let Err(e) = std::fs::remove_dir_all(VERIFY_ROOT) {
+            log::warn!("failed to clean up verify root {VERIFY_ROOT}: {e:#}");
+        }
+    }
+}
+
+/// Loads every named program in each object against a throwaway pin root, catching verifier
+/// regressions before any real state is touched.
+pub fn verify(objects: &[&[u8]], program_names: &[&str]) -> anyhow::Result<()> {
+    let root = VerifyRoot::new()?;
+    for &obj in objects {
+        let mut bpf = EbpfLoader::new().map_pin_path(root.path()).load(obj)?;
+        for &name in program_names {
+            let p: &mut SchedClassifier = bpf.program_mut(name).unwrap().try_into()?;
+            p.load()?;
+        }
+    }
+    Ok(())
+}
+
+pub fn pin_filename(ifindex: u32, dir: Direction) -> String {
+    format!("{ifindex}-{}", dir.as_str())
+}
+
+/// Ensure clsact qdisc on the iface, attach `prog` in `dir`, pin the link.
+pub fn attach_and_pin(
+    prog: &mut SchedClassifier,
+    ifindex: u32,
+    dir: Direction,
+    pin_path: &Path,
+) -> anyhow::Result<()> {
+    let name = nix::net::if_::if_indextoname(ifindex)?;
+    let name = name.to_str()?;
+    let _ = tc::qdisc_add_clsact(name);
+    let link_id = prog.attach(name, dir.aya_type())?;
+    let link = prog.take_link(link_id)?;
+    let fd_link: FdLink = link.try_into()?;
+    fd_link.pin(pin_path)?;
+    Ok(())
+}
+
+/// Rebind a pinned link to `prog` via LINK_UPDATE. Atomic, traffic sees no detach/reattach gap.
+pub fn swap_pinned_link(prog: &mut SchedClassifier, pin_path: &Path) -> anyhow::Result<()> {
+    let pinned = PinnedLink::from_pin(pin_path)?;
+    let fd_link: FdLink = pinned.into();
+    let link = fd_link.try_into()?;
+    let new_id = prog.attach_to_link(link)?;
+    // take the handle out of aya's internal tracking, we have the pin
+    let _ = prog.take_link(new_id)?;
+    Ok(())
+}
+
+pub fn detach_pinned_link(pin_path: &Path) -> anyhow::Result<()> {
+    let pinned = PinnedLink::from_pin(pin_path)?;
+    let _fd_link = pinned.unpin()?;
+    Ok(())
+}
+
+/// Read and parse every pin file in `links_dir` as `{ifindex}-{direction}`. Unrecognized names are
+/// skipped with a warning.
+pub fn read_pinned_links(links_dir: &Path) -> anyhow::Result<Vec<(u32, Direction)>> {
+    let mut out = Vec::new();
+    let dir = match std::fs::read_dir(links_dir) {
+        Ok(d) => d,
+        Err(e) if e.kind() == ErrorKind::NotFound => return Ok(out),
+        Err(e) => return Err(e.into()),
+    };
+    for entry in dir {
+        let entry = entry?;
+        let name = entry.file_name();
+        let name = name.to_string_lossy();
+        let Some((ifidx_str, dir_str)) = name.split_once('-') else {
+            log::warn!("unrecognized pin file in links dir: {name}");
+            continue;
+        };
+        match (ifidx_str.parse::<u32>(), dir_str.parse::<Direction>()) {
+            (Ok(ifidx), Ok(d)) => out.push((ifidx, d)),
+            _ => log::warn!("unrecognized pin file in links dir: {name}"),
+        }
+    }
+    Ok(out)
+}
-- 
2.47.3

next prev parent reply	other threads:[~2026-06-09 13:28 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-09 13:25 [RFC cluster/docs/ifupdown2/manager/network/proxmox{-ebpf,-ve-rs,-perl-rs} 00/16] sdn: add microsegmentation support Hannes Laimer
2026-06-09 13:25 ` Hannes Laimer [this message]
2026-06-09 13:25 ` [PATCH proxmox-ebpf 02/16] bpf: add bridge subsystem Hannes Laimer
2026-06-09 13:25 ` [PATCH proxmox-ebpf 03/16] debian: add packaging and boot-time oneshot unit Hannes Laimer
2026-06-09 13:25 ` [PATCH proxmox-ve-rs 04/16] ve-config: sdn: add microseg config types Hannes Laimer
2026-06-09 13:25 ` [PATCH proxmox-perl-rs 05/16] sdn: add microseg config binding Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-cluster 06/16] cfs: add 'sdn/microseg.cfg' to observed files Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-network 07/16] sdn: microseg: add config and API Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-network 08/16] sdn: zones: trigger microseg apply on tap_plug Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-network 09/16] sdn: zones: add vxlan-gbp option to vxlan and evpn zones Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-network 10/16] evpn: disable vxlan-learning on create if GBP is enabled Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-manager 11/16] ui: sdn: add microsegmentation Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-manager 12/16] network: apply microseg state on reload Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-manager 13/16] ui: sdn: zones: add vxlan-gbp checkbox to vxlan and evpn Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-docs 14/16] sdn: add microsegmentation section Hannes Laimer
2026-06-09 13:25 ` [PATCH pve-docs 15/16] sdn: add VXLAN-GBP flag to evpn/vxlan zone sections Hannes Laimer
2026-06-09 13:25 ` [PATCH ifupdown2 16/16] d/patches: add support for VXLAN-GBP flag Hannes Laimer

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:e3c0577 dfblob:ae8eb66 dfblob:6b3c16c dfblob:b9a1b51
dfblob:67a86a9 dfblob:9b3eae8 dfblob:174aecc dfblob:a2003de
dfblob:879562b dfblob:7b944b4 dfblob:d89304d )
 OR (
bs:"[PATCH proxmox-ebpf 01/16] agent: add userspace coordinator and stateless policy subsystem" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260609132522.235917-2-h.laimer@proxmox.com \
    --to=h.laimer@proxmox.com \
    --cc=pve-devel@lists.proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox

Service provided by Proxmox Server Solutions GmbH | Privacy | Legal