* [pve-devel] [PATCH RFC container] Add device passthrough @ 2023-10-19 12:18 Filip Schauer 2023-10-20 7:08 ` Wolfgang Bumiller 0 siblings, 1 reply; 6+ messages in thread From: Filip Schauer @ 2023-10-19 12:18 UTC (permalink / raw) To: pve-devel Signed-off-by: Filip Schauer <f.schauer@proxmox.com> --- Is it reasonable to add a "dev[n]" argument to the pct.conf, given that device mount points only allow passing through block devices? src/PVE/LXC.pm | 14 ++++++++++++++ src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm index c9b5ba7..6090534 100644 --- a/src/PVE/LXC.pm +++ b/src/PVE/LXC.pm @@ -639,6 +639,20 @@ sub update_lxc_config { $raw .= "lxc.mount.auto = sys:mixed\n"; } + foreach my $k (keys %$conf) { + next if $k !~ m/^dev(\d+)$/; + my $devpath = $conf->{$k}; + die "Device $devpath does not exist\n" unless (-e $devpath); + + my ($mode, $rdev) = (stat($devpath))[2, 6]; + die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev); + + my $major = PVE::Tools::dev_t_major($rdev); + my $minor = PVE::Tools::dev_t_minor($rdev); + $raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n"; + $raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n"; + } + # WARNING: DO NOT REMOVE this without making sure that loop device nodes # cannot be exposed to the container with r/w access (cgroup perms). # When this is enabled mounts will still remain in the monitor's namespace diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm index 56e1f10..4665ab1 100644 --- a/src/PVE/LXC/Config.pm +++ b/src/PVE/LXC/Config.pm @@ -29,6 +29,7 @@ mkdir $lockdir; mkdir "/etc/pve/nodes/$nodename/lxc"; my $MAX_MOUNT_POINTS = 256; my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS; +my $MAX_DEVICES = 256; # BEGIN implemented abstract methods from PVE::AbstractConfig @@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) { } } +PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string); +sub verify_lxc_dev_string { + my ($dev, $noerr) = @_; + + if ($dev !~ m!^/dev/!) { + return undef if $noerr; + die "$dev does not start with /dev/\n"; + } + + return $dev; +} + +my $dev_desc = { + dev => { + type => 'string', + default_key => 1, + format => 'pve-lxc-dev-string', + format_description => 'Path', + description => 'Device to pass through to the container', + verbose_description => 'Path to the device to pass through to the container' + } +}; + +for (my $i = 0; $i < $MAX_DEVICES; $i++) { + $confdesc->{"dev$i"} = { + optional => 1, + type => 'string', format => $dev_desc, + description => "Device to pass through to the container", + } +} + sub parse_pct_config { my ($filename, $raw, $strict) = @_; -- 2.39.2 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [pve-devel] [PATCH RFC container] Add device passthrough 2023-10-19 12:18 [pve-devel] [PATCH RFC container] Add device passthrough Filip Schauer @ 2023-10-20 7:08 ` Wolfgang Bumiller 2023-10-20 7:51 ` Dominik Csapak 0 siblings, 1 reply; 6+ messages in thread From: Wolfgang Bumiller @ 2023-10-20 7:08 UTC (permalink / raw) To: Filip Schauer; +Cc: pve-devel On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote: > Signed-off-by: Filip Schauer <f.schauer@proxmox.com> > --- > Is it reasonable to add a "dev[n]" argument to the pct.conf, given that > device mount points only allow passing through block devices? Why would they only allow block devices? Also, Dominik recently added resource mappings for qemu for USB & PCI. PCI might be tricky, but for USB we may be able to use these mappings as well. That said, "raw" `/dev` node pass-through still makes sense as a separate thing for containers anyway since raw `lxc....` entries in the container config can currently be very inconvenient to deal with particularly with unprivileged containers (read on below for why...) > > src/PVE/LXC.pm | 14 ++++++++++++++ > src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm > index c9b5ba7..6090534 100644 > --- a/src/PVE/LXC.pm > +++ b/src/PVE/LXC.pm > @@ -639,6 +639,20 @@ sub update_lxc_config { > $raw .= "lxc.mount.auto = sys:mixed\n"; > } > > + foreach my $k (keys %$conf) { > + next if $k !~ m/^dev(\d+)$/; We should add a helper `sub foreach_passthrough_device()` taking a sub which gets `$key, $raw_value, $sanitized_path`. The last one would currently be the `substr($path, 1)`. Having to do this later on just makes it easier to mess this up. Also, seeing this now this reminds me that - but not as part of this series - we should add a lot of `foreach_*` subs for this sort of iteration like we have in qemu-server and replace such existing constructs if possible... > + my $devpath = $conf->{$k}; > + die "Device $devpath does not exist\n" unless (-e $devpath); > + > + my ($mode, $rdev) = (stat($devpath))[2, 6]; > + die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev); > + > + my $major = PVE::Tools::dev_t_major($rdev); > + my $minor = PVE::Tools::dev_t_minor($rdev); > + $raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n"; ^ the 'c' isn't necessarily correct. You can use `S_ISBLK($mode) ? 'b' : 'c'`. `S_ISBLK` comes from the `Fcntl` package. (You can find more info on file modes and S_ISBLK in `man 7 inode`) > + $raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n"; ^ See above - the substr() there stands out quite weird :-) > + } > + > # WARNING: DO NOT REMOVE this without making sure that loop device nodes > # cannot be exposed to the container with r/w access (cgroup perms). > # When this is enabled mounts will still remain in the monitor's namespace > diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm > index 56e1f10..4665ab1 100644 > --- a/src/PVE/LXC/Config.pm > +++ b/src/PVE/LXC/Config.pm > @@ -29,6 +29,7 @@ mkdir $lockdir; > mkdir "/etc/pve/nodes/$nodename/lxc"; > my $MAX_MOUNT_POINTS = 256; > my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS; > +my $MAX_DEVICES = 256; > > # BEGIN implemented abstract methods from PVE::AbstractConfig > > @@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) { > } > } > > +PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string); > +sub verify_lxc_dev_string { > + my ($dev, $noerr) = @_; > + > + if ($dev !~ m!^/dev/!) { > + return undef if $noerr; > + die "$dev does not start with /dev/\n"; This is where it gets interesting :-) I can't find it now, but I faintly recall that at some point I had suggested collecting devices in a different path than `/dev` to create device nodes with the correct owner/group/permissions for unprivileged containers to people on the forum. See, we don't just need the device to be visible in the container, but we also need it to be able to give the container's root user - which for unprivileged containers is usually uid 100000 - access to it. It *may* be fine for *some* devices to just `chown()` them in `/dev`, but we generally don't want to reown devices on the host itself, so it may make more sense to `mknod()` the same device somewhere in `/var/lib/lxc/$vmid/passthrough` and `chown()` it there. (One alternative would of course be ACL entries on the host, but IMO having a directory specific for a container where you can see all its accessible devics has its advantages...) (With newer kernels we might be able to instead use an idmapped bind mount, though, but pve-8's initial kernel does not support idmapping on tmpfs yet :-/ (that starts at 6.3 IIRC)) We should also think about whether we'd like to support the containers calling `mknod()` to "create" (but not really) those devices themselves. for privileged containers this is easy enough, privileged means privileged, so they can just do whatever. For unprivileged containers, however, `mknod()` does not work at all, and even if it did, we cannot allow it directly, since we have no guarantee that after the next reboot the major/minor numbers will be the same after the next reboot. So if we do want to support this, we'd need to teach `pve-lxc-syscalld` to know about allowed devices and give it the ability to hot-bind-mount such nodes (rather than actually mknod()ing them). > + } > + > + return $dev; > +} > + > +my $dev_desc = { > + dev => { > + type => 'string', > + default_key => 1, > + format => 'pve-lxc-dev-string', > + format_description => 'Path', > + description => 'Device to pass through to the container', > + verbose_description => 'Path to the device to pass through to the container' > + } > +}; > + > +for (my $i = 0; $i < $MAX_DEVICES; $i++) { > + $confdesc->{"dev$i"} = { > + optional => 1, > + type => 'string', format => $dev_desc, > + description => "Device to pass through to the container", > + } > +} > + > sub parse_pct_config { > my ($filename, $raw, $strict) = @_; > > -- > 2.39.2 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [pve-devel] [PATCH RFC container] Add device passthrough 2023-10-20 7:08 ` Wolfgang Bumiller @ 2023-10-20 7:51 ` Dominik Csapak 2023-10-20 8:29 ` Thomas Lamprecht 0 siblings, 1 reply; 6+ messages in thread From: Dominik Csapak @ 2023-10-20 7:51 UTC (permalink / raw) To: Proxmox VE development discussion, Wolfgang Bumiller, Filip Schauer On 10/20/23 09:08, Wolfgang Bumiller wrote: > On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote: >> Signed-off-by: Filip Schauer <f.schauer@proxmox.com> >> --- >> Is it reasonable to add a "dev[n]" argument to the pct.conf, given that >> device mount points only allow passing through block devices? > > Why would they only allow block devices? > > Also, Dominik recently added resource mappings for qemu for USB & PCI. > PCI might be tricky, but for USB we may be able to use these mappings as > well. > That said, "raw" `/dev` node pass-through still makes sense as a > separate thing for containers anyway since raw `lxc....` entries in the > container config can currently be very inconvenient to deal with > particularly with unprivileged containers (read on below for why...) just to chime in here, i don't think it'll be easily possible to reuse the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device (or path) to a device node? i don't think thats generally possible, since the driver does not always make that info easily available (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters and /dev/ttySX ?) i guess it could work, but we probably would have to implement that for every driver out there? what i would like to see however is to integrate a new type of mapping for container devices specifically so that the ux is the same (create mappings for whole cluster, assigning privileges, etc) we still can provide a 'raw' mechanic as well for those who only ever use root@pam on a single node, but I'd not be against only using mappings either ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [pve-devel] [PATCH RFC container] Add device passthrough 2023-10-20 7:51 ` Dominik Csapak @ 2023-10-20 8:29 ` Thomas Lamprecht 2023-10-20 8:39 ` Dominik Csapak 0 siblings, 1 reply; 6+ messages in thread From: Thomas Lamprecht @ 2023-10-20 8:29 UTC (permalink / raw) To: Proxmox VE development discussion, Dominik Csapak, Wolfgang Bumiller, Filip Schauer Am 20/10/2023 um 09:51 schrieb Dominik Csapak: > On 10/20/23 09:08, Wolfgang Bumiller wrote: >> Also, Dominik recently added resource mappings for qemu for USB & PCI. >> PCI might be tricky, but for USB we may be able to use these mappings as >> well. >> That said, "raw" `/dev` node pass-through still makes sense as a >> separate thing for containers anyway since raw `lxc....` entries in the >> container config can currently be very inconvenient to deal with >> particularly with unprivileged containers (read on below for why...) > > just to chime in here, i don't think it'll be easily possible to reuse > the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device > (or path) to a device node? i don't think thats generally possible, since > the driver does not always make that info easily available > (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters > and /dev/ttySX ?) i guess it could work, but we probably would have > to implement that for every driver out there? > USB should be workable via resolving to /dev/bus/usb/*, PCI could be, theoretically, but isn't now and probably won't be anytime soon – i.e., as Wolfgang mentioned off list, there's a reason that there's no /dev/bus/pci/ > what i would like to see however is to integrate a new type of mapping > for container devices specifically so that the ux is the same > (create mappings for whole cluster, assigning privileges, etc) I'd try hard to re-use the USB mappings, those seem to be one of the most common pass-through setups for containers (e.g., for those home automation zigbee/matter/... adapters, or in some countries DVB-T ones, be it for TV or ADS-B plane tracking). If we can make USB work then we'd have the basic concept of attaching a mapping done, adding a new type of (block/char) device mapping could then be an independent task for later to keep scope a bit smaller. Fixing Wolfgang's comments for workable pass-through for unprivileged containers is probably the most important change needed for now. I'd then even be open to apply this with (root@pam only!) absolute path to /dev support only, but IMO resolving the mapping itself should not be too hard, so if using /dev/bus/usb/ works having that in there from the start would be definitively nice. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [pve-devel] [PATCH RFC container] Add device passthrough 2023-10-20 8:29 ` Thomas Lamprecht @ 2023-10-20 8:39 ` Dominik Csapak 2023-10-24 13:00 ` Filip Schauer 0 siblings, 1 reply; 6+ messages in thread From: Dominik Csapak @ 2023-10-20 8:39 UTC (permalink / raw) To: Thomas Lamprecht, Proxmox VE development discussion, Wolfgang Bumiller, Filip Schauer On 10/20/23 10:29, Thomas Lamprecht wrote: > Am 20/10/2023 um 09:51 schrieb Dominik Csapak: >> On 10/20/23 09:08, Wolfgang Bumiller wrote: >>> Also, Dominik recently added resource mappings for qemu for USB & PCI. >>> PCI might be tricky, but for USB we may be able to use these mappings as >>> well. >>> That said, "raw" `/dev` node pass-through still makes sense as a >>> separate thing for containers anyway since raw `lxc....` entries in the >>> container config can currently be very inconvenient to deal with >>> particularly with unprivileged containers (read on below for why...) >> >> just to chime in here, i don't think it'll be easily possible to reuse >> the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device >> (or path) to a device node? i don't think thats generally possible, since >> the driver does not always make that info easily available >> (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters >> and /dev/ttySX ?) i guess it could work, but we probably would have >> to implement that for every driver out there? >> > > USB should be workable via resolving to /dev/bus/usb/*, PCI could be, > theoretically, but isn't now and probably won't be anytime soon – i.e., > as Wolfgang mentioned off list, there's a reason that there's no > /dev/bus/pci/ > >> what i would like to see however is to integrate a new type of mapping >> for container devices specifically so that the ux is the same >> (create mappings for whole cluster, assigning privileges, etc) > > I'd try hard to re-use the USB mappings, those seem to be one of the > most common pass-through setups for containers (e.g., for those > home automation zigbee/matter/... adapters, or in some countries DVB-T > ones, be it for TV or ADS-B plane tracking). i guess, but sadly the /dev/bus/usb endpoint is mostly not what you want to pass-through but the driver specific /dev/ttySX /dev/dvb/X and so on (there are situations where the /dev/bus/usb path is the wanted one, but there are many where it isn't) and while we can map from those to the usb device/vendor/path the reverse mapping is not so easy (at least when i tried i did not found a generic way via udev or similar i would welcome it though if there is a way to do that of course > > If we can make USB work then we'd have the basic concept of attaching > a mapping done, adding a new type of (block/char) device mapping could > then be an independent task for later to keep scope a bit smaller. > > Fixing Wolfgang's comments for workable pass-through for unprivileged > containers is probably the most important change needed for now. > > I'd then even be open to apply this with (root@pam only!) absolute > path to /dev support only, but IMO resolving the mapping itself should > not be too hard, so if using /dev/bus/usb/ works having that in there > from the start would be definitively nice. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [pve-devel] [PATCH RFC container] Add device passthrough 2023-10-20 8:39 ` Dominik Csapak @ 2023-10-24 13:00 ` Filip Schauer 0 siblings, 0 replies; 6+ messages in thread From: Filip Schauer @ 2023-10-24 13:00 UTC (permalink / raw) To: Dominik Csapak, Thomas Lamprecht, Proxmox VE development discussion, Wolfgang Bumiller A patch v2 is available: https://lists.proxmox.com/pipermail/pve-devel/2023-October/059617.html On 20/10/2023 10:39, Dominik Csapak wrote: > On 10/20/23 10:29, Thomas Lamprecht wrote: >> Am 20/10/2023 um 09:51 schrieb Dominik Csapak: >>> On 10/20/23 09:08, Wolfgang Bumiller wrote: >>>> Also, Dominik recently added resource mappings for qemu for USB & PCI. >>>> PCI might be tricky, but for USB we may be able to use these >>>> mappings as >>>> well. >>>> That said, "raw" `/dev` node pass-through still makes sense as a >>>> separate thing for containers anyway since raw `lxc....` entries in >>>> the >>>> container config can currently be very inconvenient to deal with >>>> particularly with unprivileged containers (read on below for why...) >>> >>> just to chime in here, i don't think it'll be easily possible to reuse >>> the pci/usb maps as is since we'd have to map from pciid >>> /usb-vendor/device >>> (or path) to a device node? i don't think thats generally possible, >>> since >>> the driver does not always make that info easily available >>> (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters >>> and /dev/ttySX ?) i guess it could work, but we probably would have >>> to implement that for every driver out there? >>> >> >> USB should be workable via resolving to /dev/bus/usb/*, PCI could be, >> theoretically, but isn't now and probably won't be anytime soon – i.e., >> as Wolfgang mentioned off list, there's a reason that there's no >> /dev/bus/pci/ >> >>> what i would like to see however is to integrate a new type of mapping >>> for container devices specifically so that the ux is the same >>> (create mappings for whole cluster, assigning privileges, etc) >> >> I'd try hard to re-use the USB mappings, those seem to be one of the >> most common pass-through setups for containers (e.g., for those >> home automation zigbee/matter/... adapters, or in some countries DVB-T >> ones, be it for TV or ADS-B plane tracking). > > i guess, but sadly the /dev/bus/usb endpoint is mostly not what you want > to pass-through but the driver specific /dev/ttySX /dev/dvb/X and so on > (there are situations where the /dev/bus/usb path is the wanted one, > but there are many where it isn't) > > and while we can map from those to the usb device/vendor/path the reverse > mapping is not so easy (at least when i tried i did not found a > generic way > via udev or similar > > i would welcome it though if there is a way to do that of course > >> >> If we can make USB work then we'd have the basic concept of attaching >> a mapping done, adding a new type of (block/char) device mapping could >> then be an independent task for later to keep scope a bit smaller. >> >> Fixing Wolfgang's comments for workable pass-through for unprivileged >> containers is probably the most important change needed for now. >> >> I'd then even be open to apply this with (root@pam only!) absolute >> path to /dev support only, but IMO resolving the mapping itself should >> not be too hard, so if using /dev/bus/usb/ works having that in there >> from the start would be definitively nice. > > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2023-10-24 13:00 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-10-19 12:18 [pve-devel] [PATCH RFC container] Add device passthrough Filip Schauer 2023-10-20 7:08 ` Wolfgang Bumiller 2023-10-20 7:51 ` Dominik Csapak 2023-10-20 8:29 ` Thomas Lamprecht 2023-10-20 8:39 ` Dominik Csapak 2023-10-24 13:00 ` Filip Schauer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox