public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] [PATCH RFC container] Add device passthrough
@ 2023-10-19 12:18 Filip Schauer
  2023-10-20  7:08 ` Wolfgang Bumiller
  0 siblings, 1 reply; 6+ messages in thread
From: Filip Schauer @ 2023-10-19 12:18 UTC (permalink / raw)
  To: pve-devel

Signed-off-by: Filip Schauer <f.schauer@proxmox.com>
---
Is it reasonable to add a "dev[n]" argument to the pct.conf, given that
device mount points only allow passing through block devices?

 src/PVE/LXC.pm        | 14 ++++++++++++++
 src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
index c9b5ba7..6090534 100644
--- a/src/PVE/LXC.pm
+++ b/src/PVE/LXC.pm
@@ -639,6 +639,20 @@ sub update_lxc_config {
 	$raw .= "lxc.mount.auto = sys:mixed\n";
     }
 
+    foreach my $k (keys %$conf) {
+	next if $k !~ m/^dev(\d+)$/;
+	my $devpath = $conf->{$k};
+	die "Device $devpath does not exist\n" unless (-e $devpath);
+
+	my ($mode, $rdev) = (stat($devpath))[2, 6];
+	die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev);
+
+	my $major = PVE::Tools::dev_t_major($rdev);
+	my $minor = PVE::Tools::dev_t_minor($rdev);
+	$raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n";
+	$raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n";
+    }
+
     # WARNING: DO NOT REMOVE this without making sure that loop device nodes
     # cannot be exposed to the container with r/w access (cgroup perms).
     # When this is enabled mounts will still remain in the monitor's namespace
diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm
index 56e1f10..4665ab1 100644
--- a/src/PVE/LXC/Config.pm
+++ b/src/PVE/LXC/Config.pm
@@ -29,6 +29,7 @@ mkdir $lockdir;
 mkdir "/etc/pve/nodes/$nodename/lxc";
 my $MAX_MOUNT_POINTS = 256;
 my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS;
+my $MAX_DEVICES = 256;
 
 # BEGIN implemented abstract methods from PVE::AbstractConfig
 
@@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) {
     }
 }
 
+PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string);
+sub verify_lxc_dev_string {
+    my ($dev, $noerr) = @_;
+
+    if ($dev !~ m!^/dev/!) {
+	return undef if $noerr;
+	die "$dev does not start with /dev/\n";
+    }
+
+    return $dev;
+}
+
+my $dev_desc = {
+    dev => {
+	type => 'string',
+	default_key => 1,
+	format => 'pve-lxc-dev-string',
+	format_description => 'Path',
+	description => 'Device to pass through to the container',
+	verbose_description => 'Path to the device to pass through to the container'
+    }
+};
+
+for (my $i = 0; $i < $MAX_DEVICES; $i++) {
+    $confdesc->{"dev$i"} = {
+	optional => 1,
+	type => 'string', format => $dev_desc,
+	description => "Device to pass through to the container",
+    }
+}
+
 sub parse_pct_config {
     my ($filename, $raw, $strict) = @_;
 
-- 
2.39.2





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH RFC container] Add device passthrough
  2023-10-19 12:18 [pve-devel] [PATCH RFC container] Add device passthrough Filip Schauer
@ 2023-10-20  7:08 ` Wolfgang Bumiller
  2023-10-20  7:51   ` Dominik Csapak
  0 siblings, 1 reply; 6+ messages in thread
From: Wolfgang Bumiller @ 2023-10-20  7:08 UTC (permalink / raw)
  To: Filip Schauer; +Cc: pve-devel

On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote:
> Signed-off-by: Filip Schauer <f.schauer@proxmox.com>
> ---
> Is it reasonable to add a "dev[n]" argument to the pct.conf, given that
> device mount points only allow passing through block devices?

Why would they only allow block devices?

Also, Dominik recently added resource mappings for qemu for USB & PCI.
PCI might be tricky, but for USB we may be able to use these mappings as
well.
That said, "raw" `/dev` node pass-through still makes sense as a
separate thing for containers anyway since raw `lxc....` entries in the
container config can currently be very inconvenient to deal with
particularly with unprivileged containers (read on below for why...)

> 
>  src/PVE/LXC.pm        | 14 ++++++++++++++
>  src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> index c9b5ba7..6090534 100644
> --- a/src/PVE/LXC.pm
> +++ b/src/PVE/LXC.pm
> @@ -639,6 +639,20 @@ sub update_lxc_config {
>  	$raw .= "lxc.mount.auto = sys:mixed\n";
>      }
>  
> +    foreach my $k (keys %$conf) {
> +	next if $k !~ m/^dev(\d+)$/;

We should add a helper `sub foreach_passthrough_device()` taking a
sub which gets `$key, $raw_value, $sanitized_path`.

The last one would currently be the `substr($path, 1)`. Having to do
this later on just makes it easier to mess this up.

Also, seeing this now this reminds me that - but not as part of this
series - we should add a lot of `foreach_*` subs for this sort of
iteration like we have in qemu-server and replace such existing
constructs if possible...

> +	my $devpath = $conf->{$k};
> +	die "Device $devpath does not exist\n" unless (-e $devpath);
> +
> +	my ($mode, $rdev) = (stat($devpath))[2, 6];
> +	die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev);
> +
> +	my $major = PVE::Tools::dev_t_major($rdev);
> +	my $minor = PVE::Tools::dev_t_minor($rdev);
> +	$raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n";

^ the 'c' isn't necessarily correct.
You can use `S_ISBLK($mode) ? 'b' : 'c'`. `S_ISBLK` comes from the `Fcntl` package.
(You can find more info on file modes and S_ISBLK in `man 7 inode`)

> +	$raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n";

^ See above - the substr() there stands out quite weird :-)

> +    }
> +
>      # WARNING: DO NOT REMOVE this without making sure that loop device nodes
>      # cannot be exposed to the container with r/w access (cgroup perms).
>      # When this is enabled mounts will still remain in the monitor's namespace
> diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm
> index 56e1f10..4665ab1 100644
> --- a/src/PVE/LXC/Config.pm
> +++ b/src/PVE/LXC/Config.pm
> @@ -29,6 +29,7 @@ mkdir $lockdir;
>  mkdir "/etc/pve/nodes/$nodename/lxc";
>  my $MAX_MOUNT_POINTS = 256;
>  my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS;
> +my $MAX_DEVICES = 256;
>  
>  # BEGIN implemented abstract methods from PVE::AbstractConfig
>  
> @@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) {
>      }
>  }
>  
> +PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string);
> +sub verify_lxc_dev_string {
> +    my ($dev, $noerr) = @_;
> +
> +    if ($dev !~ m!^/dev/!) {
> +	return undef if $noerr;
> +	die "$dev does not start with /dev/\n";

This is where it gets interesting :-)

I can't find it now, but I faintly recall that at some point I had
suggested collecting devices in a different path than `/dev` to create
device nodes with the correct owner/group/permissions for unprivileged
containers to people on the forum.

See, we don't just need the device to be visible in the container, but
we also need it to be able to give the container's root user - which for
unprivileged containers is usually uid 100000 - access to it.

It *may* be fine for *some* devices to just `chown()` them in `/dev`,
but we generally don't want to reown devices on the host itself, so it
may make more sense to `mknod()` the same device somewhere in
`/var/lib/lxc/$vmid/passthrough` and `chown()` it there. (One
alternative would of course be ACL entries on the host, but IMO having a
directory specific for a container where you can see all its accessible
devics has its advantages...)

(With newer kernels we might be able to instead use an idmapped bind
mount, though, but pve-8's initial kernel does not support idmapping on
tmpfs yet :-/ (that starts at 6.3 IIRC))

We should also think about whether we'd like to support the containers
calling `mknod()` to "create" (but not really) those devices themselves.
for privileged containers this is easy enough, privileged means
privileged, so they can just do whatever.
For unprivileged containers, however, `mknod()` does not work at all,
and even if it did, we cannot allow it directly, since we have no
guarantee that after the next reboot the major/minor numbers will be the
same after the next reboot. So if we do want to support this, we'd need
to teach `pve-lxc-syscalld` to know about allowed devices and give it
the ability to hot-bind-mount such nodes (rather than actually
mknod()ing them).

> +    }
> +
> +    return $dev;
> +}
> +
> +my $dev_desc = {
> +    dev => {
> +	type => 'string',
> +	default_key => 1,
> +	format => 'pve-lxc-dev-string',
> +	format_description => 'Path',
> +	description => 'Device to pass through to the container',
> +	verbose_description => 'Path to the device to pass through to the container'
> +    }
> +};
> +
> +for (my $i = 0; $i < $MAX_DEVICES; $i++) {
> +    $confdesc->{"dev$i"} = {
> +	optional => 1,
> +	type => 'string', format => $dev_desc,
> +	description => "Device to pass through to the container",
> +    }
> +}
> +
>  sub parse_pct_config {
>      my ($filename, $raw, $strict) = @_;
>  
> -- 
> 2.39.2




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH RFC container] Add device passthrough
  2023-10-20  7:08 ` Wolfgang Bumiller
@ 2023-10-20  7:51   ` Dominik Csapak
  2023-10-20  8:29     ` Thomas Lamprecht
  0 siblings, 1 reply; 6+ messages in thread
From: Dominik Csapak @ 2023-10-20  7:51 UTC (permalink / raw)
  To: Proxmox VE development discussion, Wolfgang Bumiller, Filip Schauer

On 10/20/23 09:08, Wolfgang Bumiller wrote:
> On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote:
>> Signed-off-by: Filip Schauer <f.schauer@proxmox.com>
>> ---
>> Is it reasonable to add a "dev[n]" argument to the pct.conf, given that
>> device mount points only allow passing through block devices?
> 
> Why would they only allow block devices?
> 
> Also, Dominik recently added resource mappings for qemu for USB & PCI.
> PCI might be tricky, but for USB we may be able to use these mappings as
> well.
> That said, "raw" `/dev` node pass-through still makes sense as a
> separate thing for containers anyway since raw `lxc....` entries in the
> container config can currently be very inconvenient to deal with
> particularly with unprivileged containers (read on below for why...)

just to chime in here, i don't think it'll be easily possible to reuse
the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device
(or path) to a device node? i don't think thats generally possible, since
the driver does not always make that info easily available
(e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters
and /dev/ttySX ?) i guess it could work, but we probably would have
to implement that for every driver out there?

what i would like to see however is to integrate a new type of mapping
for container devices specifically so that the ux is the same
(create mappings for whole cluster, assigning privileges, etc)

we still can provide a 'raw' mechanic as well for those who
only ever use root@pam on a single node, but I'd not be
against only using mappings either




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH RFC container] Add device passthrough
  2023-10-20  7:51   ` Dominik Csapak
@ 2023-10-20  8:29     ` Thomas Lamprecht
  2023-10-20  8:39       ` Dominik Csapak
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Lamprecht @ 2023-10-20  8:29 UTC (permalink / raw)
  To: Proxmox VE development discussion, Dominik Csapak,
	Wolfgang Bumiller, Filip Schauer

Am 20/10/2023 um 09:51 schrieb Dominik Csapak:
> On 10/20/23 09:08, Wolfgang Bumiller wrote:
>> Also, Dominik recently added resource mappings for qemu for USB & PCI.
>> PCI might be tricky, but for USB we may be able to use these mappings as
>> well.
>> That said, "raw" `/dev` node pass-through still makes sense as a
>> separate thing for containers anyway since raw `lxc....` entries in the
>> container config can currently be very inconvenient to deal with
>> particularly with unprivileged containers (read on below for why...)
> 
> just to chime in here, i don't think it'll be easily possible to reuse
> the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device
> (or path) to a device node? i don't think thats generally possible, since
> the driver does not always make that info easily available
> (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters
> and /dev/ttySX ?) i guess it could work, but we probably would have
> to implement that for every driver out there?
> 

USB should be workable via resolving to /dev/bus/usb/*, PCI could be,
theoretically, but isn't now and probably won't be anytime soon – i.e.,
as Wolfgang mentioned off list, there's a reason that there's no
/dev/bus/pci/

> what i would like to see however is to integrate a new type of mapping
> for container devices specifically so that the ux is the same
> (create mappings for whole cluster, assigning privileges, etc)

I'd try hard to re-use the USB mappings, those seem to be one of the
most common pass-through setups for containers (e.g., for those
home automation zigbee/matter/... adapters, or in some countries DVB-T
ones, be it for TV or ADS-B plane tracking).

If we can make USB work then we'd have the basic concept of attaching
a mapping done, adding a new type of (block/char) device mapping could
then be an independent task for later to keep scope a bit smaller.

Fixing Wolfgang's comments for workable pass-through for unprivileged
containers is probably the most important change needed for now.

I'd then even be open to apply this with (root@pam only!) absolute
path to /dev support only, but IMO resolving the mapping itself should
not be too hard, so if using /dev/bus/usb/ works having that in there
from the start would be definitively nice.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH RFC container] Add device passthrough
  2023-10-20  8:29     ` Thomas Lamprecht
@ 2023-10-20  8:39       ` Dominik Csapak
  2023-10-24 13:00         ` Filip Schauer
  0 siblings, 1 reply; 6+ messages in thread
From: Dominik Csapak @ 2023-10-20  8:39 UTC (permalink / raw)
  To: Thomas Lamprecht, Proxmox VE development discussion,
	Wolfgang Bumiller, Filip Schauer

On 10/20/23 10:29, Thomas Lamprecht wrote:
> Am 20/10/2023 um 09:51 schrieb Dominik Csapak:
>> On 10/20/23 09:08, Wolfgang Bumiller wrote:
>>> Also, Dominik recently added resource mappings for qemu for USB & PCI.
>>> PCI might be tricky, but for USB we may be able to use these mappings as
>>> well.
>>> That said, "raw" `/dev` node pass-through still makes sense as a
>>> separate thing for containers anyway since raw `lxc....` entries in the
>>> container config can currently be very inconvenient to deal with
>>> particularly with unprivileged containers (read on below for why...)
>>
>> just to chime in here, i don't think it'll be easily possible to reuse
>> the pci/usb maps as is since we'd have to map from pciid /usb-vendor/device
>> (or path) to a device node? i don't think thats generally possible, since
>> the driver does not always make that info easily available
>> (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters
>> and /dev/ttySX ?) i guess it could work, but we probably would have
>> to implement that for every driver out there?
>>
> 
> USB should be workable via resolving to /dev/bus/usb/*, PCI could be,
> theoretically, but isn't now and probably won't be anytime soon – i.e.,
> as Wolfgang mentioned off list, there's a reason that there's no
> /dev/bus/pci/
> 
>> what i would like to see however is to integrate a new type of mapping
>> for container devices specifically so that the ux is the same
>> (create mappings for whole cluster, assigning privileges, etc)
> 
> I'd try hard to re-use the USB mappings, those seem to be one of the
> most common pass-through setups for containers (e.g., for those
> home automation zigbee/matter/... adapters, or in some countries DVB-T
> ones, be it for TV or ADS-B plane tracking).

i guess, but sadly the /dev/bus/usb endpoint is mostly not what you want
to pass-through but the driver specific /dev/ttySX /dev/dvb/X and so on
(there are situations where the /dev/bus/usb path is the wanted one,
but there are many where it isn't)

and while we can map from those to the usb device/vendor/path the reverse
mapping is not so easy (at least when i tried i did not found a generic way
via udev or similar

i would welcome it though if there is a way to do that of course

> 
> If we can make USB work then we'd have the basic concept of attaching
> a mapping done, adding a new type of (block/char) device mapping could
> then be an independent task for later to keep scope a bit smaller.
> 
> Fixing Wolfgang's comments for workable pass-through for unprivileged
> containers is probably the most important change needed for now.
> 
> I'd then even be open to apply this with (root@pam only!) absolute
> path to /dev support only, but IMO resolving the mapping itself should
> not be too hard, so if using /dev/bus/usb/ works having that in there
> from the start would be definitively nice.





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [pve-devel] [PATCH RFC container] Add device passthrough
  2023-10-20  8:39       ` Dominik Csapak
@ 2023-10-24 13:00         ` Filip Schauer
  0 siblings, 0 replies; 6+ messages in thread
From: Filip Schauer @ 2023-10-24 13:00 UTC (permalink / raw)
  To: Dominik Csapak, Thomas Lamprecht,
	Proxmox VE development discussion, Wolfgang Bumiller

A patch v2 is available:

https://lists.proxmox.com/pipermail/pve-devel/2023-October/059617.html

On 20/10/2023 10:39, Dominik Csapak wrote:
> On 10/20/23 10:29, Thomas Lamprecht wrote:
>> Am 20/10/2023 um 09:51 schrieb Dominik Csapak:
>>> On 10/20/23 09:08, Wolfgang Bumiller wrote:
>>>> Also, Dominik recently added resource mappings for qemu for USB & PCI.
>>>> PCI might be tricky, but for USB we may be able to use these 
>>>> mappings as
>>>> well.
>>>> That said, "raw" `/dev` node pass-through still makes sense as a
>>>> separate thing for containers anyway since raw `lxc....` entries in 
>>>> the
>>>> container config can currently be very inconvenient to deal with
>>>> particularly with unprivileged containers (read on below for why...)
>>>
>>> just to chime in here, i don't think it'll be easily possible to reuse
>>> the pci/usb maps as is since we'd have to map from pciid 
>>> /usb-vendor/device
>>> (or path) to a device node? i don't think thats generally possible, 
>>> since
>>> the driver does not always make that info easily available
>>> (e.g. multi gpu setup and /dev/dri/cardX, or usb-to-serial adapters
>>> and /dev/ttySX ?) i guess it could work, but we probably would have
>>> to implement that for every driver out there?
>>>
>>
>> USB should be workable via resolving to /dev/bus/usb/*, PCI could be,
>> theoretically, but isn't now and probably won't be anytime soon – i.e.,
>> as Wolfgang mentioned off list, there's a reason that there's no
>> /dev/bus/pci/
>>
>>> what i would like to see however is to integrate a new type of mapping
>>> for container devices specifically so that the ux is the same
>>> (create mappings for whole cluster, assigning privileges, etc)
>>
>> I'd try hard to re-use the USB mappings, those seem to be one of the
>> most common pass-through setups for containers (e.g., for those
>> home automation zigbee/matter/... adapters, or in some countries DVB-T
>> ones, be it for TV or ADS-B plane tracking).
>
> i guess, but sadly the /dev/bus/usb endpoint is mostly not what you want
> to pass-through but the driver specific /dev/ttySX /dev/dvb/X and so on
> (there are situations where the /dev/bus/usb path is the wanted one,
> but there are many where it isn't)
>
> and while we can map from those to the usb device/vendor/path the reverse
> mapping is not so easy (at least when i tried i did not found a 
> generic way
> via udev or similar
>
> i would welcome it though if there is a way to do that of course
>
>>
>> If we can make USB work then we'd have the basic concept of attaching
>> a mapping done, adding a new type of (block/char) device mapping could
>> then be an independent task for later to keep scope a bit smaller.
>>
>> Fixing Wolfgang's comments for workable pass-through for unprivileged
>> containers is probably the most important change needed for now.
>>
>> I'd then even be open to apply this with (root@pam only!) absolute
>> path to /dev support only, but IMO resolving the mapping itself should
>> not be too hard, so if using /dev/bus/usb/ works having that in there
>> from the start would be definitively nice.
>
>




^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-10-24 13:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 12:18 [pve-devel] [PATCH RFC container] Add device passthrough Filip Schauer
2023-10-20  7:08 ` Wolfgang Bumiller
2023-10-20  7:51   ` Dominik Csapak
2023-10-20  8:29     ` Thomas Lamprecht
2023-10-20  8:39       ` Dominik Csapak
2023-10-24 13:00         ` Filip Schauer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal