From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 8F0799BB0D for ; Fri, 20 Oct 2023 09:08:36 +0200 (CEST) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id 6D3DF309D8 for ; Fri, 20 Oct 2023 09:08:06 +0200 (CEST) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS for ; Fri, 20 Oct 2023 09:08:05 +0200 (CEST) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 2B7364350E for ; Fri, 20 Oct 2023 09:08:05 +0200 (CEST) Date: Fri, 20 Oct 2023 09:08:03 +0200 From: Wolfgang Bumiller To: Filip Schauer Cc: pve-devel@lists.proxmox.com Message-ID: References: <20231019121856.379185-1-f.schauer@proxmox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20231019121856.379185-1-f.schauer@proxmox.com> X-SPAM-LEVEL: Spam detection results: 0 AWL 0.099 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record URIBL_BLOCKED 0.001 ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [config.pm, lxc.pm, mount.auto] Subject: Re: [pve-devel] [PATCH RFC container] Add device passthrough X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Oct 2023 07:08:36 -0000 On Thu, Oct 19, 2023 at 02:18:56PM +0200, Filip Schauer wrote: > Signed-off-by: Filip Schauer > --- > Is it reasonable to add a "dev[n]" argument to the pct.conf, given that > device mount points only allow passing through block devices? Why would they only allow block devices? Also, Dominik recently added resource mappings for qemu for USB & PCI. PCI might be tricky, but for USB we may be able to use these mappings as well. That said, "raw" `/dev` node pass-through still makes sense as a separate thing for containers anyway since raw `lxc....` entries in the container config can currently be very inconvenient to deal with particularly with unprivileged containers (read on below for why...) > > src/PVE/LXC.pm | 14 ++++++++++++++ > src/PVE/LXC/Config.pm | 32 ++++++++++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm > index c9b5ba7..6090534 100644 > --- a/src/PVE/LXC.pm > +++ b/src/PVE/LXC.pm > @@ -639,6 +639,20 @@ sub update_lxc_config { > $raw .= "lxc.mount.auto = sys:mixed\n"; > } > > + foreach my $k (keys %$conf) { > + next if $k !~ m/^dev(\d+)$/; We should add a helper `sub foreach_passthrough_device()` taking a sub which gets `$key, $raw_value, $sanitized_path`. The last one would currently be the `substr($path, 1)`. Having to do this later on just makes it easier to mess this up. Also, seeing this now this reminds me that - but not as part of this series - we should add a lot of `foreach_*` subs for this sort of iteration like we have in qemu-server and replace such existing constructs if possible... > + my $devpath = $conf->{$k}; > + die "Device $devpath does not exist\n" unless (-e $devpath); > + > + my ($mode, $rdev) = (stat($devpath))[2, 6]; > + die "Could not find major and minor ids of device $devpath.\n" unless ($mode && $rdev); > + > + my $major = PVE::Tools::dev_t_major($rdev); > + my $minor = PVE::Tools::dev_t_minor($rdev); > + $raw .= "lxc.cgroup2.devices.allow = c $major:$minor rw\n"; ^ the 'c' isn't necessarily correct. You can use `S_ISBLK($mode) ? 'b' : 'c'`. `S_ISBLK` comes from the `Fcntl` package. (You can find more info on file modes and S_ISBLK in `man 7 inode`) > + $raw .= "lxc.mount.entry = $devpath " . substr($devpath, 1) . " none bind,create=file\n"; ^ See above - the substr() there stands out quite weird :-) > + } > + > # WARNING: DO NOT REMOVE this without making sure that loop device nodes > # cannot be exposed to the container with r/w access (cgroup perms). > # When this is enabled mounts will still remain in the monitor's namespace > diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm > index 56e1f10..4665ab1 100644 > --- a/src/PVE/LXC/Config.pm > +++ b/src/PVE/LXC/Config.pm > @@ -29,6 +29,7 @@ mkdir $lockdir; > mkdir "/etc/pve/nodes/$nodename/lxc"; > my $MAX_MOUNT_POINTS = 256; > my $MAX_UNUSED_DISKS = $MAX_MOUNT_POINTS; > +my $MAX_DEVICES = 256; > > # BEGIN implemented abstract methods from PVE::AbstractConfig > > @@ -908,6 +909,37 @@ for (my $i = 0; $i < $MAX_UNUSED_DISKS; $i++) { > } > } > > +PVE::JSONSchema::register_format('pve-lxc-dev-string', \&verify_lxc_dev_string); > +sub verify_lxc_dev_string { > + my ($dev, $noerr) = @_; > + > + if ($dev !~ m!^/dev/!) { > + return undef if $noerr; > + die "$dev does not start with /dev/\n"; This is where it gets interesting :-) I can't find it now, but I faintly recall that at some point I had suggested collecting devices in a different path than `/dev` to create device nodes with the correct owner/group/permissions for unprivileged containers to people on the forum. See, we don't just need the device to be visible in the container, but we also need it to be able to give the container's root user - which for unprivileged containers is usually uid 100000 - access to it. It *may* be fine for *some* devices to just `chown()` them in `/dev`, but we generally don't want to reown devices on the host itself, so it may make more sense to `mknod()` the same device somewhere in `/var/lib/lxc/$vmid/passthrough` and `chown()` it there. (One alternative would of course be ACL entries on the host, but IMO having a directory specific for a container where you can see all its accessible devics has its advantages...) (With newer kernels we might be able to instead use an idmapped bind mount, though, but pve-8's initial kernel does not support idmapping on tmpfs yet :-/ (that starts at 6.3 IIRC)) We should also think about whether we'd like to support the containers calling `mknod()` to "create" (but not really) those devices themselves. for privileged containers this is easy enough, privileged means privileged, so they can just do whatever. For unprivileged containers, however, `mknod()` does not work at all, and even if it did, we cannot allow it directly, since we have no guarantee that after the next reboot the major/minor numbers will be the same after the next reboot. So if we do want to support this, we'd need to teach `pve-lxc-syscalld` to know about allowed devices and give it the ability to hot-bind-mount such nodes (rather than actually mknod()ing them). > + } > + > + return $dev; > +} > + > +my $dev_desc = { > + dev => { > + type => 'string', > + default_key => 1, > + format => 'pve-lxc-dev-string', > + format_description => 'Path', > + description => 'Device to pass through to the container', > + verbose_description => 'Path to the device to pass through to the container' > + } > +}; > + > +for (my $i = 0; $i < $MAX_DEVICES; $i++) { > + $confdesc->{"dev$i"} = { > + optional => 1, > + type => 'string', format => $dev_desc, > + description => "Device to pass through to the container", > + } > +} > + > sub parse_pct_config { > my ($filename, $raw, $strict) = @_; > > -- > 2.39.2