Date: Fri, 3 Nov 2023 09:14:56 +0100
From: Wolfgang Bumiller
To: Filip Schauer
Cc: pve-devel@lists.proxmox.com
References: <20231024125554.131800-1-f.schauer@proxmox.com>
 <20231024125554.131800-2-f.schauer@proxmox.com>
 <2rzmdty5ax4v5fssxkvjey4rfhzrcdmjzx5dti4m73lpbekqcf@3wna2j3j2jks>
 <5e29095a-a07e-ef36-22e9-90b0a2f78f90@proxmox.com>
In-Reply-To: <5e29095a-a07e-ef36-22e9-90b0a2f78f90@proxmox.com>
Subject: Re: [pve-devel] [PATCH v2 container 1/1] Add device passthrough

On Thu, Nov 02, 2023 at 03:28:22PM +0100, Filip Schauer wrote:
>
> On 30/10/2023 14:34, Wolfgang Bumiller wrote:
> > On Tue, Oct 24, 2023 at 02:55:53PM +0200, Filip Schauer wrote:
> > > Add a dev[n] argument to the container config to pass devices through to
> > > a container. A device can be passed by its path. Alternatively a mapped
> > > USB device can be passed through with usbmapping=.
> > >
> > > Signed-off-by: Filip Schauer
> > > ---
> > >  src/PVE/LXC.pm        | 34 +++++++++++++++++++++++-
> > >  src/PVE/LXC/Config.pm | 60 +++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 93 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
> > > index c9b5ba7..a3ddb62 100644
> > > --- a/src/PVE/LXC.pm
> > > +++ b/src/PVE/LXC.pm
> > > @@ -5,7 +5,8 @@ use warnings;
> > >  use Cwd qw();
> > >  use Errno qw(ELOOP ENOTDIR EROFS ECONNREFUSED EEXIST);
> > > -use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY);
> > > +use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY :mode);
> > > +use File::Basename;
> > >  use File::Path;
> > >  use File::Spec;
> > >  use IO::Poll qw(POLLIN POLLHUP);
> > > @@ -639,6 +640,37 @@ sub update_lxc_config {
> > >          $raw .= "lxc.mount.auto = sys:mixed\n";
> > >      }
> > > +    # Clear passthrough directory from previous run
> > > +    my $passthrough_dir = "/var/lib/lxc/$vmid/passthrough";
> > > +    File::Path::rmtree($passthrough_dir);
> > I think we need to make a few changes
here.
> >
> > First: we don't necessarily need this directory.
> > Having a device list would certainly be nice, but it makes more sense to
> > just have a file we can easily parse (possibly even just a json hash),
> > like the `devices` file we already create in the pre-start hook, except
> > prepared *for* the pre-start hook, which *should* be able to just
> > `mknod` the devices right into the container's `/dev` on startup.
> >
>
> Devices mknoded into the container's /dev directory in the pre-start
> hook will not be visible in the container once it is fully started.

Ah yes, I keep ignoring that.

> Meanwhile mknoding a device to a different path inside the container
> works fine. It seems that LXC mounts over the /dev directory. This can

/dev will be a tmpfs, yes.

> be solved by calling mknod in lxc-pve-autodev-hook, but this does not
> work with unprivileged containers without the mknod capability.
>
> So are bind mounts our only option without modifying LXC,
> or am I overlooking something?

Sort of. We *could* still do this via a separate process we signal from
out of the autodev hook to do the work for it, but that'll make the
startup process even more convoluted.
And I think the seccomp proxying only starts after the entire init
setup, so we also can't just rely on syscalld (of which the entire
point is to do mknods for the container 🙄).

I'm also working on a seccomp wrapper to allow unprivileged restores of
backups to `mknod()` the basics, but that, too, happens via seccomp, so
not really reusable in this case either (and syscalld is not suitable
for *this* either (for now) as it uses an lxc specific protocol and does
not by itself perform the seccomp setup...)

Perhaps there's a way to unify all that (at least partially) by teaching
syscalld an additional protocol we can use in all 3 cases (although the
requirements are slightly different...
here we only have "known" paths & permissions, so we wouldn't need to
deal with copying another process' rootfs/chroot/fds/... to perform a
syscall on their behalf, which the other cases do need...)

So yeah, I suppose we can go the bind-mount route first, as it is
simpler, and then maybe change it later.

However, I still don't want to fill `/var/lib/lxc` on the host with
device nodes directly whenever we update the config via
`update_lxc_config()`.

So how about this:
In the prestart hook:
- mount a tmpfs to this path
- mknod the devices into it
And then in the autodev hook do the bind-mounting.

> >
> > We'd also avoid "lingering" device nodes with potentially harmful
> > uid/permissions in /var, which is certainly better from a security POV.
> >
> > But note that we do need the `lxc.cgroup2.*` entries before starting the
> > container in order to ensure the devices cgroup has the right
> > permissions.
> >
> > > +
> > > +    PVE::LXC::Config->foreach_passthrough_device($conf, sub {
> > > +        my ($key, $sanitized_path) = @_;
> > > +
> > > +        my $absolute_path = "/$sanitized_path";
> > > +        my ($mode, $rdev) = (stat($absolute_path))[2, 6];
> > > +        die "Could not find major and minor ids of device $absolute_path.\n"
> > > +            unless ($mode && $rdev);
> > > +
> > > +        my $major = PVE::Tools::dev_t_major($rdev);
> > > +        my $minor = PVE::Tools::dev_t_minor($rdev);
> > > +        my $device_type_char = S_ISBLK($mode) ? 'b' : 'c';
> > > +        my $passthrough_device_path = "$passthrough_dir/$sanitized_path";
> > > +        File::Path::make_path(dirname($passthrough_device_path));
> > > +        PVE::Tools::run_command([
> > > +            '/usr/bin/mknod',
> > > +            '-m', '0660',

Btw. with a property string used for the device entry, we could probably
also have an optional `mode` to use instead of `0660`, as well as a
`uid` and `gid` - but we'd need to map those with the container's id
mapping. Not sure if we already have helpers for that apart from getting
the root ids.
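
For illustration, mapping a container-side `uid`/`gid` to the host could
look roughly like the sketch below. `map_ct_id_to_host` is a made-up
name, not an existing helper, and the `[type, ct_start, host_start,
count]` layout simply mirrors `lxc.idmap` lines; that this matches the
structure returned by `parse_id_maps` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PVE::LXC): translate a container-side id
# into the corresponding host-side id.
# $id_map is a list of [ type, ct_start, host_start, count ] entries, with
# type 'u' for uids and 'g' for gids, mirroring lxc.idmap lines.
sub map_ct_id_to_host {
    my ($id_map, $type, $ct_id) = @_;

    for my $entry (@$id_map) {
        my ($etype, $ct_start, $host_start, $count) = @$entry;
        next if $etype ne $type;
        # The entry covers the range [ct_start, ct_start + count).
        if ($ct_id >= $ct_start && $ct_id < $ct_start + $count) {
            return $host_start + ($ct_id - $ct_start);
        }
    }

    return undef; # the id is not mapped into the container at all
}

# Example mapping: default unprivileged uid range, plus a custom gid map
# that passes host gid 44 straight through to the container.
my $example_map = [
    ['u', 0, 100000, 65536],
    ['g', 0, 100000, 44],
    ['g', 44, 44, 1],
    ['g', 45, 100045, 65491],
];

my $host_uid = map_ct_id_to_host($example_map, 'u', 0);
my $host_gid = map_ct_id_to_host($example_map, 'g', 44);
print "container 0:44 maps to host $host_uid:$host_gid\n";
```

An unmapped id yields `undef`, which the caller would have to reject
rather than pass on to `chown`.
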
> > > +            $passthrough_device_path,
> > > +            $device_type_char,
> > > +            $major,
> > > +            $minor
> > > +        ]);
> > It's probably worth adding a helper for the mknod syscall to
> > `PVE::Tools`, there are a bunch of syscalls in there already.
> >
> > > +        chown 100000, 100000, $passthrough_device_path if ($unprivileged);
> > ^ This isn't necessarily the correct id. Users may have custom id
> > mappings.
> > `PVE::LXC::parse_id_maps($conf)` returns the mapping alongside the root
> > uid and gid. (See for example `sub mount_all` for how it's used.)
> >
> > > +
> > > +        $raw .= "lxc.cgroup2.devices.allow = $device_type_char $major:$minor rw\n";
> > > +        $raw .= "lxc.mount.entry = $passthrough_device_path $sanitized_path none bind,create=file\n";
> > > +    });
> > > +
> > >      # WARNING: DO NOT REMOVE this without making sure that loop device nodes
> > >      # cannot be exposed to the container with r/w access (cgroup perms).
> > >      # When this is enabled mounts will still remain in the monitor's namespace
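
To illustrate the `lxc.cgroup2.*` point: the allow entry only depends on
the device type and its major/minor numbers, so it can be generated at
config time without creating any node on the host. A rough standalone
sketch follows; `dev_major`, `dev_minor` and `devices_allow_line` are
hypothetical names (the patch itself uses `PVE::Tools::dev_t_major` and
`dev_t_minor`), and the bit layout assumed is the Linux/glibc `dev_t`
encoding.

```perl
use strict;
use warnings;
use Fcntl qw(:mode);

# Split a Linux dev_t into major/minor (glibc encoding, assumed here).
sub dev_major {
    my ($rdev) = @_;
    return (($rdev >> 32) & 0xfffff000) | (($rdev >> 8) & 0x00000fff);
}

sub dev_minor {
    my ($rdev) = @_;
    return (($rdev >> 12) & 0xffffff00) | ($rdev & 0x000000ff);
}

# Build the devices cgroup entry from the st_mode and st_rdev of a node.
# Unlike a plain "block, else char" check, this rejects non-device files.
sub devices_allow_line {
    my ($mode, $rdev) = @_;

    my $type = S_ISBLK($mode) ? 'b'
             : S_ISCHR($mode) ? 'c'
             : die "not a block or character device\n";

    my ($major, $minor) = (dev_major($rdev), dev_minor($rdev));
    return "lxc.cgroup2.devices.allow = $type $major:$minor rw\n";
}

# Usage with synthetic values for a block device 8:0; with a real node the
# inputs would come from (stat($path))[2, 6].
print devices_allow_line(S_IFBLK | 0660, (8 << 8) | 0);
```
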