From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by lists.proxmox.com (Postfix) with ESMTPS id 169A49FF95 for ; Tue, 7 Nov 2023 14:49:55 +0100 (CET) Received: from firstgate.proxmox.com (localhost [127.0.0.1]) by firstgate.proxmox.com (Proxmox) with ESMTP id EBFFA335B5 for ; Tue, 7 Nov 2023 14:49:24 +0100 (CET) Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com [94.136.29.106]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by firstgate.proxmox.com (Proxmox) with ESMTPS for ; Tue, 7 Nov 2023 14:49:24 +0100 (CET) Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1]) by proxmox-new.maurer-it.com (Proxmox) with ESMTP id EE40846E9D for ; Tue, 7 Nov 2023 14:49:23 +0100 (CET) Message-ID: <669decad-8bd6-add9-a9cc-564aa08afb16@proxmox.com> Date: Tue, 7 Nov 2023 14:49:22 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0 To: Wolfgang Bumiller Cc: pve-devel@lists.proxmox.com References: <20231024125554.131800-1-f.schauer@proxmox.com> <20231024125554.131800-2-f.schauer@proxmox.com> <2rzmdty5ax4v5fssxkvjey4rfhzrcdmjzx5dti4m73lpbekqcf@3wna2j3j2jks> <5e29095a-a07e-ef36-22e9-90b0a2f78f90@proxmox.com> Content-Language: en-US From: Filip Schauer In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-SPAM-LEVEL: Spam detection results: 0 AWL 1.316 Adjusted score from AWL reputation of From: address BAYES_00 -1.9 Bayes spam probability is 0 to 1% DMARC_MISSING 0.1 Missing DMARC policy KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment NICE_REPLY_A -3.085 Looks like a legit reply (A) SPF_HELO_NONE 0.001 SPF: HELO does not publish an SPF Record SPF_PASS -0.001 SPF: sender matches SPF record T_SCC_BODY_TEXT_LINE -0.01 - Subject: Re: [pve-devel] [PATCH v2 container 1/1] Add device passthrough X-BeenThere: pve-devel@lists.proxmox.com X-Mailman-Version: 2.1.29 Precedence: list List-Id: Proxmox VE development discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Nov 2023 13:49:55 -0000 Patch v3 available: https://lists.proxmox.com/pipermail/pve-devel/2023-November/059844.html On 03/11/2023 09:14, Wolfgang Bumiller wrote: > On Thu, Nov 02, 2023 at 03:28:22PM +0100, Filip Schauer wrote: >> On 30/10/2023 14:34, Wolfgang Bumiller wrote: >>> On Tue, Oct 24, 2023 at 02:55:53PM +0200, Filip Schauer wrote: >>>> Add a dev[n] argument to the container config to pass devices through to >>>> a container. A device can be passed by its path. Alternatively a mapped >>>> USB device can be passed through with usbmapping=. >>>> >>>> Signed-off-by: Filip Schauer >>>> --- >>>> src/PVE/LXC.pm | 34 +++++++++++++++++++++++- >>>> src/PVE/LXC/Config.pm | 60 +++++++++++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 93 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm >>>> index c9b5ba7..a3ddb62 100644 >>>> --- a/src/PVE/LXC.pm >>>> +++ b/src/PVE/LXC.pm >>>> @@ -5,7 +5,8 @@ use warnings; >>>> use Cwd qw(); >>>> use Errno qw(ELOOP ENOTDIR EROFS ECONNREFUSED EEXIST); >>>> -use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY); >>>> +use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY :mode); >>>> +use File::Basename; >>>> use File::Path; >>>> use File::Spec; >>>> use IO::Poll qw(POLLIN POLLHUP); >>>> @@ -639,6 +640,37 @@ sub update_lxc_config { >>>> $raw .= "lxc.mount.auto = sys:mixed\n"; >>>> } >>>> + # Clear passthrough directory from previous run >>>> + my $passthrough_dir = "/var/lib/lxc/$vmid/passthrough"; >>>> + File::Path::rmtree($passthrough_dir); >>> I think we need to make a few changes here. >>> >>> First: we don't necessarily need this directory. >>> Having a device list would certainly be nice, but it makes more sense to >>> just have a file we can easily parse (possibly even just a json hash), >>> like the `devices` file we already create in the pre-start hook, except >>> prepared *for* the pre-start hook, which *should* be able to just >>> `mknod` the devices right into the container's `/dev` on startup. >> >> Devices mknoded into the container's /dev directory in the pre-start >> hook will not be visible in the container once it is fully started. > Ah yes, I keep ignoring that. > >> Meanwhile mknoding a device to a different path inside the container >> works fine. It seems that LXC mounts over the /dev directory. This can > /dev will be a tmpfs, yes. > >> be solved by calling mknod in lxc-pve-autodev-hook, but this does not >> work with unprivileged containers without the mknod capability. >> >> So are bind mounts our only option without modifying LXC, >> or am I overlooking something? > Sort of. We *could* still do this via a separate process we signal from > out of the autodev hook to do the work for it, but that'll make the > startup process even more convoluted. > And I think the seccomp proxying only starts after the entire init > setup, so we also can't just reuly on syscalld (of which the entire > point is to do mknods for the container 🙄). > > I'm also working on a seccomp wrapper to allow unprivileged restores of > backups to `mknod()` the basics, but that, too, happens via seccomp, so > not really reusable in this case either (and syscalld is not suitable > for *this* either (for now) as it uses an lxc specific protocol and does > not by itself perform the seccomp setup...) > > Perhaps there's a way to unify all that (at least partially) by teaching > syscalld an additional protocol we can use in all 3 cases (although the > the requirements are slightly different... here we only have "known" > paths & permissions, so we wouldn't need to deal with copying another > process' rootfs/chroot/fds/... to perform a syscall on their behalf, > which the other cases do need...) > > So yeah, I suppose we can go the bind-mount route first, as it is > simpler, and then maybe change it later. > > However, I still don't want to fill `/var/lib/lxc` on the host with > device nodes directly whenever we update the config via > `update_lxc_config()`. > > So how about this: > > In the prestart hook: > - mount a tmpfs to this path > - mknod the devices into it > And then in the autodev hook do the bind-mounting. > >> >>> We'd also avoid "lingering" device nodes with potentially harmful >>> uid/permissions in /var, which is certainly better from a security POV. >>> >>> But note that we do need the `lxc.cgroup2.*` entries before starting the >>> container in order to ensure the devices cgroup has the right >>> permissions. >>> >>>> + >>>> + PVE::LXC::Config->foreach_passthrough_device($conf, sub { >>>> + my ($key, $sanitized_path) = @_; >>>> + >>>> + my $absolute_path = "/$sanitized_path"; >>>> + my ($mode, $rdev) = (stat($absolute_path))[2, 6]; >>>> + die "Could not find major and minor ids of device $absolute_path.\n" >>>> + unless ($mode && $rdev); >>>> + >>>> + my $major = PVE::Tools::dev_t_major($rdev); >>>> + my $minor = PVE::Tools::dev_t_minor($rdev); >>>> + my $device_type_char = S_ISBLK($mode) ? 'b' : 'c'; >>>> + my $passthrough_device_path = "$passthrough_dir/$sanitized_path"; >>>> + File::Path::make_path(dirname($passthrough_device_path)); >>>> + PVE::Tools::run_command([ >>>> + '/usr/bin/mknod', >>>> + '-m', '0660', > Btw. with a property string used for the device entry, we could probably > also have an optional `mode` to use instead of `0660`, as well as a > `uid` and `gid` - but we'd need to map those with the container's id > mapping. Not sure if we already have helpers for that apart from getting > the root ids. > >>>> + $passthrough_device_path, >>>> + $device_type_char, >>>> + $major, >>>> + $minor >>>> + ]); >>> It's probably worth adding a helper for the mknod syscall to >>> `PVE::Tools`, there are a bunch of syscalls in there already. >>> >>>> + chown 100000, 100000, $passthrough_device_path if ($unprivileged); >>> ^ This isn't necessarily the correct id. Users may have custom id >>> mappings. >>> `PVE::LXC::parse_id_maps($conf)` returns the mapping alongside the root >>> uid and gid. (See for example `sub mount_all` for how it's used. >>> >>>> + >>>> + $raw .= "lxc.cgroup2.devices.allow = $device_type_char $major:$minor rw\n"; >>>> + $raw .= "lxc.mount.entry = $passthrough_device_path $sanitized_path none bind,create=file\n"; >>>> + }); >>>> + >>>> # WARNING: DO NOT REMOVE this without making sure that loop device nodes >>>> # cannot be exposed to the container with r/w access (cgroup perms). >>>> # When this is enabled mounts will still remain in the monitor's namespace