From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.schauer@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 169A49FF95
 for <pve-devel@lists.proxmox.com>; Tue,  7 Nov 2023 14:49:55 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id EBFFA335B5
 for <pve-devel@lists.proxmox.com>; Tue,  7 Nov 2023 14:49:24 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Tue,  7 Nov 2023 14:49:24 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id EE40846E9D
 for <pve-devel@lists.proxmox.com>; Tue,  7 Nov 2023 14:49:23 +0100 (CET)
Message-ID: <669decad-8bd6-add9-a9cc-564aa08afb16@proxmox.com>
Date: Tue, 7 Nov 2023 14:49:22 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.14.0
To: Wolfgang Bumiller <w.bumiller@proxmox.com>
Cc: pve-devel@lists.proxmox.com
References: <20231024125554.131800-1-f.schauer@proxmox.com>
 <20231024125554.131800-2-f.schauer@proxmox.com>
 <2rzmdty5ax4v5fssxkvjey4rfhzrcdmjzx5dti4m73lpbekqcf@3wna2j3j2jks>
 <5e29095a-a07e-ef36-22e9-90b0a2f78f90@proxmox.com>
 <qjytkl3ubzo3mgtzr7szgjk74rjfmrhb2a7qvvjsavvrtavzp2@hioih3bwjmx7>
Content-Language: en-US
From: Filip Schauer <f.schauer@proxmox.com>
In-Reply-To: <qjytkl3ubzo3mgtzr7szgjk74rjfmrhb2a7qvvjsavvrtavzp2@hioih3bwjmx7>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 1.316 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 DMARC_MISSING             0.1 Missing DMARC policy
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -3.085 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
 T_SCC_BODY_TEXT_LINE    -0.01 -
Subject: Re: [pve-devel] [PATCH v2 container 1/1] Add device passthrough
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Tue, 07 Nov 2023 13:49:55 -0000

Patch v3 available:

https://lists.proxmox.com/pipermail/pve-devel/2023-November/059844.html

On 03/11/2023 09:14, Wolfgang Bumiller wrote:
> On Thu, Nov 02, 2023 at 03:28:22PM +0100, Filip Schauer wrote:
>> On 30/10/2023 14:34, Wolfgang Bumiller wrote:
>>> On Tue, Oct 24, 2023 at 02:55:53PM +0200, Filip Schauer wrote:
>>>> Add a dev[n] argument to the container config to pass devices through to
>>>> a container. A device can be passed by its path. Alternatively a mapped
>>>> USB device can be passed through with usbmapping=<name>.
>>>>
>>>> Signed-off-by: Filip Schauer<f.schauer@proxmox.com>
>>>> ---
>>>>    src/PVE/LXC.pm        | 34 +++++++++++++++++++++++-
>>>>    src/PVE/LXC/Config.pm | 60 +++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 93 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/src/PVE/LXC.pm b/src/PVE/LXC.pm
>>>> index c9b5ba7..a3ddb62 100644
>>>> --- a/src/PVE/LXC.pm
>>>> +++ b/src/PVE/LXC.pm
>>>> @@ -5,7 +5,8 @@ use warnings;
>>>>    use Cwd qw();
>>>>    use Errno qw(ELOOP ENOTDIR EROFS ECONNREFUSED EEXIST);
>>>> -use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY);
>>>> +use Fcntl qw(O_RDONLY O_WRONLY O_NOFOLLOW O_DIRECTORY :mode);
>>>> +use File::Basename;
>>>>    use File::Path;
>>>>    use File::Spec;
>>>>    use IO::Poll qw(POLLIN POLLHUP);
>>>> @@ -639,6 +640,37 @@ sub update_lxc_config {
>>>>    	$raw .= "lxc.mount.auto = sys:mixed\n";
>>>>        }
>>>> +    # Clear passthrough directory from previous run
>>>> +    my $passthrough_dir = "/var/lib/lxc/$vmid/passthrough";
>>>> +    File::Path::rmtree($passthrough_dir);
>>> I think we need to make a few changes here.
>>>
>>> First: we don't necessarily need this directory.
>>> Having a device list would certainly be nice, but it makes more sense to
>>> just have a file we can easily parse (possibly even just a json hash),
>>> like the `devices` file we already create in the pre-start hook, except
>>> prepared *for* the pre-start hook, which *should* be able to just
>>> `mknod` the devices right into the container's `/dev` on startup.
>>
>> Devices mknoded into the container's /dev directory in the pre-start
>> hook will not be visible in the container once it is fully started.
> Ah yes, I keep ignoring that.
>
>> Meanwhile mknoding a device to a different path inside the container
>> works fine. It seems that LXC mounts over the /dev directory. This can
> /dev will be a tmpfs, yes.
>
>> be solved by calling mknod in lxc-pve-autodev-hook, but this does not
>> work with unprivileged containers without the mknod capability.
>>
>> So are bind mounts our only option without modifying LXC,
>> or am I overlooking something?
> Sort of. We *could* still do this via a separate process we signal from
> out of the autodev hook to do the work for it, but that'll make the
> startup process even more convoluted.
> And I think the seccomp proxying only starts after the entire init
> setup, so we also can't just reuly on syscalld (of which the entire
> point is to do mknods for the container 🙄).
>
> I'm also working on a seccomp wrapper to allow unprivileged restores of
> backups to `mknod()` the basics, but that, too, happens via seccomp, so
> not really reusable in this case either (and syscalld is not suitable
> for *this* either (for now) as it uses an lxc specific protocol and does
> not by itself perform the seccomp setup...)
>
> Perhaps there's a way to unify all that (at least partially) by teaching
> syscalld an additional protocol we can use in all 3 cases (although the
> the requirements are slightly different... here we only have "known"
> paths & permissions, so we wouldn't need to deal with copying another
> process' rootfs/chroot/fds/... to perform a syscall on their behalf,
> which the other cases do need...)
>
> So yeah, I suppose we can go the bind-mount route first, as it is
> simpler, and then maybe change it later.
>
> However, I still don't want to fill `/var/lib/lxc` on the host with
> device nodes directly whenever we update the config via
> `update_lxc_config()`.
>
> So how about this:
>
> In the prestart hook:
> - mount a tmpfs to this path
> - mknod the devices into it
> And then in the autodev hook do the bind-mounting.
>
>>
>>> We'd also avoid "lingering" device nodes with potentially harmful
>>> uid/permissions in /var, which is certainly better from a security POV.
>>>
>>> But note that we do need the `lxc.cgroup2.*` entries before starting the
>>> container in order to ensure the devices cgroup has the right
>>> permissions.
>>>
>>>> +
>>>> +    PVE::LXC::Config->foreach_passthrough_device($conf, sub {
>>>> +	my ($key, $sanitized_path) = @_;
>>>> +
>>>> +	my $absolute_path = "/$sanitized_path";
>>>> +	my ($mode, $rdev) = (stat($absolute_path))[2, 6];
>>>> +	die "Could not find major and minor ids of device $absolute_path.\n"
>>>> +	    unless ($mode && $rdev);
>>>> +
>>>> +	my $major = PVE::Tools::dev_t_major($rdev);
>>>> +	my $minor = PVE::Tools::dev_t_minor($rdev);
>>>> +	my $device_type_char = S_ISBLK($mode) ? 'b' : 'c';
>>>> +	my $passthrough_device_path = "$passthrough_dir/$sanitized_path";
>>>> +	File::Path::make_path(dirname($passthrough_device_path));
>>>> +	PVE::Tools::run_command([
>>>> +	    '/usr/bin/mknod',
>>>> +	    '-m', '0660',
> Btw. with a property string used for the device entry, we could probably
> also have an optional `mode` to use instead of `0660`, as well as a
> `uid` and `gid` - but we'd need to map those with the container's id
> mapping. Not sure if we already have helpers for that apart from getting
> the root ids.
>
>>>> +	    $passthrough_device_path,
>>>> +	    $device_type_char,
>>>> +	    $major,
>>>> +	    $minor
>>>> +	]);
>>> It's probably worth adding a helper for the mknod syscall to
>>> `PVE::Tools`, there are a bunch of syscalls in there already.
>>>
>>>> +	chown 100000, 100000, $passthrough_device_path if ($unprivileged);
>>> ^ This isn't necessarily the correct id. Users may have custom id
>>> mappings.
>>> `PVE::LXC::parse_id_maps($conf)` returns the mapping alongside the root
>>> uid and gid. (See for example `sub mount_all` for how it's used.
>>>
>>>> +
>>>> +	$raw .= "lxc.cgroup2.devices.allow = $device_type_char $major:$minor rw\n";
>>>> +	$raw .= "lxc.mount.entry = $passthrough_device_path $sanitized_path none bind,create=file\n";
>>>> +    });
>>>> +
>>>>        # WARNING: DO NOT REMOVE this without making sure that loop device nodes
>>>>        # cannot be exposed to the container with r/w access (cgroup perms).
>>>>        # When this is enabled mounts will still remain in the monitor's namespace