Date: Tue, 24 Jan 2023 14:06:41 +0100
Message-ID: <3a360312-42c0-6e97-c0e3-6cc70285f3eb@proxmox.com>
From: Fiona Ebner
To: pve-devel@lists.proxmox.com, "aderumier@odiso.com"
In-Reply-To: <20230104064303.2898194-9-aderumier@odiso.com>
References: <20230104064303.2898194-1-aderumier@odiso.com> <20230104064303.2898194-9-aderumier@odiso.com>
Subject: Re: [pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem support

On 04.01.23 07:43, Alexandre Derumier wrote:
> a 4GB static memory is needed for DMA+boot memory, as this memory
> is almost always un-unpluggeable.
>
> 1 virtio-mem pci device is setup for each numa node on pci.4 bridge
>
> virtio-mem use a fixed blocksize with 32000 blocks
> Blocksize is computed from the maxmemory-4096/32000 with a minimum of
> 2MB to map THP.
> (lower blocksize = more chance to unplug memory).
>
> Note: Currently, linux only support 4MB virtio blocksize, 2MB support
> is currently is progress.
>

For the above paragraphs: s/GB/GiB/ s/MB/MiB/ ?

> For hotplug/unplug, we are try to allocate/unallocate same amount
> of memory aligned to the blocksize on each numa node if possible.
> If a node a not able to reach the target memory (could be an unmovable
> page on unplug for example), we try again to redispatch memory the
> remaining memory on others nodes.
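
Nit for the block size paragraph above: a worked example in the commit
message might make the computation easier to follow. With my own
numbers (not taken from the patch): for a 512 GiB maximum,
(524288 - 4096) MiB / 32000 is roughly 16.26 MiB, and rounding up to
the next power of two gives a 32 MiB block size.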
>
> About hugepages:
>
> For ordinary memory devices, such as DIMMs, we preallocate memory via the
> memory backend for such use cases; however, with virtio-mem we're dealing
> with sparse memory backends; preallocating the whole memory backend
> destroys the whole purpose of virtio-mem.
>
> Instead, we want to preallocate memory when actually exposing memory to the
> VM dynamically, and fail plugging memory gracefully + warn the user in case
> preallocation fails.
>
> fixes:
> https://bugzilla.proxmox.com/show_bug.cgi?id=931

As said last time, one user in the report has Windows, which doesn't
support virtio-mem, so I wouldn't consider the issue fixed.

> https://bugzilla.proxmox.com/show_bug.cgi?id=2949
> ---
>
> Signed-off-by: Alexandre Derumier
> ---
>  PVE/API2/Qemu.pm         |  10 +-
>  PVE/QemuServer.pm        |   7 +-
>  PVE/QemuServer/Memory.pm | 233 ++++++++++++++++++++++++++++++++++++---
>  PVE/QemuServer/PCI.pm    |   8 ++
>  4 files changed, 242 insertions(+), 16 deletions(-)
>
> diff --git a/PVE/API2/Qemu.pm b/PVE/API2/Qemu.pm
> index cab1e84..42941ac 100644
> --- a/PVE/API2/Qemu.pm
> +++ b/PVE/API2/Qemu.pm
> @@ -32,7 +32,7 @@ use PVE::QemuServer::Drive;
>  use PVE::QemuServer::ImportDisk;
>  use PVE::QemuServer::Monitor qw(mon_cmd);
>  use PVE::QemuServer::Machine;
> -use PVE::QemuServer::Memory qw(get_current_memory parse_memory get_host_max_mem);
> +use PVE::QemuServer::Memory qw(get_current_memory parse_memory get_host_max_mem get_virtiomem_block_size);
>  use PVE::QemuMigrate;
>  use PVE::RPCEnvironment;
>  use PVE::AccessControl;
> @@ -487,6 +487,14 @@ my $check_memory_param = sub {
>             if $mem->{max} > $host_max_mem;
>      }
>
> +    #unplug works better with 128MB by dimm to match the linux blocksize btyes.

s/MB/MiB/
s/btyes/bytes/

> +    if ($mem->{virtio}) {
> +        my $blocksize = get_virtiomem_block_size($conf);
> +
> +        die "memory need to be a multiple of $blocksize MB when virtiomem is enabled\n"

s/MB/MiB/
s/need/needs/

> +            if $mem->{current} % $blocksize != 0;
> +    }
> +
>      if ($param->{memory} || defined($param->{balloon})) {
>
>          my $maxmem = undef;
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 5847a78..51b29fc 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -3857,7 +3857,12 @@ sub config_to_command {
>          push @$cmd, get_cpu_options($conf, $arch, $kvm, $kvm_off, $machine_version, $winversion, $gpu_passthrough);
>      }
>
> -    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd);
> +    my $mem_devices = {};
> +    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd, $mem_devices);
> +    foreach my $id (sort keys %$mem_devices) {
> +        my $pciaddr = print_pci_addr($id, $bridges, $arch, $machine_type);
> +        push @$devices, "-device", "$mem_devices->{$id}$pciaddr";
> +    }
>
>      push @$cmd, '-S' if $conf->{freeze};
>
> diff --git a/PVE/QemuServer/Memory.pm b/PVE/QemuServer/Memory.pm
> index b9136d2..6827004 100644
> --- a/PVE/QemuServer/Memory.pm
> +++ b/PVE/QemuServer/Memory.pm
> @@ -3,6 +3,8 @@ package PVE::QemuServer::Memory;
>  use strict;
>  use warnings;
>
> +use POSIX qw/ceil/;

Style nit: haven't seen this variant yet, usually qw() is used in our
codebase.

> +
>  use PVE::Tools qw(run_command lock_file lock_file_full file_read_firstline dir_glob_foreach);
>  use PVE::Exception qw(raise raise_param_exc);
>  use PVE::GuestHelpers qw(safe_string_ne safe_num_ne safe_boolean_ne);
> @@ -15,6 +17,7 @@ our @EXPORT_OK = qw(
>  get_current_memory
>  parse_memory
>  get_host_max_mem
> +get_virtiomem_block_size
>  );
>
>  my $MAX_NUMA = 8;
> @@ -37,6 +40,12 @@ my $memory_fmt = {
>          minimum => 65536,
>          maximum => 4194304
>      },
> +    virtio => {
> +        description => "enable virtio-mem memory",

Again, we really should note that the feature is a technology preview
and only supported on recent enough Linux kernels.

> +        type => 'boolean',
> +        optional => 1,
> +        default => 0,
> +    }
>  };
>
>  sub print_memory {
> @@ -69,7 +78,9 @@ my sub get_static_mem {
>      my $static_memory = 0;
>      my $memory = parse_memory($conf->{memory});
>
> -    if ($memory->{max}) {
> +    if ($memory->{virtio}) {
> +        $static_memory = 4096;
> +    } elsif ($memory->{max}) {
>          my $dimm_size = $memory->{max} / $MAX_SLOTS;
>          #static mem can't be lower than 4G and lower than 1 dimmsize by socket
>          $static_memory = $dimm_size * $sockets;
> @@ -160,6 +171,134 @@ sub get_current_memory {
>      return $memory->{current};
>  }
>
> +sub get_virtiomem_block_size {
> +    my ($conf) = @_;
> +
> +    my $MAX_MEM = get_max_mem($conf);
> +    my $static_memory = get_static_mem($conf);
> +    my $memory = get_current_memory($conf->{memory});
> +
> +    #virtiomem can map 32000 block size.
> +    #try to use lowest blocksize, lower = more chance to unplug memory.
> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
> +    #2MB is the minimum to be aligned with THP
> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
> +    $blocksize = 4 if $blocksize < 4;

Why suddenly 4? As discussed last time, the future-proof way to
calculate this (if we want minimum 2) without risking any negative
result for log($blocksize), is

    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
    $blocksize = 2 if $blocksize < 2;
    $blocksize = 2**(ceil(log($blocksize)/log(2)));

> +
> +    return $blocksize;
> +}
> +
> +my sub get_virtiomem_total_current {
> +    my ($mems) = @_;
> +    my $total = 0;
> +    foreach my $id (keys %$mems) {

Style nit: use for instead of foreach (same for all others below)

Nit: can iterate over values instead of keys

Same for the two functions below.

> +        my $mem = $mems->{$id};
> +        $total += $mem->{current};
> +    }
> +    return $total;

Below, you use total for a count, not a size. This is confusing naming.
Here, size should be used.

> +}
> +
> +my sub get_virtiomem_total_noerror {
> +    my ($mems) = @_;
> +
> +    my $total = 0;
> +    foreach my $id (keys %$mems) {
> +        my $mem = $mems->{$id};
> +        next if $mem->{error};
> +        $total++;
> +    }
> +    return $total;

I think the whole function could be

    scalar(grep { !$_->{error} } values $mems->%*)

and since there's only one caller, you can inline it there.

> +}
> +
> +my sub get_virtiomem_total_errors_size {
> +    my ($mems) = @_;
> +
> +    my $size = 0;
> +    foreach my $id (keys %$mems) {
> +        my $mem = $mems->{$id};
> +        next if !$mem->{error};
> +        $size += $mem->{current};
> +    }
> +    return $size;
> +}
> +
> +my sub balance_virtiomem {

This function is rather difficult to read. The "record errors and
filter" logic really should be its own patch after the initial support.
FWIW, I tried my best and it does seem fine :)

But it's not clear to me that we even want that logic? Is it really
that common for qom-set to take so long to be worth introducing all
this additional handling/complexity? Or should it just be a hard error
if qom-set still didn't have an effect on a device after 5 seconds?

Would it actually be better to just fill up the first, then the second
etc. as needed, rather than balancing? My gut feeling is that having
fewer "active" devices is better. But this would have to be tested with
some benchmarks of course.
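
Rough sketch of the sequential variant I have in mind (untested, and
the per-device maximum is a made-up field here, so take it as an
illustration only):

    # Fill up one device after the other instead of balancing.
    # Assumes $target_total is in MiB and that each device records its
    # per-device maximum in $mem->{max} (made-up field name).
    my $remaining = $target_total;
    for my $id (sort keys $virtiomems->%*) {
        my $mem = $virtiomems->{$id};
        my $request = $remaining > $mem->{max} ? $mem->{max} : $remaining;
        $request -= $request % $blocksize; # keep the request block-aligned
        mon_cmd(
            $vmid, 'qom-set',
            path => "/machine/peripheral/$id",
            property => 'requested-size',
            value => $request * 1024 * 1024,
        );
        $remaining -= $request;
        last if $remaining <= 0;
    }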
> +    my ($vmid, $virtiomems, $blocksize, $target_virtiomem_total) = @_;
> +
> +    my $virtiomem_total_noerror = get_virtiomem_total_noerror($virtiomems);
> +
> +    print"try to balance memory on $virtiomem_total_noerror remaining virtiomems\n";

Style nit: missing space after print (same for some others below, also
missing space after if a few times)

> +
> +    die "error. no more available blocks in virtiomem to balance the remaining memory" if $target_virtiomem_total < 0;
> +    die "error. No more available virtiomem to balance the remaining memory\n" if $virtiomem_total_noerror == 0;

Style nit: lines > 100 characters (same for some others below).
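
E.g. (just re-wrapped, wording unchanged):

    die "error. no more available blocks in virtiomem to balance the remaining memory"
        if $target_virtiomem_total < 0;
    die "error. No more available virtiomem to balance the remaining memory\n"
        if $virtiomem_total_noerror == 0;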