From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <87423a6b-a17e-5ea4-9176-cd81e96c5693@proxmox.com>
Date: Fri, 16 Dec 2022 14:42:15 +0100
From: Fiona Ebner
To: pve-devel@lists.proxmox.com, aderumier@odiso.com
References: <20221209192726.1499142-1-aderumier@odiso.com> <20221209192726.1499142-9-aderumier@odiso.com>
In-Reply-To: <20221209192726.1499142-9-aderumier@odiso.com>
Subject: Re: [pve-devel] [PATCH qemu-server 08/10] memory: add virtio-mem support
List-Id: Proxmox VE development discussion

On 09.12.22 at 20:27, Alexandre Derumier wrote:
> a 4GB static memory is needed for DMA+boot memory, as this memory
> is almost always un-unpluggeable.
>
> 1 virtio-mem pci device is setup for each numa node on pci.4 bridge
>
> virtio-mem use a fixed blocksize with 32000 blocks
> Blocksize is computed from the maxmemory-4096/32000 with a minimum of
> 2MB to map THP.
> (lower blocksize = more chance to unplug memory).
>
> fixes:
> https://bugzilla.proxmox.com/show_bug.cgi?id=931

Comment 7 talks about Windows, and virtio-mem is not supported there at
the moment, so I don't think we should consider it fixed ;)

> https://bugzilla.proxmox.com/show_bug.cgi?id=2949
> Signed-off-by: Alexandre Derumier
> ---
>  PVE/QemuServer.pm        |  8 +++-
>  PVE/QemuServer/Memory.pm | 98 +++++++++++++++++++++++++++++++++++++---
>  PVE/QemuServer/PCI.pm    |  8 ++++
>  3 files changed, 106 insertions(+), 8 deletions(-)
>
> diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm
> index 0d5b550..43fab29 100644
> --- a/PVE/QemuServer.pm
> +++ b/PVE/QemuServer.pm
> @@ -285,6 +285,12 @@ my $memory_fmt = {
>          optional => 1,
>          enum => [@max_memory_list],
>      },
> +    virtio => {
> +        description => "enable virtio-mem memory",

We really should mention that it's a technology preview, and that it
only works for Linux >=5.8 guests currently:
https://virtio-mem.gitlab.io/user-guide/#important-current-limitations

> +        type => 'boolean',
> +        optional => 1,
> +        default => 0,
> +    },
>  };
>
>  my $meta_info_fmt = {
> @@ -3898,7 +3904,7 @@ sub config_to_command {
>          push @$cmd, get_cpu_options($conf, $arch, $kvm, $kvm_off, $machine_version, $winversion, $gpu_passthrough);
>      }
>
> -    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd);
> +    PVE::QemuServer::Memory::config($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd, $devices, $bridges, $arch, $machine_type);
>
>      push @$cmd, '-S' if $conf->{freeze};
>
> diff --git a/PVE/QemuServer/Memory.pm b/PVE/QemuServer/Memory.pm
> index 8bbbf07..70ab65a 100644
> --- a/PVE/QemuServer/Memory.pm
> +++ b/PVE/QemuServer/Memory.pm
> @@ -8,6 +8,8 @@ use PVE::Exception qw(raise raise_param_exc);
>
>  use PVE::QemuServer;
>  use PVE::QemuServer::Monitor qw(mon_cmd);
> +use PVE::QemuServer::PCI qw(print_pci_addr);
> +
>  use base qw(Exporter);
>
>  our @EXPORT_OK = qw(
> @@ -27,7 +29,9 @@ my sub get_static_mem {
>      my $static_memory = 0;
>      my $memory = PVE::QemuServer::parse_memory($conf->{memory});
>
> -    if($memory->{max}) {
> +    if ($memory->{virtio}) {
> +        $static_memory = 4096;
> +    } elsif ($memory->{max}) {
>          my $dimm_size = $memory->{max} / 64;
>          #static mem can't be lower than 4G and lower than 1 dimmsize by socket
>          $static_memory = $dimm_size * $sockets;
> @@ -102,6 +106,24 @@ my sub get_max_mem {
>      return $cpu_max_mem;
>  }
>
> +my sub get_virtiomem_block_size {
> +    my ($conf) = @_;
> +
> +    my $MAX_MEM = get_max_mem($conf);
> +    my $static_memory = get_static_mem($conf);
> +    my $memory = get_current_memory($conf);
> +    #virtiomem can map 32000 block size. try to use lowerst blocksize, lower = more chance to unplug memory.

Style nit: line too long

> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
> +    #round next power of 2
> +    $blocksize = 2**(int(log($blocksize)/log(2))+1);

What if log($blocksize)/log(2) is exactly an integer? Do we still want
to add 1 then? If not, please use ceil(...) instead of int(...)+1.
Well, I guess it can't happen in practice given the values that are
possible for $MAX_MEM and $static_memory, but still.

> +    #2MB is the minimum to be aligned with THP
> +    $blocksize = 2 if $blocksize < 2;

Nit: $blocksize is at least 2**1 after the previous calculation, so this
isn't really needed.

> +
> +    die "memory size need to be multiple of $blocksize MB when virtio-mem is enabled" if ($memory % $blocksize != 0);

Missing newline in error message.
Style nit: line too long

> +
> +    return $blocksize;
> +}
> +
>  sub get_current_memory{
>      my ($conf) = @_;
>
> @@ -224,7 +246,41 @@ sub qemu_memory_hotplug {
>      my $MAX_MEM = get_max_mem($conf);
>      die "you cannot add more memory than max mem $MAX_MEM MB!\n" if $value > $MAX_MEM;
>
> -    if ($value > $memory) {
> +    my $confmem = PVE::QemuServer::parse_memory($conf->{memory});
> +
> +    if ($confmem->{virtio}) {
> +        my $blocksize = get_virtiomem_block_size($conf);
> +        my $requested_size = ($value - $static_memory) / $sockets * 1024 * 1024;
> +        my $totalsize = $static_memory;
> +        my $err = undef;
> +
> +        for (my $i = 0; $i < $sockets; $i++) {
> +
> +            my $id = "virtiomem$i";
> +            my $retry = 0;
> +            mon_cmd($vmid, 'qom-set', path => "/machine/peripheral/$id", property => "requested-size", value => int($requested_size));

I'd wrap the mon_cmd calls in eval and also catch errors there.

> +
> +            my $size = 0;
> +            while (1) {
> +                sleep 1;

If there really is no good alternative to this querying loop, I'd rather
issue the qom-set command for all virtio-mem devices first, and then do
the loop. Maybe also move the sleep to the end of the loop.

> +                $size = mon_cmd($vmid, 'qom-get', path => "/machine/peripheral/$id", property => "size");
> +                $err = 1 if $retry > 5;
> +                last if $size eq $requested_size || $retry > 5;

I think $requested_size doesn't have to be a multiple of $sockets, so
this should rather be int($requested_size), which is what you set above.
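
To illustrate the ordering suggested above (qom-set on every device
first, then a single polling loop that sleeps at the end of each round),
here is a rough sketch. It is not taken from the patch:
set_virtiomem_sizes() and its parameters are hypothetical, and $mon_cmd
is a code reference standing in for mon_cmd() so that the snippet runs
on its own.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the patch. $mon_cmd is a code reference
# standing in for PVE::QemuServer::Monitor::mon_cmd.
# Returns ($sizes, $errors), both hash references keyed by device id.
sub set_virtiomem_sizes {
    my ($mon_cmd, $vmid, $sockets, $requested_size, $max_retries, $delay) = @_;

    # Compare against the same integer value that is sent to QEMU.
    $requested_size = int($requested_size);
    my @ids = map { "virtiomem$_" } (0 .. $sockets - 1);
    my $errors = {};

    # First pass: request the new size on every device, catching errors.
    for my $id (@ids) {
        eval {
            $mon_cmd->($vmid, 'qom-set', path => "/machine/peripheral/$id",
                property => 'requested-size', value => $requested_size);
        };
        $errors->{$id} = $@ if $@;
    }

    # Second pass: poll all devices that accepted the request.
    my $sizes = { map { $_ => 0 } @ids };
    my %pending = map { $_ => 1 } grep { !$errors->{$_} } @ids;
    for (my $retry = 0; %pending && $retry <= $max_retries; $retry++) {
        for my $id (sort keys %pending) {
            my $size = eval {
                $mon_cmd->($vmid, 'qom-get', path => "/machine/peripheral/$id",
                    property => 'size');
            };
            if (my $err = $@) {
                $errors->{$id} = $err;
                delete $pending{$id};
                next;
            }
            $sizes->{$id} = $size if defined($size);
            delete $pending{$id} if defined($size) && $size == $requested_size;
        }
        # Sleep only if something is still pending, at the end of the round.
        sleep($delay) if %pending && $delay;
    }
    $errors->{$_} //= "timeout waiting for requested size\n" for keys %pending;

    return ($sizes, $errors);
}
```

The caller could then sum up the reported sizes, write the config, and
die if $errors is not empty. Retry count and delay are left as
parameters here only to keep the sketch testable.
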

> +                $retry++;
> +            }
> +            $totalsize += ($size / 1024 / 1024 );
> +        }
> +        #update conf after each succesfull change
> +        if($err) {

But this is only done in the error case, not after each successful change.

> +            my $mem = { max => $MAX_MEM, virtio => 1};
> +            $mem->{current} = $totalsize;

Nit: int($totalsize) just to be sure?

> +            $conf->{memory} = PVE::QemuServer::print_memory($mem);
> +            PVE::QemuConfig->write_config($vmid, $conf);
> +            raise_param_exc({ 'memory' => "error modify virtio memory" }) if $err;

It's not necessarily a parameter issue, please use die instead.

> +        }
> +        return $totalsize;

The other branches don't (explicitly) return anything.

> +
> +    } elsif ($value > $memory) {
>
>          my $numa_hostmap;
>
> @@ -324,14 +380,15 @@ sub qemu_dimm_list {
>  }
>
>  sub config {
> -    my ($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd) = @_;
> +    my ($conf, $vmid, $sockets, $cores, $defaults, $hotplug_features, $cmd, $devices, $bridges, $arch, $machine_type) = @_;
>
>      my $memory = get_current_memory($conf);
>
>      my $static_memory = get_static_mem($conf);
> +
>      my $confmem = PVE::QemuServer::parse_memory($conf->{memory});
>
> -    if ($hotplug_features->{memory} || defined($confmem->{max})) {
> +    if ($hotplug_features->{memory} || defined($confmem->{max}) || defined($confmem->{virtio})) {

Again, should we even bother attaching the devices if hotplug is not
enabled?

>          die "NUMA needs to be enabled for memory hotplug\n" if !$conf->{numa};
>          my $MAX_MEM = get_max_mem($conf);
>          die "Total memory is bigger than ${MAX_MEM}MB\n" if $memory > $MAX_MEM;
> @@ -342,8 +399,12 @@ sub config {
>          }
>
>          die "minimum memory must be ${static_memory}MB\n" if($memory < $static_memory);
> +
> +        my $cmdstr = "size=${static_memory}";
>          my $slots = $confmem->{max} ? 64 : 255;
> -        push @$cmd, '-m', "size=${static_memory},slots=$slots,maxmem=${MAX_MEM}M";
> +        $cmdstr .= ",slots=$slots" if !$confmem->{'virtio'};
> +        $cmdstr .= ",maxmem=${MAX_MEM}M";
> +        push @$cmd, '-m', $cmdstr;
>
>      } else {
>          push @$cmd, '-m', $static_memory;
> @@ -412,7 +473,26 @@ sub config {
>          }
>      }
>
> -    if ($hotplug_features->{memory} || $confmem->{max}) {
> +    if ($confmem->{'virtio'}) {
> +        my $MAX_MEM = get_max_mem($conf);
> +        my $node_maxmem = ($MAX_MEM - $static_memory) / $sockets;
> +        my $node_mem = ($memory - $static_memory) / $sockets;
> +        my $blocksize = get_virtiomem_block_size($conf);
> +
> +        for (my $i = 0; $i < $sockets; $i++) {
> +
> +            my $id = "virtiomem$i";
> +            my $mem_object = print_mem_object($conf, "mem-$id", $node_maxmem);
> +            push @$cmd, "-object" , "$mem_object,reserve=off";
> +
> +            my $pciaddr = print_pci_addr($id, $bridges, $arch, $machine_type);

Can we rather handle the PCI address printing in config_to_command() and
only pass in the addresses here? That would also avoid the PCI module
usage and the new "one-time usage" parameters passed to
Memory::config(). Maybe have a small helper in here that just returns
the needed $ids depending on the config. Then in config_to_command(),
call that helper, print the PCI addresses, then call
Memory::config(..., { $id1 => $address1, $id2 => $address2, ... }).
It might not be the nicest either, but it would at least be a little
less cluttered IMHO. But feel free to come up with something better or
keep it as-is if you really want to ;)

> +            my $mem_device = "virtio-mem-pci,block-size=${blocksize}M,requested-size=${node_mem}M,id=$id,memdev=mem-$id,node=$i$pciaddr";
> +            $mem_device .= ",prealloc=on" if $conf->{hugepages};

So prealloc=on for the device, but not prealloc=yes for the object
below[0]. Would you mind explaining it to me?
I just found the part mentioned for v7.0 here
https://virtio-mem.gitlab.io/user-guide/user-guide-qemu.html#updates

> +            push @$devices, "-device", $mem_device;

The dimm devices in the other branch are not pushed onto $devices, so
this feels inconsistent. Why not add it onto $cmd too?

> +        }
> +
> +    } elsif ($hotplug_features->{memory} || $confmem->{max}) {
> +
>          foreach_dimm($conf, $vmid, $memory, $sockets, sub {
>              my ($conf, $vmid, $name, $dimm_size, $numanode, $current_size, $memory) = @_;
>
> @@ -430,12 +510,16 @@ sub config {
>  sub print_mem_object {
>      my ($conf, $id, $size) = @_;
>
> +    my $confmem = PVE::QemuServer::parse_memory($conf->{memory});
> +
>      if ($conf->{hugepages}) {
>
>          my $hugepages_size = hugepages_size($conf, $size);
>          my $path = hugepages_mount_path($hugepages_size);
>
> -        return "memory-backend-file,id=$id,size=${size}M,mem-path=$path,share=on,prealloc=yes";
> +        my $object = "memory-backend-file,id=$id,size=${size}M,mem-path=$path,share=on";
> +        $object .= ",prealloc=yes" if !$confmem->{virtio};

[0]

> +        return $object;
>      } else {
>          return "memory-backend-ram,id=$id,size=${size}M";
>      }
> diff --git a/PVE/QemuServer/PCI.pm b/PVE/QemuServer/PCI.pm
> index a18b974..0187c74 100644
> --- a/PVE/QemuServer/PCI.pm
> +++ b/PVE/QemuServer/PCI.pm
> @@ -249,6 +249,14 @@ sub get_pci_addr_map {
>          'scsihw2' => { bus => 4, addr => 1 },
>          'scsihw3' => { bus => 4, addr => 2 },
>          'scsihw4' => { bus => 4, addr => 3 },
> +        'virtiomem0' => { bus => 4, addr => 4 },
> +        'virtiomem1' => { bus => 4, addr => 5 },
> +        'virtiomem2' => { bus => 4, addr => 6 },
> +        'virtiomem3' => { bus => 4, addr => 7 },
> +        'virtiomem4' => { bus => 4, addr => 8 },
> +        'virtiomem5' => { bus => 4, addr => 9 },
> +        'virtiomem6' => { bus => 4, addr => 10 },
> +        'virtiomem7' => { bus => 4, addr => 11 },

What if $conf->{sockets} > 8? Maybe mention the limitation in the
description of the 'virtio' property in the 'memory' string. Is the plan
to just add on more virtiomem PCI devices in the future?

>      } if !defined($pci_addr_map);
>      return $pci_addr_map;
>  }
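
Regarding the rounding question in get_virtiomem_block_size(): one way
to sidestep both the int(...)+1 overshoot on exact powers of two and any
floating-point inaccuracy of log() is a plain doubling loop. A minimal
sketch, assuming sizes in MB as in the patch. virtiomem_block_size() is
a hypothetical standalone function, and the loop is a swapped-in
technique rather than what the patch does.

```perl
use strict;
use warnings;

# Hypothetical standalone variant of the block size calculation.
# Both arguments are in MB; the result is the smallest power of two
# (at least 2 MB, for THP alignment) such that 32000 blocks cover the
# hot-pluggable range.
sub virtiomem_block_size {
    my ($max_mem, $static_memory) = @_;

    my $hotplug_mem = $max_mem - $static_memory;

    my $blocksize = 2;
    # Double until 32000 blocks of this size are enough; an exact fit
    # is kept as-is instead of being rounded up once more.
    $blocksize *= 2 while $blocksize * 32000 < $hotplug_mem;

    return $blocksize;
}

# Example: 128000 MB of hot-pluggable memory fits exactly into
# 32000 blocks of 4 MB.
print virtiomem_block_size(4096 + 128000, 4096), "\n";
```

With the log-based formula quoted above, the same input would be rounded
up to 8, because int(2)+1 is 3.
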