Date: Wed, 25 Jan 2023 10:54:10 +0100
From: Fiona Ebner
To: "DERUMIER, Alexandre", pve-devel@lists.proxmox.com, aderumier@odiso.com
Subject: Re: [pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem support

On 25.01.23 at 10:00, DERUMIER, Alexandre wrote:
> On Tuesday, 24 January 2023 at 14:06 +0100, Fiona Ebner wrote:
>> On 04.01.23 at 07:43, Alexandre Derumier wrote:
>>> +sub get_virtiomem_block_size {
>>> +    my ($conf) = @_;
>>> +
>>> +    my $MAX_MEM = get_max_mem($conf);
>>> +    my $static_memory = get_static_mem($conf);
>>> +    my $memory = get_current_memory($conf->{memory});
>>> +
>>> +    #virtiomem can map 32000 block size.
>>> +    #try to use lowest blocksize, lower = more chance to unplug memory.
>>> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
>>> +    #2MB is the minimum to be aligned with THP
>>> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
>>> +    $blocksize = 4 if $blocksize < 4;
>>
>> Why suddenly 4?
>
> I have added a note in the commit:
>
>> Note: Currently, Linux only supports a 4MB virtio block size; 2MB
>> support is currently in progress.
>
> So 2MB is valid from the QEMU side, but the Linux guest kernel doesn't
> actually support it; at least, you need to use a multiple of 4MB. You
> can remove/add 2 blocks of 2MB at the same time, but it doesn't seem
> to be atomic, so I think it's better to use the minimum currently
> supported block size.

Sorry, I missed that.
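Just to make sure I read the rounding correctly, here is a tiny
standalone example of the calculation (the concrete maxmem/static
values are picked arbitrarily, not taken from the patch):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Example values only (64 GiB maxmem, 4 GiB static memory), in MiB.
    my $MAX_MEM = 65536;
    my $static_memory = 4096;

    # Spread the hot-pluggable range over at most 32000 blocks ...
    my $blocksize = ($MAX_MEM - $static_memory) / 32000;    # 1.92
    # ... round up to the next power of two ...
    $blocksize = 2**(ceil(log($blocksize) / log(2)));       # 2
    # ... and clamp to 4 MiB, the smallest block size current guest
    # kernels support.
    $blocksize = 4 if $blocksize < 4;                        # 4

    print "block size: ${blocksize} MiB\n";

So with these numbers we still end up at the 4 MiB minimum; only once
maxmem minus static memory exceeds 128000 MiB (about 125 GiB) does the
computed block size go above 4 MiB.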
> Maybe later, we could extend the virtio=X option to tell the supported
> virtio version (virtio=1.1, virtio=1.2) and enable supported features?

If we really need to and can't detect it otherwise, sure.

>>> +my sub balance_virtiomem {
>>
>> This function is rather difficult to read. The "record errors and
>> filter" logic really should be its own patch after the initial
>> support. FWIW, I tried my best and it does seem fine :)
>>
>> But it's not clear to me that we even want that logic? Is it really
>> that common for qom-set to take so long to be worth introducing all
>> this additional handling/complexity? Or should it just be a hard
>> error if qom-set still didn't have an effect on a device after 5
>> seconds?
>
> From my tests, it can take 2-3 seconds on unplug on a bigger setup.
> I'm doing it in parallel to be faster, to avoid waiting
> nbdimm * 2-3 seconds.

Sure, doing it in parallel is perfectly fine. I'm just thinking that
switching gears (too early) and redirecting the request might not be
ideal.

You also issue an additional qom-set to go back to
$virtiomem->{current} * 1024 * 1024 if a request didn't make progress
in time. But to be sure that request worked, we'd also need to monitor
it ;)

I think issuing that request is fine as-is, but if there is a "hanging"
device, we really can't do much. And I do think the user should be
informed via an appropriate error if there is a problematic device.

Maybe we can use 10 seconds instead of 5 (2-3 seconds already sounds
too close to 5 IMHO), so that we have a good margin, and just die
instead of trying to redirect the request to another device. After
issuing the reset request and writing our best guess of what the memory
is to the config, of course.

If it really is an issue in practice that certain devices often take
too long for whatever reason, we can still add the redirect logic. But
starting out, I feel like it's not worth the additional complexity.

>> Would it actually be better to just fill up the first, then the
>> second, etc. as needed, rather than balancing? My gut feeling is that
>> having fewer "active" devices is better. But this would have to be
>> tested with some benchmarks of course.
>
> Well, from a NUMA perspective, you really want to balance as much as
> possible. (That's why, with classic hotplug, we add/remove DIMMs on
> each socket alternately.)
>
> That's the whole point of NUMA: reading the nearest memory attached to
> the processor where the process is running.
>
> That's a main advantage of virtio-mem vs. ballooning (which doesn't
> handle NUMA and removes pages randomly from any socket).

Makes sense. Thanks for the explanation!
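Coming back to the timeout handling above, to illustrate the "just die
after a timeout" idea: an untested sketch only, the helper name
wait_for_virtiomems(), the $query_size callback and the toy values are
all made up here, and the real code would of course read the device
size via QMP (qom-get) rather than a callback:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Untested sketch: poll all virtio-mem devices and die as soon as
    # one of them has made no progress for $timeout seconds.
    # $query_size stands in for however the actual code reads the
    # device's current size.
    sub wait_for_virtiomems {
        my ($targets, $query_size, $timeout) = @_;

        # per device: [last time the size changed, last seen size]
        my %progress;
        $progress{$_} = [time(), -1] for keys %$targets;

        while (1) {
            my $pending = 0;
            for my $id (sort keys %$targets) {
                my $size = $query_size->($id);
                next if $size == $targets->{$id}; # this device is done

                my ($last_change, $last_size) = @{$progress{$id}};
                if ($size != $last_size) {
                    # still making progress, remember the new size
                    $progress{$id} = [time(), $size];
                } elsif (time() - $last_change > $timeout) {
                    # the real code would reset the request and record
                    # the best guess of the current memory in the
                    # config before giving up
                    die "virtio-mem device '$id' stuck at ${size} MiB"
                        . " for more than ${timeout}s\n";
                }
                $pending++;
            }
            last if !$pending;
            sleep(1);
        }
    }

    # Toy usage: pretend both devices already reached their target size.
    my $targets = { virtiomem0 => 2048, virtiomem1 => 2048 };
    wait_for_virtiomems($targets, sub { return $targets->{$_[0]} }, 10);
    print "all virtio-mem devices reached their target size\n";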