From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.ebner@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id CEEEF966E3
 for <pve-devel@lists.proxmox.com>; Wed, 25 Jan 2023 10:54:16 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id A8575CE4C
 for <pve-devel@lists.proxmox.com>; Wed, 25 Jan 2023 10:54:16 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Wed, 25 Jan 2023 10:54:15 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 8B03F460EA;
 Wed, 25 Jan 2023 10:54:15 +0100 (CET)
Message-ID: <5c097397-3777-66a9-579c-b7d99941c399@proxmox.com>
Date: Wed, 25 Jan 2023 10:54:10 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.5.0
Content-Language: en-US
To: "DERUMIER, Alexandre" <Alexandre.DERUMIER@groupe-cyllene.com>,
 "pve-devel@lists.proxmox.com" <pve-devel@lists.proxmox.com>,
 "aderumier@odiso.com" <aderumier@odiso.com>
References: <20230104064303.2898194-1-aderumier@odiso.com>
 <20230104064303.2898194-9-aderumier@odiso.com>
 <3a360312-42c0-6e97-c0e3-6cc70285f3eb@proxmox.com>
 <b68f0f56607950351ee5f291b24fc6ad84ee5bf7.camel@groupe-cyllene.com>
From: Fiona Ebner <f.ebner@proxmox.com>
In-Reply-To: <b68f0f56607950351ee5f291b24fc6ad84ee5bf7.camel@groupe-cyllene.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.585 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A           -1.148 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pve-devel] [PATCH v2 qemu-server 8/9] memory: add virtio-mem
 support
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Wed, 25 Jan 2023 09:54:16 -0000

On 25.01.23 at 10:00, DERUMIER, Alexandre wrote:
> On Tuesday, 24 January 2023 at 14:06 +0100, Fiona Ebner wrote:
>> On 04.01.23 at 07:43, Alexandre Derumier wrote:
>>> +sub get_virtiomem_block_size {
>>> +    my ($conf) = @_;
>>> +
>>> +    my $MAX_MEM = get_max_mem($conf);
>>> +    my $static_memory = get_static_mem($conf);
>>> +    my $memory = get_current_memory($conf->{memory});
>>> +
>>> +    #virtio-mem can map at most 32000 blocks.
>>> +    #try to use the lowest blocksize, lower = more chance to unplug memory.
>>> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
>>> +    #2MB is the minimum to be aligned with THP
>>> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
>>> +    $blocksize = 4 if $blocksize < 4;
>>
>> Why suddenly 4?
> 
> I have added a note to the commit message:
> 
>> Note: Currently, Linux only supports a 4MB virtio-mem block size; 2MB
>> support is still in progress.
>>
> 
> So 2MB is valid on the QEMU side, but the Linux guest kernel doesn't
> actually support it yet. At the very least, you need to use a multiple
> of 4MB. You can remove/add two 2MB blocks at the same time, but that
> doesn't seem to be atomic, so I think it's better to use the minimum
> currently supported block size.

Sorry, I missed that.
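
Just to double-check my reading of the formula with concrete numbers
(assuming the values are in MiB like the rest of the memory code, please
correct me if I'm off):

    # MAX_MEM = 65536, static = 4096:
    # (65536 - 4096) / 32000 = 1.92   -> 2^ceil(log2(1.92))   = 2  -> clamped to 4
    # MAX_MEM = 524288, static = 4096:
    # (524288 - 4096) / 32000 = 16.256 -> 2^ceil(log2(16.256)) = 32

So with anything up to 128000 MiB (~125 GiB) of pluggable range, we'd end
up at the 4 MiB minimum anyway.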

> Maybe later, we could extend the virtio=X option to specify the
> supported virtio version (virtio=1.1, virtio=1.2) and enable the
> corresponding features?

If we really need to and can't detect it otherwise, sure.

>>> +my sub balance_virtiomem {
>>
>> This function is rather difficult to read. The "record errors and
>> filter" logic really should be its own patch after the initial
>> support.
>> FWIW, I tried my best and it does seem fine :)
>>
>> But it's not clear to me that we even want that logic? Is it really
>> that common for qom-set to take so long that it's worth introducing
>> all this additional handling/complexity? Or should it just be a hard
>> error if qom-set still hasn't had an effect on a device after 5
>> seconds?
>>
> From my tests, it can take 2-3 seconds per unplug on a bigger setup.
> I'm doing it in parallel to be faster, to avoid waiting (number of
> DIMMs) * 2-3 seconds.

Sure, doing it in parallel is perfectly fine. I'm just thinking that
switching gears (too early) and redirecting the request might not be
ideal. You also issue an additional qom-set to go back to
$virtiomem->{current} * 1024 * 1024 if a request didn't make progress in
time. But to be sure that request worked, we'd also need to monitor it
;) I think issuing that request is fine as-is, but if there is a
"hanging" device, we really can't do much. And I do think the user
should be informed via an appropriate error if there is a problematic
device.

Maybe we can use 10 seconds instead of 5 (2-3 seconds already sounds too
close to 5 IMHO), so that we have a good margin, and just die instead of
trying to redirect the request to another device. After issuing the
reset request and writing our best guess of what the memory is to the
config of course.

If it really is an issue in practice that certain devices often take too
long for whatever reason, we can still add the redirect logic. But
starting out, I feel like it's not worth the additional complexity.
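
To make it a bit more concrete, the kind of control flow I have in mind
is roughly the following (just a sketch, untested; get_virtiomem_current()
and set_virtiomem_size() are placeholders for whatever qom-get/qom-set
wrappers the patch ends up with, and the hash layout is approximate):

    my $timeout = 10; # seconds, to keep a margin over the 2-3 seconds you measured

    my $starttime = time();
    while (scalar(grep { !$_->{done} } values $virtiomems->%*)) {
        if (time() - $starttime > $timeout) {
            for my $id (grep { !$virtiomems->{$_}->{done} } keys $virtiomems->%*) {
                # issue the "reset" request back to the last known value as a best effort
                eval { set_virtiomem_size($vmid, $id, $virtiomems->{$id}->{current}) };
                warn $@ if $@;
            }
            # record our best guess in the config before bailing out
            update_memory_in_config($conf, $virtiomems); # placeholder
            die "timeout while waiting for virtio-mem device(s) to settle\n";
        }
        for my $id (grep { !$virtiomems->{$_}->{done} } keys $virtiomems->%*) {
            my $current = get_virtiomem_current($vmid, $id); # placeholder for the qom-get call
            $virtiomems->{$id}->{done} = 1 if $current == $virtiomems->{$id}->{target};
        }
        sleep(1);
    }

That way, a single "hanging" device results in a clear error rather than
shifting its part of the request to the other devices.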

>> Would it actually be better to just fill up the first, then the
>> second
>> etc. as needed, rather than balancing? My gut feeling is that having
>> fewer "active" devices is better. But this would have to be tested
>> with
>> some benchmarks of course.
> 
> Well, from a NUMA perspective, you really want to balance as much as
> possible. (That's why, with classic hotplug, we add/remove DIMMs on
> each socket alternately.)
> 
> That's the whole point of NUMA: processes read the memory nearest to
> the processor they are running on.
> 
> That's a main advantage of virtio-mem vs ballooning (which doesn't
> handle NUMA and removes pages randomly on any socket).

Makes sense. Thanks for the explanation!
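
So, just to make sure I understand the intended behavior: ignoring
block-size rounding and per-device limits, the balancing essentially
boils down to something like this (simplified sketch with made-up
variable names)?

    # distribute the requested change evenly across the per-socket
    # virtio-mem devices instead of filling them up one after the other
    my @ids = sort keys $virtiomems->%*;
    my $delta_per_device = int(($target_total - $current_total) / scalar(@ids));

    for my $id (@ids) {
        $virtiomems->{$id}->{target} = $virtiomems->{$id}->{current} + $delta_per_device;
    }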