From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <f.ebner@proxmox.com>
Received: from firstgate.proxmox.com (firstgate.proxmox.com [212.224.123.68])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by lists.proxmox.com (Postfix) with ESMTPS id 72F7E90737
 for <pve-devel@lists.proxmox.com>; Fri,  3 Feb 2023 14:46:39 +0100 (CET)
Received: from firstgate.proxmox.com (localhost [127.0.0.1])
 by firstgate.proxmox.com (Proxmox) with ESMTP id 5CDC654A2
 for <pve-devel@lists.proxmox.com>; Fri,  3 Feb 2023 14:46:39 +0100 (CET)
Received: from proxmox-new.maurer-it.com (proxmox-new.maurer-it.com
 [94.136.29.106])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by firstgate.proxmox.com (Proxmox) with ESMTPS
 for <pve-devel@lists.proxmox.com>; Fri,  3 Feb 2023 14:46:38 +0100 (CET)
Received: from proxmox-new.maurer-it.com (localhost.localdomain [127.0.0.1])
 by proxmox-new.maurer-it.com (Proxmox) with ESMTP id 0BCA4452D4;
 Fri,  3 Feb 2023 14:46:38 +0100 (CET)
Message-ID: <f9758e76-2a9c-cef9-412e-1b85a5d70c8b@proxmox.com>
Date: Fri, 3 Feb 2023 14:46:36 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.6.0
From: Fiona Ebner <f.ebner@proxmox.com>
To: pve-devel@lists.proxmox.com, "aderumier@odiso.com" <aderumier@odiso.com>
References: <20230202110344.840195-1-aderumier@odiso.com>
 <20230202110344.840195-12-aderumier@odiso.com>
Content-Language: en-US
In-Reply-To: <20230202110344.840195-12-aderumier@odiso.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-SPAM-LEVEL: Spam detection results:  0
 AWL 0.042 Adjusted score from AWL reputation of From: address
 BAYES_00                 -1.9 Bayes spam probability is 0 to 1%
 KAM_DMARC_STATUS 0.01 Test Rule for DKIM or SPF Failure with Strict Alignment
 NICE_REPLY_A            -0.09 Looks like a legit reply (A)
 SPF_HELO_NONE           0.001 SPF: HELO does not publish an SPF Record
 SPF_PASS               -0.001 SPF: sender matches SPF record
Subject: Re: [pve-devel] [PATCH v3 qemu-server 11/13] memory: add virtio-mem
 support
X-BeenThere: pve-devel@lists.proxmox.com
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Proxmox VE development discussion <pve-devel.lists.proxmox.com>
List-Unsubscribe: <https://lists.proxmox.com/cgi-bin/mailman/options/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=unsubscribe>
List-Archive: <http://lists.proxmox.com/pipermail/pve-devel/>
List-Post: <mailto:pve-devel@lists.proxmox.com>
List-Help: <mailto:pve-devel-request@lists.proxmox.com?subject=help>
List-Subscribe: <https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel>, 
 <mailto:pve-devel-request@lists.proxmox.com?subject=subscribe>
X-List-Received-Date: Fri, 03 Feb 2023 13:46:39 -0000

Am 02.02.23 um 12:03 schrieb Alexandre Derumier:
> @@ -157,6 +168,109 @@ sub get_current_memory {
>      return $memory->{current};
>  }
>  
> +sub get_virtiomem_block_size {
> +    my ($conf) = @_;
> +
> +    my $sockets = $conf->{sockets} || 1;
> +    my $MAX_MEM = get_max_mem($conf);
> +    my $static_memory = get_static_mem($conf, $sockets);
> +    my $memory = get_current_memory($conf->{memory});
> +
> +    #virtiomem can map 32000 block size.
> +    #try to use lowest blocksize, lower = more chance to unplug memory.
> +    my $blocksize = ($MAX_MEM - $static_memory) / 32000;
> +    #2MB is the minimum to be aligned with THP
> +    $blocksize = 2 if $blocksize < 2;
> +    $blocksize = 2**(ceil(log($blocksize)/log(2)));
> +    #Linux guest kernel only support 4MiB block currently (kernel <= 6.2)
> +    $blocksize = 4 if $blocksize < 4;
> +
> +    return $blocksize;
> +}
> +
> +my sub get_virtiomem_total_current_size {
> +    my ($mems) = @_;
> +    my $size = 0;
> +    for my $mem (values %$mems) {
> +	$size += $mem->{current};
> +    }
> +    return $size;
> +}
> +
> +my sub balance_virtiomem {

The function should die if not all memory can be plugged.

> +    my ($vmid, $virtiomems, $blocksize, $target_virtiomem_total) = @_;
> +
> +    my $nb_virtiomem = keys %$virtiomems;

Style nit: explicit scalar() would be nice.

> +
> +    print"try to balance memory on $nb_virtiomem virtiomems\n";
> +    die "error. no more available blocks in virtiomem to balance the remaining memory" if $target_virtiomem_total < 0;
> +
> +    #if we can't share exactly the same amount, we add the remainder on last node
> +    my $virtiomem_target_aligned = int( $target_virtiomem_total / $nb_virtiomem / $blocksize) * $blocksize;
> +    my $virtiomem_target_remaining = $target_virtiomem_total - ($virtiomem_target_aligned * ($nb_virtiomem-1));
> +
> +    my $i = 0;
> +    foreach my $id (sort keys %$virtiomems) {
> +	my $virtiomem = $virtiomems->{$id};
> +	$i++;
> +	my $virtiomem_target = $i == $nb_virtiomem ? $virtiomem_target_remaining : $virtiomem_target_aligned;
> +	$virtiomem->{completed} = 0;
> +	$virtiomem->{retry} = 0;
> +	$virtiomem->{target} = $virtiomem_target;
> +
> +	print "virtiomem$id: set-requested-size : $virtiomem_target\n";
> +	mon_cmd($vmid, 'qom-set', path => "/machine/peripheral/virtiomem$id", property => "requested-size", value => $virtiomem_target * 1024 * 1024);

Style nit: some lines are over 100 chars and some quite a bit

> +    }
> +
> +    while (1) {
> +
> +	sleep 1;
> +	my $total_finished = 0;
> +
> +	foreach my $id (keys %$virtiomems) {
> +
> +	    my $virtiomem = $virtiomems->{$id};
> +
> +	    if ($virtiomem->{error} || $virtiomem->{completed}) {
> +		$total_finished++;
> +		next;
> +	    }
> +
> +	    my $size = mon_cmd($vmid, 'qom-get', path => "/machine/peripheral/virtiomem$id", property => "size");
> +	    $virtiomem->{current} = $size / 1024 / 1024;
> +	    print"virtiomem$id: virtiomem->last: $virtiomem->{last} virtiomem->current: $virtiomem->{current} virtio_mem_target:$virtiomem->{target}\n";
> +
> +	    if($virtiomem->{current} == $virtiomem->{target}) {
> +		print"virtiomem$id: completed\n";
> +		$virtiomem->{completed} = 1;
> +		next;
> +	    }
> +
> +	    if($virtiomem->{current} != $virtiomem->{last}) {
> +		#if value has changed, but not yet completed
> +		print "virtiomem$id: changed but don't not reach target yet\n";
> +		$virtiomem->{retry} = 0;
> +		$virtiomem->{last} = $virtiomem->{current};
> +		next;
> +	    }
> +
> +	    if($virtiomem->{retry} >= 5) {
> +		print "virtiomem$id: too many retry. set error\n";
> +		$virtiomem->{error} = 1;
> +		#as change is async, we don't want that value change after the api call
> +		eval {
> +		    mon_cmd($vmid, 'qom-set', path => "/machine/peripheral/virtiomem$id", property => "requested-size", value => $virtiomem->{current} * 1024 *1024);
> +		};
> +	    }
> +	    print"virtiomem$id: increase retry: $virtiomem->{retry}\n";
> +	    $virtiomem->{retry}++;
> +	}
> +
> +	my $nb_virtiomem = keys %$virtiomems;

Redeclares variable. The number can't change from before, or am I
missing something?

> +	return if $total_finished == $nb_virtiomem;

Style nit: could also use
while ($total_finished != $nb_virtiomem)
at the beginning of the loop.

> +    }
> +}
> +
>  sub get_numa_node_list {
>      my ($conf) = @_;
>      my @numa_map;
> @@ -237,7 +351,39 @@ sub qemu_memory_hotplug {
>      my $MAX_MEM = get_max_mem($conf);
>      die "you cannot add more memory than max mem $MAX_MEM MB!\n" if $value > $MAX_MEM;
>  
> -    if ($value > $memory) {
> +    my $confmem = parse_memory($conf->{memory});

You can just re-use the existing $oldmem?

> +
> +    if ($confmem->{virtio}) {
> +	my $blocksize = get_virtiomem_block_size($conf);
> +
> +	my $virtiomems = {};
> +
> +	for (my $i = 0; $i < $sockets; $i++) {
> +	    my $size = mon_cmd($vmid, 'qom-get', path => "/machine/peripheral/virtiomem$i", property => "size");
> +	    $size = $size / 1024 /1024;
> +	    $virtiomems->{$i} = {
> +		current => $size,
> +		last => $size,
> +		error => 0,
> +		completed => 0,
> +		retry => 0
> +	    };
> +	}
> +
> +	my $target_virtiomem_total = $value - $static_memory;
> +	my $err;
> +	eval {
> +	    balance_virtiomem($vmid, $virtiomems, $blocksize, $target_virtiomem_total);
> +	};
> +	$err = $@ if $@;
> +
> +	my $current_memory = $static_memory + get_virtiomem_total_current_size($virtiomems);
> +	$newmem->{current} = $current_memory;
> +	$conf->{memory} = print_memory($newmem);
> +	PVE::QemuConfig->write_config($vmid, $conf);
> +	die $err if $err;
> +
> +    } elsif ($value > $memory) {
>  
>  	my $numa_hostmap;
>  
> @@ -441,17 +590,42 @@ sub config {
>      }
>  
>      if ($hotplug) {
> -	foreach_dimm($conf, $vmid, $memory, $static_memory, sub {
> -	    my ($conf, $vmid, $name, $dimm_size, $numanode, $current_size, $memory) = @_;
>  
> -	    my $mem_object = print_mem_object($conf, "mem-$name", $dimm_size);
> +	my $confmem = parse_memory($conf->{memory});
> +
> +	if ($confmem->{'virtio'}) {
>  
> -	    push @$cmd, "-object" , $mem_object;
> -	    push @$cmd, "-device", "pc-dimm,id=$name,memdev=mem-$name,node=$numanode";
> +	    my $MAX_MEM = get_max_mem($conf);
> +	    my $node_maxmem = ($MAX_MEM - $static_memory) / $sockets;
> +	    my $node_mem = ($memory - $static_memory) / $sockets;

Nit: If the number of $sockets is not a power of 2, I think this breaks.
But I guess we already don't support it. Running current version without
your patches (for a VM with memory hotplug):
root@pve701 ~ # qm set 131 --sockets 3
update VM 131: -sockets 3
root@pve701 ~ # qm set 131 -memory 8192
update VM 131: -memory 8192
root@pve701 ~ # qm start 131
kvm: total memory for NUMA nodes (0x3fffffff) should equal RAM size
(0x40000000)
start failed: QEMU exited with code 1

I guess we can just fix it up together with the existing rounding issue
if/when somebody complains about it ;)

> +	    my $blocksize = get_virtiomem_block_size($conf);
>  
> -	    die "memory size ($memory) must be aligned to $dimm_size for hotplugging\n"
> -		if $current_size > $memory;
> -	});
> +	    die "memory need to be a multiple of $blocksize MiB with maxmemory $MAX_MEM MiB when virtiomem is enabled\n"
> +		if $memory % $blocksize != 0;
> +
> +	    for (my $i = 0; $i < $sockets; $i++)  {
> +
> +		my $id = "virtiomem$i";
> +		my $mem_object = print_mem_object($conf, "mem-$id", $node_maxmem);
> +		push @$cmd, "-object" , "$mem_object,reserve=off";
> +
> +		my $mem_device = "virtio-mem-pci,block-size=${blocksize}M,requested-size=${node_mem}M,id=$id,memdev=mem-$id,node=$i";
> +		$mem_device .= ",prealloc=on" if $conf->{hugepages};
> +		$mem_devices->{$id} = $mem_device;
> +	    }
> +	} else {
> +	    foreach_dimm($conf, $vmid, $memory, $static_memory, sub {
> +		my ($conf, $vmid, $name, $dimm_size, $numanode, $current_size, $memory) = @_;
> +
> +		my $mem_object = print_mem_object($conf, "mem-$name", $dimm_size);
> +
> +		push @$cmd, "-object" , $mem_object;
> +		push @$cmd, "-device", "pc-dimm,id=$name,memdev=mem-$name,node=$numanode";
> +
> +		die "memory size ($memory) must be aligned to $dimm_size for hotplugging\n"
> +		    if $current_size > $memory;
> +	    });
> +	}
>      }
>  }
>