From: Adam Kalisz via pve-devel <pve-devel@lists.proxmox.com>
To: Dominik Csapak, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Cc: Adam Kalisz
Date: Tue, 08 Jul 2025 17:08:11 +0200
Subject: Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Message-ID: <409f8c3d-dd5b-4769-99f4-a17b2de773cd@proxmox.com> (in reply to)
References: <20250708084900.1068146-1-d.csapak@proxmox.com>
On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
> On 7/8/25 12:04, Adam Kalisz wrote:
> > Hi Dominik,
>
> Hi,
>
> > this is a big improvement, I have done some performance
> > measurements again:
> >
> > Ryzen:
> > 4 worker threads:
> > restore image complete (bytes=53687091200, duration=52.06s,
> > speed=983.47MB/s)
> > 8 worker threads:
> > restore image complete (bytes=53687091200, duration=50.12s,
> > speed=1021.56MB/s)
> >
> > 4 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=54.00s,
> > speed=948.22MB/s)
> > 8 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=50.43s,
> > speed=1015.25MB/s)
> > 8 worker threads, 4 max-blocking, 32 buffered futures:
> > restore image complete (bytes=53687091200, duration=52.11s,
> > speed=982.53MB/s)
> >
> > Xeon:
> > 4 worker threads:
> > restore image complete (bytes=10737418240, duration=3.06s,
> > speed=3345.97MB/s)
> > restore image complete (bytes=107374182400, duration=139.80s,
> > speed=732.47MB/s)
> > restore image complete (bytes=107374182400, duration=136.67s,
> > speed=749.23MB/s)
> > 8 worker threads:
> > restore image complete (bytes=10737418240, duration=2.50s,
> > speed=4095.30MB/s)
> > restore image complete (bytes=107374182400, duration=127.14s,
> > speed=805.42MB/s)
> > restore image complete (bytes=107374182400, duration=121.39s,
> > speed=843.59MB/s)
>
> just for my understanding: you left the parallel futures at 16 and
> changed the threads in the tokio runtime?

Yes, that's correct.

> The biggest issue here is that we probably don't want to increase
> that number by default by much, since on e.g. a running system this
> will have an impact on other running VMs. Adjusting such a number
> (especially in a way where it's now actually used in contrast to
> before) will come as a surprise for many.
>
> That's IMHO the biggest challenge here, that's why I did not touch
> the tokio runtime thread settings, to not increase the load too much.
>
> Did you by any chance observe the CPU usage during your tests?
> As I wrote in my commit message, the CPU usage quadrupled
> (proportional to the more chunks we could put through) when using 16
> fetching tasks.

Yes, please see the mpstat attachments. The one with yesterday's date
is for the first patch; of the two from today, the first is today's
patch without changes and the second is with 8 worker threads. All of
them use 16 buffered futures.

> Just an additional note: With my solution, the blocking threads should
> not have any impact at all, since the fetching should be purely
> async (so no blocking code anywhere) and the writing is done in the
> main thread/task in sequence, so no blocking threads will be used
> except one.

I actually see a slight negative impact, but that's based on very few
runs.
> > On the Ryzen system I was hitting:
> > With 8-way concurrency, 16 max-blocking threads:
> > restore image complete (bytes=53687091200, avg fetch
> > time=24.7342ms, avg time per nonzero write=1.6474ms, storage
> > nonzero total write time=19.996s, duration=45.83s,
> > speed=1117.15MB/s)
> > -> this seemed to be the best setting for this system
> >
> > It seems the counting of zeroes works in some kind of steps (seen
> > on the Xeon system with mostly incompressible data):
>
> yes, only whole zero chunks will be counted.
>
> [snip]
> >
> > Especially during a restore the speed is quite important if you
> > need to hit Restore Time Objectives under SLAs. That's why we were
> > targeting 1 GBps for incompressible data.
>
> I get that, but this will always be a tradeoff between CPU load and
> throughput and we have to find a good middle ground here.

Sure, I am not disputing that for the default configuration.

> IMO with my current patch, we have a very good improvement already,
> without increasing the (theoretical) load on the system.
>
> It could be OK from my POV to make the number of threads of the
> runtime configurable e.g. via vzdump.conf. (That's a thing that's
> easily explainable in the docs for admins)

That would be great, because some users have other priorities depending
on the operational situation. The nice thing about the ENV override in
my submission is that if somebody runs

  PBS_RESTORE_CONCURRENCY=8 qmrestore ...

they can change the priority of the restore ad hoc, e.g. if they really
need to restore as quickly as possible they can throw more threads at
the problem within some reasonable bounds. In some cases they restore a
VM that isn't running, so the resources it would otherwise consume are
available on the system. Again, I agree the defaults should be
conservative.

> If someone else (@Fabian?) wants to chime in to this discussion,
> I'd be glad.
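To illustrate, such an ad-hoc override boils down to parsing and
clamping one environment variable before building the runtime. A
minimal std-only sketch; the variable name comes from my submission,
but the default of 4 and the upper bound of 32 are illustrative
assumptions, not values from any patch:

```rust
/// Decide how many worker threads the restore runtime should use.
/// `raw` is the value of PBS_RESTORE_CONCURRENCY, if set. Unparseable
/// values fall back to a conservative default; large values are
/// clamped so an admin cannot accidentally starve running VMs.
fn restore_concurrency(raw: Option<&str>) -> usize {
    const DEFAULT: usize = 4; // assumed conservative default
    const MAX: usize = 32; // assumed upper bound

    raw.and_then(|v| v.parse::<usize>().ok())
        .map(|n| n.clamp(1, MAX))
        .unwrap_or(DEFAULT)
}

fn main() {
    // In the real binary this would come from the process environment
    // and feed into the tokio runtime builder's worker-thread count.
    let raw = std::env::var("PBS_RESTORE_CONCURRENCY").ok();
    let threads = restore_concurrency(raw.as_deref());
    println!("worker threads: {}", threads);
}
```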
>
> Also feedback on my code in general would be nice ;)
> (There are probably better ways to make this concurrent in an
> async context, e.g. maybe using 'async-channel' + fixed number
> of tasks?)

Having more timing information about how long the fetches and writes of
nonzero chunks take would be great for making informed estimates about
the effect of performance settings, should there be any. It would also
help with the benchmarking right now, to see where we are saturated.

Do I understand correctly that writing a chunk to storage with the
write_data_callback or write_zero_callback blocks a worker from
fetching chunks?

From what I gathered, we have to have a single thread that writes the
VM image, because otherwise we would have problems with concurrent
access. We need to feed this thread as well as possible with a mix of
zero chunks and non-zero chunks. The zero chunks are cheap because we
generate them from information we already have in memory. The non-zero
chunks we have to fetch over the network (or from cache, which is still
more expensive than zero chunks).

If I understand correctly, if we fill the read queue with non-zero
chunks and wait for them to become available, we will not be writing
any zero chunks that come after them to storage, and our bottleneck,
the storage writer thread, will sit idle and hungry for more chunks.

My original solution basically wrote all the zero chunks first and then
worked through the non-zero chunks. This split seems to mostly avoid
the cost difference between zero and non-zero chunks keeping futures
slots occupied. However, I had not considered the memory consumption of
the chunk_futures vector, which might grow very big for multi-TB VM
backups.

Still, the idea of having cheap filler zero chunks that we can always
write when no non-zero chunk is available is perhaps not so bad,
especially for NVMe systems.
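The 'async-channel' + fixed-number-of-tasks shape mentioned above can
be modeled with plain threads and a bounded channel: a fixed pool of
fetchers feeds exactly one writer, and the channel capacity plays the
role of the buffered-futures limit (backpressure on the fetchers). A
simplified, self-contained sketch; the fetch is simulated and all
names are illustrative, not the actual proxmox-backup-qemu API:

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for fetching one non-zero chunk over the network.
fn fetch_chunk(index: usize) -> Vec<u8> {
    vec![index as u8; 4]
}

/// Fixed fetcher pool + single writer. Returns total bytes "written".
fn restore(num_chunks: usize, num_fetchers: usize) -> usize {
    // Bounded channel: fetchers block once 16 results are in flight,
    // analogous to the 16-buffered-futures limit in the discussion.
    let (result_tx, result_rx) = sync_channel::<(usize, Vec<u8>)>(16);
    let work = Arc::new(Mutex::new((0..num_chunks).collect::<Vec<_>>()));

    let mut handles = Vec::new();
    for _ in 0..num_fetchers {
        let tx = result_tx.clone();
        let work = Arc::clone(&work);
        handles.push(thread::spawn(move || loop {
            // Pull the next chunk index from the shared work queue.
            let index = match work.lock().unwrap().pop() {
                Some(i) => i,
                None => break,
            };
            tx.send((index, fetch_chunk(index))).unwrap();
        }));
    }
    // Drop our copy so the writer loop ends when all fetchers finish.
    drop(result_tx);

    // Single writer: the only place the image would ever be touched,
    // so there is no concurrent access to the output.
    let mut written = 0;
    for (_index, data) in result_rx {
        written += data.len();
    }
    for h in handles {
        h.join().unwrap();
    }
    written
}

fn main() {
    println!("bytes written: {}", restore(100, 8));
}
```

In this shape the writer could also be taught to emit pending zero
chunks whenever the channel is momentarily empty, which is the
"filler" idea from the paragraph above.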
For hard drives I imagine the linear strategy might be faster, because
it should avoid some expensive seeks. Would it make sense to have a
reasonable buffer of zero chunks ready for writing while we fetch
non-zero chunks over the network?

Is this thought process correct?

Btw. I am not on the PBS list, so to avoid getting stuck in a
moderation queue there I am posting only to PVE devel.

Adam

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel