From: Adam Kalisz via pve-devel <pve-devel@lists.proxmox.com>
To: Dominik Csapak, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Cc: Adam Kalisz
Date: Tue, 08 Jul 2025 17:08:11 +0200
Subject: Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Message-ID: <409f8c3d-dd5b-4769-99f4-a17b2de773cd@proxmox.com> (in reply to)
References: <20250708084900.1068146-1-d.csapak@proxmox.com>
On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
> On 7/8/25 12:04, Adam Kalisz wrote:
> > Hi Dominik,
>
> Hi,
>
> > this is a big improvement, I have done some performance
> > measurements again:
> >
> > Ryzen:
> > 4 worker threads:
> > restore image complete (bytes=53687091200, duration=52.06s,
> > speed=983.47MB/s)
> > 8 worker threads:
> > restore image complete (bytes=53687091200, duration=50.12s,
> > speed=1021.56MB/s)
> >
> > 4 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=54.00s,
> > speed=948.22MB/s)
> > 8 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=50.43s,
> > speed=1015.25MB/s)
> > 8 worker threads, 4 max-blocking, 32 buffered futures:
> > restore image complete (bytes=53687091200, duration=52.11s,
> > speed=982.53MB/s)
> >
> > Xeon:
> > 4 worker threads:
> > restore image complete (bytes=10737418240, duration=3.06s,
> > speed=3345.97MB/s)
> > restore image complete (bytes=107374182400, duration=139.80s,
> > speed=732.47MB/s)
> > restore image complete (bytes=107374182400, duration=136.67s,
> > speed=749.23MB/s)
> > 8 worker threads:
> > restore image complete (bytes=10737418240, duration=2.50s,
> > speed=4095.30MB/s)
> > restore image complete (bytes=107374182400, duration=127.14s,
> > speed=805.42MB/s)
> > restore image complete (bytes=107374182400, duration=121.39s,
> > speed=843.59MB/s)
>
> just for my understanding: you left the parallel futures at 16 and
> changed the threads in the tokio runtime?

Yes, that's correct.

> The biggest issue here is that we probably don't want to increase
> that number by default by much, since on e.g. a running system this
> will have an impact on other running VMs. Adjusting such a number
> (especially in a way where it's now actually used in contrast to
> before) will come as a surprise for many.
>
> That's IMHO the biggest challenge here, that's why I did not touch
> the tokio runtime thread settings, to not increase the load too much.
>
> Did you by any chance observe the CPU usage during your tests?
> As I wrote in my commit message, the CPU usage quadrupled
> (proportional to the more chunks we could put through) when using 16
> fetching tasks.

Yes, please see the mpstat attachments. The one with yesterday's date
is for the first patch; of the two from today, the first is today's
patch without changes and the second is with 8 worker threads. All of
them use 16 buffered futures.

> Just an additional note: With my solution, the blocking threads should
> not have any impact at all, since the fetching should be purely
> async (so no blocking code anywhere) and the writing is done in the
> main thread/task in sequence, so no blocking threads will be used
> except one.

I actually see a slight negative impact, but that's based on very few
runs.
> > On the Ryzen system I was hitting:
> > With 8-way concurrency, 16 max-blocking threads:
> > restore image complete (bytes=53687091200, avg fetch
> > time=24.7342ms, avg time per nonzero write=1.6474ms, storage
> > nonzero total write time=19.996s, duration=45.83s,
> > speed=1117.15MB/s)
> > -> this seemed to be the best setting for this system
> >
> > It seems the counting of zeroes works in some kind of steps (seen
> > on the Xeon system with mostly incompressible data):
>
> yes, only whole zero chunks will be counted.
>
> [snip]
> >
> > Especially during a restore the speed is quite important if you
> > need to hit Restore Time Objectives under SLAs. That's why we were
> > targeting 1 GBps for incompressible data.
>
> I get that, but this will always be a tradeoff between CPU load and
> throughput and we have to find a good middle ground here.

Sure, I am not disputing that for the default configuration.

> IMO with my current patch, we have a very good improvement already,
> without increasing the (theoretical) load on the system.
>
> It could be OK from my POV to make the number of threads of the
> runtime configurable e.g. via vzdump.conf. (That's a thing that's
> easily explainable in the docs for admins)

That would be great, because some users have other priorities depending
on the operational situation. The nice thing about the ENV override in
my submission is that if somebody runs

  PBS_RESTORE_CONCURRENCY=8 qmrestore ...

they can change the priority of the restore ad hoc, e.g. if they really
need to restore as quickly as possible they can throw more threads at
the problem within some reasonable bounds. In some cases they restore a
VM that isn't running, so the resources it would otherwise consume are
available on the system. Again, I agree the defaults should be
conservative.

> If someone else (@Fabian?) wants to chime in to this discussion,
> I'd be glad.
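To illustrate, such an ad-hoc override boils down to parsing and
clamping one environment variable before building the runtime. A
minimal std-only sketch; the variable name comes from my submission,
but the default of 4 and the upper bound of 32 are illustrative
assumptions, not values from any patch:

```rust
/// Decide how many worker threads the restore runtime should use.
/// `raw` is the value of PBS_RESTORE_CONCURRENCY, if set. Unparseable
/// values fall back to a conservative default; large values are
/// clamped so an admin cannot accidentally starve running VMs.
fn restore_concurrency(raw: Option<&str>) -> usize {
    const DEFAULT: usize = 4; // assumed conservative default
    const MAX: usize = 32; // assumed upper bound

    raw.and_then(|v| v.parse::<usize>().ok())
        .map(|n| n.clamp(1, MAX))
        .unwrap_or(DEFAULT)
}

fn main() {
    // In the real binary this would come from the process environment
    // and feed into the tokio runtime builder's worker-thread count.
    let raw = std::env::var("PBS_RESTORE_CONCURRENCY").ok();
    let threads = restore_concurrency(raw.as_deref());
    println!("worker threads: {}", threads);
}
```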
>
> Also feedback on my code in general would be nice ;)
> (There are probably better ways to make this concurrent in an
> async context, e.g. maybe using 'async-channel' + fixed number
> of tasks?)

Having more timing information about how long the fetches and writes of
nonzero chunks take would be great for making informed estimates about
the effect of performance settings, should there be any. It would also
help with the benchmarking right now, to see where we are saturated.

Do I understand correctly that writing a chunk to storage with the
write_data_callback or write_zero_callback blocks a worker from
fetching chunks?

From what I gathered, we have to have a single thread that writes the
VM image, because otherwise we would have problems with concurrent
access. We need to feed this thread as well as possible with a mix of
zero chunks and non-zero chunks. The zero chunks are cheap because we
generate them from information we already have in memory. The non-zero
chunks we have to fetch over the network (or from cache, which is still
more expensive than zero chunks).

If I understand correctly, if we fill the read queue with non-zero
chunks and wait for them to become available, we will not be writing
any zero chunks that come after them to storage, and our bottleneck,
the storage writer thread, will sit idle and hungry for more chunks.

My original solution basically wrote all the zero chunks first and then
worked through the non-zero chunks. This split seems to mostly avoid
the cost difference between zero and non-zero chunks keeping futures
slots occupied. However, I had not considered the memory consumption of
the chunk_futures vector, which might grow very big for multi-TB VM
backups.

Still, the idea of having cheap filler zero chunks that we can always
write when no non-zero chunk is available is perhaps not so bad,
especially for NVMe systems.
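The 'async-channel' + fixed-number-of-tasks shape mentioned above can
be modeled with plain threads and a bounded channel: a fixed pool of
fetchers feeds exactly one writer, and the channel capacity plays the
role of the buffered-futures limit (backpressure on the fetchers). A
simplified, self-contained sketch; the fetch is simulated and all
names are illustrative, not the actual proxmox-backup-qemu API:

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for fetching one non-zero chunk over the network.
fn fetch_chunk(index: usize) -> Vec<u8> {
    vec![index as u8; 4]
}

/// Fixed fetcher pool + single writer. Returns total bytes "written".
fn restore(num_chunks: usize, num_fetchers: usize) -> usize {
    // Bounded channel: fetchers block once 16 results are in flight,
    // analogous to the 16-buffered-futures limit in the discussion.
    let (result_tx, result_rx) = sync_channel::<(usize, Vec<u8>)>(16);
    let work = Arc::new(Mutex::new((0..num_chunks).collect::<Vec<_>>()));

    let mut handles = Vec::new();
    for _ in 0..num_fetchers {
        let tx = result_tx.clone();
        let work = Arc::clone(&work);
        handles.push(thread::spawn(move || loop {
            // Pull the next chunk index from the shared work queue.
            let index = match work.lock().unwrap().pop() {
                Some(i) => i,
                None => break,
            };
            tx.send((index, fetch_chunk(index))).unwrap();
        }));
    }
    // Drop our copy so the writer loop ends when all fetchers finish.
    drop(result_tx);

    // Single writer: the only place the image would ever be touched,
    // so there is no concurrent access to the output.
    let mut written = 0;
    for (_index, data) in result_rx {
        written += data.len();
    }
    for h in handles {
        h.join().unwrap();
    }
    written
}

fn main() {
    println!("bytes written: {}", restore(100, 8));
}
```

In this shape the writer could also be taught to emit pending zero
chunks whenever the channel is momentarily empty, which is the
"filler" idea from the paragraph above.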
For hard drives I imagine the linear strategy might be faster, because
it should avoid some expensive seeks. Would it make sense to have a
reasonable buffer of zero chunks ready for writing while we fetch
non-zero chunks over the network?

Is this thought process correct?

Btw. I am not on the PBS list, so to avoid getting stuck in a
moderation queue there I am posting only to PVE devel.

Adam

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel