* Re: [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
  [not found] <20250708084900.1068146-1-d.csapak@proxmox.com>
@ 2025-07-08  8:52 ` Dominik Csapak
  2025-07-08  9:03 ` Thomas Lamprecht
  2025-07-08 10:04 ` [pve-devel] " Adam Kalisz via pve-devel
  [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
  2 siblings, 1 reply; 7+ messages in thread
From: Dominik Csapak @ 2025-07-08 8:52 UTC (permalink / raw)
To: pbs-devel, Proxmox VE development discussion

Sorry, I sent the patch to the wrong list, this was originally meant for
pve-devel instead. Should I resend it there?

On 7/8/25 10:49, Dominik Csapak wrote:
> by using async futures to load chunks and stream::buffer_unordered to
> buffer up to 16 of them, depending on write/load speed, use tokio's task
> spawn to make sure they continue to run in the background, since
> buffer_unordered starts them, but does not poll them to completion
> unless we're awaiting.
>
> With this, we don't need to increase the number of threads in the
> runtime to trigger parallel reads and network traffic to us. This way
> it's only limited by CPU if decoding and/or decrypting is the
> bottleneck.
>
> I measured restoring a VM backup with a 60GiB disk (filled with ~42GiB
> data) and fast storage over a local network link (from PBS VM to the
> host). I let it run 3 times, but the variance was not that big, so
> here's some representative log output with various
> MAX_BUFFERED_FUTURES values.
>
> benchmark   duration   speed          cpu percentage
> current     107.18s     573.25MB/s    < 100%
> 4:           44.74s    1373.34MB/s    ~ 180%
> 8:           32.30s    1902.42MB/s    ~ 290%
> 16:          25.75s    2386.44MB/s    ~ 360%
>
> I saw an increase in CPU usage proportional to the speed increase, so
> while in the current version it uses less than a single core total,
> using 16 parallel futures resulted in 3-4 available threads of the
> tokio runtime to be utilized.
>
> In general I'd like to limit the buffering somehow, but I don't think
> there is a good automatic metric we can use, and giving the admin a knob
> whose actual ramifications are hard to explain is also not good, so I
> settled for a value that showed improvement but does not seem too high.
>
> In any case, if the target and/or source storage is too slow, there will
> be back/forward pressure, and this change should only matter for storage
> systems where IO depth plays a role and that are fast enough.
>
> The way we count the finished chunks also changes a bit, since they
> can come unordered, so we can't rely on the index position to calculate
> the percentage.
>
> This patch is loosely based on the patch from Adam Kalisz[0], but removes
> the need to increase the blocking threads and uses the (actually always
> used) underlying async implementation for reading remote chunks.
>
> 0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
>
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com>
> ---
> changes from RFC v1:
> * uses tokio task spawn to actually run the fetching in the background
> * redo the counting for the task output (pos was unordered so we got
>   weird ordering sometimes)
>
> When actually running the fetching in the background the speed increase
> is much higher than just using buffer_unordered for the fetching
> futures, which is nice (although the cpu usage is much higher now).
> > Since the benchmark was much faster with higher values, I used a > different bigger VM this time around so the timings are more consistent > and it makes sure the disk does not fit in the PBS's memory. > > The question what count we should use remains though... > > src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 47 insertions(+), 16 deletions(-) > > diff --git a/src/restore.rs b/src/restore.rs > index 5a5a398..4e6c538 100644 > --- a/src/restore.rs > +++ b/src/restore.rs > @@ -2,6 +2,7 @@ use std::convert::TryInto; > use std::sync::{Arc, Mutex}; > > use anyhow::{bail, format_err, Error}; > +use futures::StreamExt; > use once_cell::sync::OnceCell; > use tokio::runtime::Runtime; > > @@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader; > use pbs_datastore::data_blob::DataChunkBuilder; > use pbs_datastore::fixed_index::FixedIndexReader; > use pbs_datastore::index::IndexFile; > -use pbs_datastore::read_chunk::ReadChunk; > +use pbs_datastore::read_chunk::AsyncReadChunk; > use pbs_datastore::BackupManifest; > use pbs_key_config::load_and_decrypt_key; > use pbs_tools::crypt_config::CryptConfig; > @@ -29,6 +30,9 @@ struct ImageAccessInfo { > archive_size: u64, > } > > +// use this many buffered futures to make loading of chunks more concurrent > +const MAX_BUFFERED_FUTURES: usize = 16; > + > pub(crate) struct RestoreTask { > setup: BackupSetup, > runtime: Arc<Runtime>, > @@ -165,26 +169,53 @@ impl RestoreTask { > > let start_time = std::time::Instant::now(); > > - for pos in 0..index.index_count() { > - let digest = index.index_digest(pos).unwrap(); > + let read_queue = (0..index.index_count()).map(|pos| { > + let digest = *index.index_digest(pos).unwrap(); > let offset = (pos * index.chunk_size) as u64; > - if digest == &zero_chunk_digest { > - let res = write_zero_callback(offset, index.chunk_size as u64); > - if res < 0 { > - bail!("write_zero_callback failed ({})", res); > + let chunk_reader = chunk_reader.clone(); > + async move { > + let chunk = if digest == zero_chunk_digest { > + None > + } else { > + let raw_data = tokio::task::spawn(async move { > + AsyncReadChunk::read_chunk(&chunk_reader, &digest).await > + }) > + .await??; > + Some(raw_data) > + }; > + > + Ok::<_, Error>((chunk, offset)) > + } > + }); > + > + // this buffers futures and pre-fetches some chunks for us > + let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES); > + > + let mut count = 0; > + while let Some(res) = stream.next().await { > + let res = res?; > + match res { > + (None, offset) => { > + let res = write_zero_callback(offset, index.chunk_size as u64); > + if res < 0 { > + bail!("write_zero_callback failed ({})", res); > + } > + bytes += index.chunk_size; > + zeroes += index.chunk_size; > } > - bytes += index.chunk_size; > - zeroes += index.chunk_size; > - } else { > - let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?; > - let res = write_data_callback(offset, &raw_data); > - if res < 0 { > - bail!("write_data_callback failed ({})", res); > + (Some(raw_data), offset) => { > + let res = write_data_callback(offset, &raw_data); > + if res < 0 { > + bail!("write_data_callback failed ({})", res); > + } > + bytes += raw_data.len(); > } > - bytes += raw_data.len(); > } > + > + count += 1; > + > if verbose { > - let next_per = ((pos + 1) * 100) / index.index_count(); > + let next_per = (count * 100) / index.index_count(); > if per != next_per { > eprintln!( > "progress {}% (read {} bytes, zeroes = {}% ({} 
bytes), duration {} sec)", _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel 2025-07-08 8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak @ 2025-07-08 9:03 ` Thomas Lamprecht 0 siblings, 0 replies; 7+ messages in thread From: Thomas Lamprecht @ 2025-07-08 9:03 UTC (permalink / raw) To: Proxmox Backup Server development discussion, Proxmox VE development discussion, Dominik Csapak Am 08.07.25 um 10:52 schrieb Dominik Csapak: > Sorry, i sent the patch to the wrong list, this was originally meant for > pve-devel instead. Should i resend it there? You can do both as this is relevant for both projects and lists can be accumulative, not just either/or. _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel [not found] <20250708084900.1068146-1-d.csapak@proxmox.com> 2025-07-08 8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak @ 2025-07-08 10:04 ` Adam Kalisz via pve-devel [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com> 2 siblings, 0 replies; 7+ messages in thread From: Adam Kalisz via pve-devel @ 2025-07-08 10:04 UTC (permalink / raw) To: Dominik Csapak, Proxmox VE development discussion; +Cc: Adam Kalisz [-- Attachment #1: Type: message/rfc822, Size: 23941 bytes --] From: Adam Kalisz <adam.kalisz@notnullmakers.com> To: Dominik Csapak <d.csapak@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com> Subject: Re: [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Date: Tue, 08 Jul 2025 12:04:56 +0200 Message-ID: <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com> Hi Dominik, this is a big improvement, I have done some performance measurements again: Ryzen: 4 worker threads: restore image complete (bytes=53687091200, duration=52.06s, speed=983.47MB/s) 8 worker threads: restore image complete (bytes=53687091200, duration=50.12s, speed=1021.56MB/s) 4 worker threads, 4 max-blocking: restore image complete (bytes=53687091200, duration=54.00s, speed=948.22MB/s) 8 worker threads, 4 max-blocking: restore image complete (bytes=53687091200, duration=50.43s, speed=1015.25MB/s) 8 worker threads, 4 max-blocking, 32 buffered futures: restore image complete (bytes=53687091200, duration=52.11s, speed=982.53MB/s) Xeon: 4 worker threads: restore image complete (bytes=10737418240, duration=3.06s, speed=3345.97MB/s) restore image complete (bytes=107374182400, duration=139.80s, speed=732.47MB/s) restore image complete (bytes=107374182400, duration=136.67s, speed=749.23MB/s) 8 worker threads: restore image complete (bytes=10737418240, duration=2.50s, speed=4095.30MB/s) restore image complete (bytes=107374182400, duration=127.14s, speed=805.42MB/s) restore image complete (bytes=107374182400, duration=121.39s, speed=843.59MB/s) Much better but it would need to be 25% faster on this older system to hit the numbers I have already seen with my solution: For comparison, with my solution on the same Xeon system I was hitting: With 8-way concurrency, 16 max-blocking threads: restore image complete (bytes=10737418240, avg fetch time=16.7572ms, avg time per nonzero write=1.9310ms, storage nonzero total write time=1.580s, duration=2.25s, speed=4551.25MB/s) restore image complete (bytes=107374182400, avg fetch time=29.1714ms, avg time per nonzero write=2.2216ms, storage nonzero total write time=55.739s, duration=106.17s, speed=964.52MB/s) restore image complete (bytes=107374182400, avg fetch time=28.2543ms, avg time per nonzero write=2.1473ms, storage nonzero total write time=54.139s, duration=103.52s, speed=989.18MB/s) With 16-way concurrency, 32 max-blocking threads: restore image complete (bytes=10737418240, avg fetch time=25.3444ms, avg time per nonzero write=2.0709ms, storage nonzero total write time=1.694s, duration=2.02s, speed=5074.13MB/s) restore image complete (bytes=107374182400, avg fetch time=53.3046ms, avg time per nonzero write=2.6692ms, storage nonzero total write time=66.969s, duration=106.65s, speed=960.13MB/s) restore image complete (bytes=107374182400, avg fetch time=47.3909ms, avg time per nonzero write=2.6352ms, storage nonzero total write 
time=66.440s, duration=98.09s, speed=1043.95MB/s) -> this seemed to be the best setting for this system On the Ryzen system I was hitting: With 8-way concurrency, 16 max-blocking threads: restore image complete (bytes=53687091200, avg fetch time=24.7342ms, avg time per nonzero write=1.6474ms, storage nonzero total write time=19.996s, duration=45.83s, speed=1117.15MB/s) -> this seemed to be the best setting for this system It seems the counting of zeroes works in some kind of steps (seen on the Xeon system with mostly incompressible data): download and verify backup index progress 1% (read 1073741824 bytes, zeroes = 0% (0 bytes), duration 1 sec) progress 2% (read 2147483648 bytes, zeroes = 0% (0 bytes), duration 2 sec) progress 3% (read 3221225472 bytes, zeroes = 0% (0 bytes), duration 3 sec) progress 4% (read 4294967296 bytes, zeroes = 0% (0 bytes), duration 5 sec) progress 5% (read 5368709120 bytes, zeroes = 0% (0 bytes), duration 6 sec) progress 6% (read 6442450944 bytes, zeroes = 0% (0 bytes), duration 7 sec) progress 7% (read 7516192768 bytes, zeroes = 0% (0 bytes), duration 8 sec) progress 8% (read 8589934592 bytes, zeroes = 0% (0 bytes), duration 10 sec) progress 9% (read 9663676416 bytes, zeroes = 0% (0 bytes), duration 11 sec) progress 10% (read 10737418240 bytes, zeroes = 0% (0 bytes), duration 12 sec) progress 11% (read 11811160064 bytes, zeroes = 0% (0 bytes), duration 14 sec) progress 12% (read 12884901888 bytes, zeroes = 0% (0 bytes), duration 15 sec) progress 13% (read 13958643712 bytes, zeroes = 0% (0 bytes), duration 16 sec) progress 14% (read 15032385536 bytes, zeroes = 0% (0 bytes), duration 18 sec) progress 15% (read 16106127360 bytes, zeroes = 0% (0 bytes), duration 19 sec) progress 16% (read 17179869184 bytes, zeroes = 0% (0 bytes), duration 20 sec) progress 17% (read 18253611008 bytes, zeroes = 0% (0 bytes), duration 21 sec) progress 18% (read 19327352832 bytes, zeroes = 0% (0 bytes), duration 23 sec) progress 19% (read 20401094656 bytes, zeroes = 0% (0 bytes), duration 24 sec) progress 20% (read 21474836480 bytes, zeroes = 0% (0 bytes), duration 25 sec) progress 21% (read 22548578304 bytes, zeroes = 0% (0 bytes), duration 27 sec) progress 22% (read 23622320128 bytes, zeroes = 0% (0 bytes), duration 28 sec) progress 23% (read 24696061952 bytes, zeroes = 0% (0 bytes), duration 29 sec) progress 24% (read 25769803776 bytes, zeroes = 0% (0 bytes), duration 31 sec) progress 25% (read 26843545600 bytes, zeroes = 1% (515899392 bytes), duration 31 sec) progress 26% (read 27917287424 bytes, zeroes = 1% (515899392 bytes), duration 33 sec) Especially during a restore the speed is quite important if you need to hit Restore Time Objectives under SLAs. That's why we were targeting 1 GBps for incompressible data. Thank you Adam On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote: > by using async futures to load chunks and stream::buffer_unordered to > buffer up to 16 of them, depending on write/load speed, use tokio's > task > spawn to make sure the continue to run in the background, since > buffer_unordered starts them, but does not poll them to completion > unless we're awaiting. > > With this, we don't need to increase the number of threads in the > runtime to trigger parallel reads and network traffic to us. This way > it's only limited by CPU if decoding and/or decrypting is the bottle > neck. > > I measured restoring a VM backup with a 60GiB disk (filled with > ~42GiB > data) and fast storage over a local network link (from PBS VM to the > host). 
Let it 3 runs, but the variance was not that big, so here's > some > representative log output with various MAX_BUFFERED_FUTURES values. > > benchmark duration speed cpu percentage > current 107.18s 573.25MB/s < 100% > 4: 44.74s 1373.34MB/s ~ 180% > 8: 32.30s 1902.42MB/s ~ 290% > 16: 25.75s 2386.44MB/s ~ 360% > > I saw an increase in CPU usage proportional to the speed increase, so > while in the current version it uses less than a single core total, > using 16 parallel futures resulted in 3-4 available threads of the > tokio runtime to be utilized. > > In general I'd like to limit the buffering somehow, but I don't think > there is a good automatic metric we can use, and giving the admin a > knob > that is hard to explain what the actual ramifications about it are is > also not good, so I settled for a value that showed improvement but > does > not seem too high. > > In any case, if the target and/or source storage is too slow, there > will > be back/forward pressure, and this change should only matter for > storage > systems where IO depth plays a role and that are fast enough. > > The way we count the finished chunks also changes a bit, since they > can come unordered, so we can't rely on the index position to > calculate > the percentage. > > This patch is loosely based on the patch from Adam Kalisz[0], but > removes > the need to increase the blocking threads and uses the (actually > always > used) underlying async implementation for reading remote chunks. > > 0: > https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/ > > Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> > Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com> > --- > changes from RFC v1: > * uses tokio task spawn to actually run the fetching in the > background > * redo the counting for the task output (pos was unordered so we got > weird ordering sometimes) > > When actually running the fetching in the background the speed > increase > is much higher than just using buffer_unordered for the fetching > futures, which is nice (altough the cpu usage is much higher now). > > Since the benchmark was much faster with higher values, I used a > different bigger VM this time around so the timings are more > consistent > and it makes sure the disk does not fit in the PBS's memory. > > The question what count we should use remains though... 
> > src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++----------- > -- > 1 file changed, 47 insertions(+), 16 deletions(-) > > diff --git a/src/restore.rs b/src/restore.rs > index 5a5a398..4e6c538 100644 > --- a/src/restore.rs > +++ b/src/restore.rs > @@ -2,6 +2,7 @@ use std::convert::TryInto; > use std::sync::{Arc, Mutex}; > > use anyhow::{bail, format_err, Error}; > +use futures::StreamExt; > use once_cell::sync::OnceCell; > use tokio::runtime::Runtime; > > @@ -13,7 +14,7 @@ use > pbs_datastore::cached_chunk_reader::CachedChunkReader; > use pbs_datastore::data_blob::DataChunkBuilder; > use pbs_datastore::fixed_index::FixedIndexReader; > use pbs_datastore::index::IndexFile; > -use pbs_datastore::read_chunk::ReadChunk; > +use pbs_datastore::read_chunk::AsyncReadChunk; > use pbs_datastore::BackupManifest; > use pbs_key_config::load_and_decrypt_key; > use pbs_tools::crypt_config::CryptConfig; > @@ -29,6 +30,9 @@ struct ImageAccessInfo { > archive_size: u64, > } > > +// use this many buffered futures to make loading of chunks more > concurrent > +const MAX_BUFFERED_FUTURES: usize = 16; > + > pub(crate) struct RestoreTask { > setup: BackupSetup, > runtime: Arc<Runtime>, > @@ -165,26 +169,53 @@ impl RestoreTask { > > let start_time = std::time::Instant::now(); > > - for pos in 0..index.index_count() { > - let digest = index.index_digest(pos).unwrap(); > + let read_queue = (0..index.index_count()).map(|pos| { > + let digest = *index.index_digest(pos).unwrap(); > let offset = (pos * index.chunk_size) as u64; > - if digest == &zero_chunk_digest { > - let res = write_zero_callback(offset, > index.chunk_size as u64); > - if res < 0 { > - bail!("write_zero_callback failed ({})", res); > + let chunk_reader = chunk_reader.clone(); > + async move { > + let chunk = if digest == zero_chunk_digest { > + None > + } else { > + let raw_data = tokio::task::spawn(async move { > + AsyncReadChunk::read_chunk(&chunk_reader, > &digest).await > + }) > + .await??; > + Some(raw_data) > + }; > + > + Ok::<_, Error>((chunk, offset)) > + } > + }); > + > + // this buffers futures and pre-fetches some chunks for us > + let mut stream = > futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTUR > ES); > + > + let mut count = 0; > + while let Some(res) = stream.next().await { > + let res = res?; > + match res { > + (None, offset) => { > + let res = write_zero_callback(offset, > index.chunk_size as u64); > + if res < 0 { > + bail!("write_zero_callback failed ({})", > res); > + } > + bytes += index.chunk_size; > + zeroes += index.chunk_size; > } > - bytes += index.chunk_size; > - zeroes += index.chunk_size; > - } else { > - let raw_data = ReadChunk::read_chunk(&chunk_reader, > digest)?; > - let res = write_data_callback(offset, &raw_data); > - if res < 0 { > - bail!("write_data_callback failed ({})", res); > + (Some(raw_data), offset) => { > + let res = write_data_callback(offset, > &raw_data); > + if res < 0 { > + bail!("write_data_callback failed ({})", > res); > + } > + bytes += raw_data.len(); > } > - bytes += raw_data.len(); > } > + > + count += 1; > + > if verbose { > - let next_per = ((pos + 1) * 100) / > index.index_count(); > + let next_per = (count * 100) / index.index_count(); > if per != next_per { > eprintln!( > "progress {}% (read {} bytes, zeroes = {}% > ({} bytes), duration {} sec)", [-- Attachment #2: Type: text/plain, Size: 160 bytes --] _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>]
* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com> @ 2025-07-08 10:58 ` Dominik Csapak 2025-07-08 15:08 ` Adam Kalisz via pve-devel [not found] ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com> 0 siblings, 2 replies; 7+ messages in thread From: Dominik Csapak @ 2025-07-08 10:58 UTC (permalink / raw) To: Adam Kalisz, Proxmox VE development discussion, Proxmox Backup Server development discussion On 7/8/25 12:04, Adam Kalisz wrote: > Hi Dominik, > Hi, > this is a big improvement, I have done some performance measurements > again: > > Ryzen: > 4 worker threads: > restore image complete (bytes=53687091200, duration=52.06s, > speed=983.47MB/s) > 8 worker threads: > restore image complete (bytes=53687091200, duration=50.12s, > speed=1021.56MB/s) > > 4 worker threads, 4 max-blocking: > restore image complete (bytes=53687091200, duration=54.00s, > speed=948.22MB/s) > 8 worker threads, 4 max-blocking: > restore image complete (bytes=53687091200, duration=50.43s, > speed=1015.25MB/s) > 8 worker threads, 4 max-blocking, 32 buffered futures: > restore image complete (bytes=53687091200, duration=52.11s, > speed=982.53MB/s) > > Xeon: > 4 worker threads: > restore image complete (bytes=10737418240, duration=3.06s, > speed=3345.97MB/s) > restore image complete (bytes=107374182400, duration=139.80s, > speed=732.47MB/s) > restore image complete (bytes=107374182400, duration=136.67s, > speed=749.23MB/s) > 8 worker threads: > restore image complete (bytes=10737418240, duration=2.50s, > speed=4095.30MB/s) > restore image complete (bytes=107374182400, duration=127.14s, > speed=805.42MB/s) > restore image complete (bytes=107374182400, duration=121.39s, > speed=843.59MB/s) just for my understanding: you left the parallel futures at 16 and changed the threads in the tokio runtime? The biggest issue here is that we probably don't want to increase that number by default by much, since on e.g. a running system this will have an impact on other running VMs. Adjusting such a number (especially in a way where it's now actually used in contrast to before) will come as a surprise for many. That's IMHO the biggest challenge here, that's why I did not touch the tokio runtime thread settings, to not increase the load too much. Did you by any chance observe the CPU usage during your tests? As I wrote in my commit message, the cpu usage quadrupled (proportional to the more chunks we could put through) when using 16 fetching tasks. Just an additional note: With my solution,the blocking threads should not have any impact at all, since the fetching should be purely async (so no blocking code anywhere) and the writing is done in the main thread/task in sequence so no blocking threads will be used except one. 
> > Much better but it would need to be 25% faster on this older system to > hit the numbers I have already seen with my solution: > > For comparison, with my solution on the same Xeon system I was hitting: > With 8-way concurrency, 16 max-blocking threads: > restore image complete (bytes=10737418240, avg fetch time=16.7572ms, > avg time per nonzero write=1.9310ms, storage nonzero total write > time=1.580s, duration=2.25s, speed=4551.25MB/s) > restore image complete (bytes=107374182400, avg fetch time=29.1714ms, > avg time per nonzero write=2.2216ms, storage nonzero total write > time=55.739s, duration=106.17s, speed=964.52MB/s) > restore image complete (bytes=107374182400, avg fetch time=28.2543ms, > avg time per nonzero write=2.1473ms, storage nonzero total write > time=54.139s, duration=103.52s, speed=989.18MB/s) > > With 16-way concurrency, 32 max-blocking threads: > restore image complete (bytes=10737418240, avg fetch time=25.3444ms, > avg time per nonzero write=2.0709ms, storage nonzero total write > time=1.694s, duration=2.02s, speed=5074.13MB/s) > restore image complete (bytes=107374182400, avg fetch time=53.3046ms, > avg time per nonzero write=2.6692ms, storage nonzero total write > time=66.969s, duration=106.65s, speed=960.13MB/s) > restore image complete (bytes=107374182400, avg fetch time=47.3909ms, > avg time per nonzero write=2.6352ms, storage nonzero total write > time=66.440s, duration=98.09s, speed=1043.95MB/s) > -> this seemed to be the best setting for this system > > On the Ryzen system I was hitting: > With 8-way concurrency, 16 max-blocking threads: > restore image complete (bytes=53687091200, avg fetch time=24.7342ms, > avg time per nonzero write=1.6474ms, storage nonzero total write > time=19.996s, duration=45.83s, speed=1117.15MB/s) > -> this seemed to be the best setting for this system > > It seems the counting of zeroes works in some kind of steps (seen on > the Xeon system with mostly incompressible data): > yes, only whole zero chunks will be counted. [snip] > > Especially during a restore the speed is quite important if you need to > hit Restore Time Objectives under SLAs. That's why we were targeting 1 > GBps for incompressible data. I get that, but this will always be a tradeoff between CPU load and throughput and we have to find a good middle ground here. IMO with my current patch, we have a very good improvement already, without increasing the (theoretical) load to the system. It could be OK from my POV to make the number of threads of the runtime configurable e.g. via vzdump.conf. (That's a thing that's easily explainable in the docs for admins) If someone else (@Fabian?) wants to chime in to this discussion, I'd be glad. Also feedback on my code in general would be nice ;) (There are probably better ways to make this concurrent in an async context, e.g. maybe using 'async-channel' + fixed number of tasks ?) Thanks Dominik > > Thank you > Adam > > On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote: >> by using async futures to load chunks and stream::buffer_unordered to >> buffer up to 16 of them, depending on write/load speed, use tokio's >> task >> spawn to make sure the continue to run in the background, since >> buffer_unordered starts them, but does not poll them to completion >> unless we're awaiting. >> >> With this, we don't need to increase the number of threads in the >> runtime to trigger parallel reads and network traffic to us. This way >> it's only limited by CPU if decoding and/or decrypting is the bottle >> neck. 
>> >> I measured restoring a VM backup with a 60GiB disk (filled with >> ~42GiB >> data) and fast storage over a local network link (from PBS VM to the >> host). Let it 3 runs, but the variance was not that big, so here's >> some >> representative log output with various MAX_BUFFERED_FUTURES values. >> >> benchmark duration speed cpu percentage >> current 107.18s 573.25MB/s < 100% >> 4: 44.74s 1373.34MB/s ~ 180% >> 8: 32.30s 1902.42MB/s ~ 290% >> 16: 25.75s 2386.44MB/s ~ 360% >> >> I saw an increase in CPU usage proportional to the speed increase, so >> while in the current version it uses less than a single core total, >> using 16 parallel futures resulted in 3-4 available threads of the >> tokio runtime to be utilized. >> >> In general I'd like to limit the buffering somehow, but I don't think >> there is a good automatic metric we can use, and giving the admin a >> knob >> that is hard to explain what the actual ramifications about it are is >> also not good, so I settled for a value that showed improvement but >> does >> not seem too high. >> >> In any case, if the target and/or source storage is too slow, there >> will >> be back/forward pressure, and this change should only matter for >> storage >> systems where IO depth plays a role and that are fast enough. >> >> The way we count the finished chunks also changes a bit, since they >> can come unordered, so we can't rely on the index position to >> calculate >> the percentage. >> >> This patch is loosely based on the patch from Adam Kalisz[0], but >> removes >> the need to increase the blocking threads and uses the (actually >> always >> used) underlying async implementation for reading remote chunks. >> >> 0: >> https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/ >> >> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> >> Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com> >> --- >> changes from RFC v1: >> * uses tokio task spawn to actually run the fetching in the >> background >> * redo the counting for the task output (pos was unordered so we got >> weird ordering sometimes) >> >> When actually running the fetching in the background the speed >> increase >> is much higher than just using buffer_unordered for the fetching >> futures, which is nice (altough the cpu usage is much higher now). >> >> Since the benchmark was much faster with higher values, I used a >> different bigger VM this time around so the timings are more >> consistent >> and it makes sure the disk does not fit in the PBS's memory. >> >> The question what count we should use remains though... 
>> >> src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++----------- >> -- >> 1 file changed, 47 insertions(+), 16 deletions(-) >> >> diff --git a/src/restore.rs b/src/restore.rs >> index 5a5a398..4e6c538 100644 >> --- a/src/restore.rs >> +++ b/src/restore.rs >> @@ -2,6 +2,7 @@ use std::convert::TryInto; >> use std::sync::{Arc, Mutex}; >> >> use anyhow::{bail, format_err, Error}; >> +use futures::StreamExt; >> use once_cell::sync::OnceCell; >> use tokio::runtime::Runtime; >> >> @@ -13,7 +14,7 @@ use >> pbs_datastore::cached_chunk_reader::CachedChunkReader; >> use pbs_datastore::data_blob::DataChunkBuilder; >> use pbs_datastore::fixed_index::FixedIndexReader; >> use pbs_datastore::index::IndexFile; >> -use pbs_datastore::read_chunk::ReadChunk; >> +use pbs_datastore::read_chunk::AsyncReadChunk; >> use pbs_datastore::BackupManifest; >> use pbs_key_config::load_and_decrypt_key; >> use pbs_tools::crypt_config::CryptConfig; >> @@ -29,6 +30,9 @@ struct ImageAccessInfo { >> archive_size: u64, >> } >> >> +// use this many buffered futures to make loading of chunks more >> concurrent >> +const MAX_BUFFERED_FUTURES: usize = 16; >> + >> pub(crate) struct RestoreTask { >> setup: BackupSetup, >> runtime: Arc<Runtime>, >> @@ -165,26 +169,53 @@ impl RestoreTask { >> >> let start_time = std::time::Instant::now(); >> >> - for pos in 0..index.index_count() { >> - let digest = index.index_digest(pos).unwrap(); >> + let read_queue = (0..index.index_count()).map(|pos| { >> + let digest = *index.index_digest(pos).unwrap(); >> let offset = (pos * index.chunk_size) as u64; >> - if digest == &zero_chunk_digest { >> - let res = write_zero_callback(offset, >> index.chunk_size as u64); >> - if res < 0 { >> - bail!("write_zero_callback failed ({})", res); >> + let chunk_reader = chunk_reader.clone(); >> + async move { >> + let chunk = if digest == zero_chunk_digest { >> + None >> + } else { >> + let raw_data = tokio::task::spawn(async move { >> + AsyncReadChunk::read_chunk(&chunk_reader, >> &digest).await >> + }) >> + .await??; >> + Some(raw_data) >> + }; >> + >> + Ok::<_, Error>((chunk, offset)) >> + } >> + }); >> + >> + // this buffers futures and pre-fetches some chunks for us >> + let mut stream = >> futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTUR >> ES); >> + >> + let mut count = 0; >> + while let Some(res) = stream.next().await { >> + let res = res?; >> + match res { >> + (None, offset) => { >> + let res = write_zero_callback(offset, >> index.chunk_size as u64); >> + if res < 0 { >> + bail!("write_zero_callback failed ({})", >> res); >> + } >> + bytes += index.chunk_size; >> + zeroes += index.chunk_size; >> } >> - bytes += index.chunk_size; >> - zeroes += index.chunk_size; >> - } else { >> - let raw_data = ReadChunk::read_chunk(&chunk_reader, >> digest)?; >> - let res = write_data_callback(offset, &raw_data); >> - if res < 0 { >> - bail!("write_data_callback failed ({})", res); >> + (Some(raw_data), offset) => { >> + let res = write_data_callback(offset, >> &raw_data); >> + if res < 0 { >> + bail!("write_data_callback failed ({})", >> res); >> + } >> + bytes += raw_data.len(); >> } >> - bytes += raw_data.len(); >> } >> + >> + count += 1; >> + >> if verbose { >> - let next_per = ((pos + 1) * 100) / >> index.index_count(); >> + let next_per = (count * 100) / index.index_count(); >> if per != next_per { >> eprintln!( >> "progress {}% (read {} bytes, zeroes = {}% >> ({} bytes), duration {} sec)", > _______________________________________________ pve-devel mailing list 
pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
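The "async-channel + fixed number of tasks" alternative floated above could take roughly the following shape. This is only a sketch under the same placeholder fetch_chunk()/write_chunk() helpers as before; nothing like it was benchmarked in this thread, and the error handling is deliberately minimal.

use anyhow::Error;

async fn fetch_chunk(_digest: [u8; 32]) -> Result<Vec<u8>, Error> { Ok(Vec::new()) }
fn write_chunk(_offset: u64, _data: &[u8]) -> Result<(), Error> { Ok(()) }

async fn restore_with_workers(
    digests: Vec<[u8; 32]>,
    chunk_size: u64,
    workers: usize,
) -> Result<(), Error> {
    // bounded job queue provides the back-pressure buffer_unordered gives today
    let (job_tx, job_rx) = async_channel::bounded::<(u64, [u8; 32])>(workers * 2);
    let (res_tx, res_rx) = async_channel::bounded::<(u64, Vec<u8>)>(workers * 2);

    // fixed pool of fetcher tasks instead of an unordered stream of futures
    for _ in 0..workers {
        let job_rx = job_rx.clone();
        let res_tx = res_tx.clone();
        tokio::spawn(async move {
            while let Ok((offset, digest)) = job_rx.recv().await {
                // fetch errors are silently dropped here to keep the sketch short
                if let Ok(data) = fetch_chunk(digest).await {
                    let _ = res_tx.send((offset, data)).await;
                }
            }
        });
    }
    // drop our copies so the result channel closes once all workers exit
    drop(res_tx);
    drop(job_rx);

    // feed jobs from a separate task so the writer below runs concurrently
    tokio::spawn(async move {
        for (pos, digest) in digests.into_iter().enumerate() {
            if job_tx.send((pos as u64 * chunk_size, digest)).await.is_err() {
                break;
            }
        }
    });

    // single sequential writer
    while let Ok((offset, data)) = res_rx.recv().await {
        write_chunk(offset, &data)?;
    }
    Ok(())
}

Compared to buffer_unordered, the fixed pool makes the concurrency knob explicit and avoids spawning one task per chunk, at the cost of more plumbing.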
* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel 2025-07-08 10:58 ` Dominik Csapak @ 2025-07-08 15:08 ` Adam Kalisz via pve-devel [not found] ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com> 1 sibling, 0 replies; 7+ messages in thread From: Adam Kalisz via pve-devel @ 2025-07-08 15:08 UTC (permalink / raw) To: Dominik Csapak, Proxmox VE development discussion; +Cc: Adam Kalisz [-- Attachment #1: Type: message/rfc822, Size: 12998 bytes --] From: Adam Kalisz <adam.kalisz@notnullmakers.com> To: Dominik Csapak <d.csapak@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com> Subject: Re: [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Date: Tue, 08 Jul 2025 17:08:11 +0200 Message-ID: <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com> On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote: > On 7/8/25 12:04, Adam Kalisz wrote: > > Hi Dominik, > > > > Hi, > > > this is a big improvement, I have done some performance > > measurements > > again: > > > > Ryzen: > > 4 worker threads: > > restore image complete (bytes=53687091200, duration=52.06s, > > speed=983.47MB/s) > > 8 worker threads: > > restore image complete (bytes=53687091200, duration=50.12s, > > speed=1021.56MB/s) > > > > 4 worker threads, 4 max-blocking: > > restore image complete (bytes=53687091200, duration=54.00s, > > speed=948.22MB/s) > > 8 worker threads, 4 max-blocking: > > restore image complete (bytes=53687091200, duration=50.43s, > > speed=1015.25MB/s) > > 8 worker threads, 4 max-blocking, 32 buffered futures: > > restore image complete (bytes=53687091200, duration=52.11s, > > speed=982.53MB/s) > > > > Xeon: > > 4 worker threads: > > restore image complete (bytes=10737418240, duration=3.06s, > > speed=3345.97MB/s) > > restore image complete (bytes=107374182400, duration=139.80s, > > speed=732.47MB/s) > > restore image complete (bytes=107374182400, duration=136.67s, > > speed=749.23MB/s) > > 8 worker threads: > > restore image complete (bytes=10737418240, duration=2.50s, > > speed=4095.30MB/s) > > restore image complete (bytes=107374182400, duration=127.14s, > > speed=805.42MB/s) > > restore image complete (bytes=107374182400, duration=121.39s, > > speed=843.59MB/s) > > just for my understanding: you left the parallel futures at 16 and > changed the threads in the tokio runtime? Yes, that's correct. > The biggest issue here is that we probably don't want to increase > that number by default by much, since on e.g. a running system this > will have an impact on other running VMs. Adjusting such a number > (especially in a way where it's now actually used in contrast to > before) will come as a surprise for many. > > That's IMHO the biggest challenge here, that's why I did not touch > the tokio runtime thread settings, to not increase the load too much. > > Did you by any chance observe the CPU usage during your tests? > As I wrote in my commit message, the cpu usage quadrupled > (proportional to the more chunks we could put through) when using 16 > fetching tasks. Yes, please see the mpstat attachments. The one with yesterday's date is the first patch, the two from today are you today's patch without changes and the second is with 8 worker threads. All of them use 16 buffered futures. 
> Just an additional note: With my solution,the blocking threads should > not have any impact at all, since the fetching should be purely > async (so no blocking code anywhere) and the writing is done in the > main thread/task in sequence so no blocking threads will be used > except one. I actually see a slight negative impact but that's based on a very few runs. > > On the Ryzen system I was hitting: > > With 8-way concurrency, 16 max-blocking threads: > > restore image complete (bytes=53687091200, avg fetch > > time=24.7342ms, > > avg time per nonzero write=1.6474ms, storage nonzero total write > > time=19.996s, duration=45.83s, speed=1117.15MB/s) > > -> this seemed to be the best setting for this system > > > > It seems the counting of zeroes works in some kind of steps (seen > > on the Xeon system with mostly incompressible data): > > > > yes, only whole zero chunks will be counted. > > [snip] > > > > Especially during a restore the speed is quite important if you > > need to hit Restore Time Objectives under SLAs. That's why we were > > targeting 1 GBps for incompressible data. > > I get that, but this will always be a tradeoff between CPU load and > throughput and we have to find a good middle ground here. Sure, I am not disputing that for the default configuration. > IMO with my current patch, we have a very good improvement already, > without increasing the (theoretical) load to the system. > > It could be OK from my POV to make the number of threads of the > runtime configurable e.g. via vzdump.conf. (That's a thing that's > easily explainable in the docs for admins) That would be great, because some users have other priorities depending on the operational situation. The nice thing about the ENV override in my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8 qmrestore ... they can change the priority of the restore ad-hoc e.g. if they really need to restore as quickly as possibly they can throw more threads at the problem within some reasonable bounds. In some cases they do a restore of a VM that isn't running and the resources it would be using on the system are available for use. Again, I agree the defaults should be conservative. > If someone else (@Fabian?) wants to chime in to this discussion, > I'd be glad. > > Also feedback on my code in general would be nice ;) > (There are probably better ways to make this concurrent in an > async context, e.g. maybe using 'async-channel' + fixed number > of tasks ?) Having more timing information about how long the fetches and writes of nonzero chunks take would be great for doing informed estimates about the effect of performance settings should there be any. It would also help with the benchmarking right now to see where we are saturated. Do I understand correctly that we are blocking a worker when writing a chunk to storage with the write_data_callback or write_zero_callback from fetching chunks? From what I gathered, we have to have a single thread that writes the VM image because otherwise we would have problems with concurrent access. We need to feed this thread as well as possible with a mix of zero chunks and non-zero chunks. The zero chunks are cheap because we generate them based on information we already have in memory. The non- zero chunks we have to fetch over the network (or from cache, which is still more expensive than zero chunks). 
If I understand correctly, if we fill the read queue with non-zero chunks and wait for them to become available we will not be writing any possible zero chunks that come after it to storage and our bottleneck storage writer thread will be idle and hungry for more chunks. My original solution basically wrote all the zero chunks first and then started working through the non-zero chunks. This split seems to mostly avoid the cost difference between zero and non-zero chunks keeping futures slots occupied. However I have not considered the memory consumption of the chunk_futures vector which might grow very big for multi TB VM backups. However the idea of knowing about the cheap filler zero chunks we might always write if we don't have any non-zero chunks available is perhaps not so bad, especially for NVMe systems. For harddrives I imagine the linear strategy might be faster because it avoids should in my imagination avoid some expensive seeks. Would it make sense to have a reasonable buffer of zero chunks ready for writing while we fetch non-zero chunks over the network? Is this thought process correct? Btw. I am not on the PBS list, so to avoid getting stuck in a queue there I am posting only to PVE devel. Adam [-- Attachment #2: Type: text/plain, Size: 160 bytes --] _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
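The ad-hoc override argued for above (PBS_RESTORE_CONCURRENCY=8 qmrestore ...) can be as small as the sketch below: read an optional environment variable, fall back to a conservative default, and use the value when building the tokio runtime and sizing the buffered stream. The variable names (PBS_RESTORE_THREADS, PBS_RESTORE_BUFFERED_CHUNKS) and defaults here are made up for illustration; they are not what either patch actually ships.

use tokio::runtime::{Builder, Runtime};

// read an optional usize from the environment, falling back to a conservative default
fn env_usize(name: &str, default: usize) -> usize {
    std::env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn build_restore_runtime() -> std::io::Result<Runtime> {
    // defaults stay conservative; an admin can raise them per restore, e.g.
    //   PBS_RESTORE_THREADS=8 qmrestore ...
    let worker_threads = env_usize("PBS_RESTORE_THREADS", 4);
    Builder::new_multi_thread()
        .worker_threads(worker_threads)
        .enable_all()
        .build()
}

fn buffered_chunks() -> usize {
    env_usize("PBS_RESTORE_BUFFERED_CHUNKS", 16)
}

Whether such a knob ends up in vzdump.conf, the environment, or both is exactly the open question in the thread.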
[parent not found: <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>]
* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel [not found] ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com> @ 2025-07-10 12:48 ` Dominik Csapak 2025-07-11 8:21 ` Dominik Csapak 0 siblings, 1 reply; 7+ messages in thread From: Dominik Csapak @ 2025-07-10 12:48 UTC (permalink / raw) To: Adam Kalisz, Proxmox VE development discussion On 7/8/25 17:08, Adam Kalisz wrote: > On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote: >> On 7/8/25 12:04, Adam Kalisz wrote: [snip] >>> >>> Especially during a restore the speed is quite important if you >>> need to hit Restore Time Objectives under SLAs. That's why we were >>> targeting 1 GBps for incompressible data. >> >> I get that, but this will always be a tradeoff between CPU load and >> throughput and we have to find a good middle ground here. > > Sure, I am not disputing that for the default configuration. > >> IMO with my current patch, we have a very good improvement already, >> without increasing the (theoretical) load to the system. >> >> It could be OK from my POV to make the number of threads of the >> runtime configurable e.g. via vzdump.conf. (That's a thing that's >> easily explainable in the docs for admins) > > That would be great, because some users have other priorities depending > on the operational situation. The nice thing about the ENV override in > my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8 > qmrestore ... they can change the priority of the restore ad-hoc e.g. > if they really need to restore as quickly as possibly they can throw > more threads at the problem within some reasonable bounds. In some > cases they do a restore of a VM that isn't running and the resources it > would be using on the system are available for use. > > Again, I agree the defaults should be conservative. Ok, I'll probably send a new version soon that will make the threads and futures configurable via env variables and leave the default to 16 futures with 4 threads (which seems to be a good start). > >> If someone else (@Fabian?) wants to chime in to this discussion, >> I'd be glad. >> >> Also feedback on my code in general would be nice ;) >> (There are probably better ways to make this concurrent in an >> async context, e.g. maybe using 'async-channel' + fixed number >> of tasks ?) > > Having more timing information about how long the fetches and writes of > nonzero chunks take would be great for doing informed estimates about > the effect of performance settings should there be any. It would also > help with the benchmarking right now to see where we are saturated. yes, some tracing information would be really nice. I'm currently a bit pressed for time, but will look into that. > > Do I understand correctly that we are blocking a worker when writing a > chunk to storage with the write_data_callback or write_zero_callback > from fetching chunks? yes. Technically we should call something like 'spawn_blocking' or 'block_in_place' so the executor won't get stuck, but in most cases it will work out just fine. > > From what I gathered, we have to have a single thread that writes the > VM image because otherwise we would have problems with concurrent > access. Currently it's that way because the default behavior for rust is to prevent void pointer from being used across threads (for good reason!) 
Though i think in this case, the callback is static, so we should be able to call it from multiple threads (assuming the qemu layer can handle that) For that I have to rewrite some parts of the qemu-side integration too, so that it does not modify the callback data in the c callback itself (which is not a problem) Then I can test if writing in multiple threads makes a difference. > We need to feed this thread as well as possible with a mix of zero > chunks and non-zero chunks. The zero chunks are cheap because we > generate them based on information we already have in memory. The non- > zero chunks we have to fetch over the network (or from cache, which is > still more expensive than zero chunks). If I understand correctly, if > we fill the read queue with non-zero chunks and wait for them to become > available we will not be writing any possible zero chunks that come > after it to storage and our bottleneck storage writer thread will be > idle and hungry for more chunks. > > My original solution basically wrote all the zero chunks first and then > started working through the non-zero chunks. This split seems to mostly > avoid the cost difference between zero and non-zero chunks keeping > futures slots occupied. However I have not considered the memory > consumption of the chunk_futures vector which might grow very big for > multi TB VM backups. However the idea of knowing about the cheap filler > zero chunks we might always write if we don't have any non-zero chunks > available is perhaps not so bad, especially for NVMe systems. For > harddrives I imagine the linear strategy might be faster because it > avoids should in my imagination avoid some expensive seeks. > Would it make sense to have a reasonable buffer of zero chunks ready > for writing while we fetch non-zero chunks over the network? > > Is this thought process correct? I think so. The issue with such optimizations is IMO that no matter for which case we'll optimize, some other cases will suffer for it. Since the systems out there are very diverse, I'd like to not optimize that at all, and simply write them in order we get them from the index. There are also other settings that play a role here, e.g. the target storage. If that is configured to be able to skip zeros, the zero writeback call will actually do nothing at all (e.g. 
when the target is a file storage with qcow2 files) Just for the record i also benchmarked a slower system here: 6x16 TiB spinners in raid-10 with nvme special devices over a 2.5 g link: current approach is ~61 MiB/s restore speed with my patch it's ~160MiB/s restore speed with not much increase in cpu time (both were under 30% of a single core) Also did perf stat for those to compare how much overhead the additional futures/async/await brings: first restore: 62,871.24 msec task-clock # 0.115 CPUs utilized 878,151 context-switches # 13.967 K/sec 28,205 cpu-migrations # 448.615 /sec 519,396 page-faults # 8.261 K/sec 277,239,999,474 cpu_core/cycles/ # 4.410 G/sec (89.20%) 190,782,860,504 cpu_atom/cycles/ # 3.035 G/sec (10.80%) 482,534,267,606 cpu_core/instructions/ # 7.675 G/sec (89.20%) 188,659,352,613 cpu_atom/instructions/ # 3.001 G/sec (10.80%) 46,913,925,346 cpu_core/branches/ # 746.191 M/sec (89.20%) 19,251,496,445 cpu_atom/branches/ # 306.205 M/sec (10.80%) 904,032,529 cpu_core/branch-misses/ # 14.379 M/sec (89.20%) 621,228,739 cpu_atom/branch-misses/ # 9.881 M/sec (10.80%) 1,633,142,624,469 cpu_core/slots/ # 25.976 G/sec (89.20%) 489,311,603,992 cpu_core/topdown-retiring/ # 29.7% Retiring (89.20%) 97,617,585,755 cpu_core/topdown-bad-spec/ # 5.9% Bad Speculation (89.20%) 317,074,236,582 cpu_core/topdown-fe-bound/ # 19.2% Frontend Bound (89.20%) 745,485,954,022 cpu_core/topdown-be-bound/ # 45.2% Backend Bound (89.20%) 57,463,995,650 cpu_core/topdown-heavy-ops/ # 3.5% Heavy Operations # 26.2% Light Operations (89.20%) 88,333,173,745 cpu_core/topdown-br-mispredict/ # 5.4% Branch Mispredict # 0.6% Machine Clears (89.20%) 217,424,427,912 cpu_core/topdown-fetch-lat/ # 13.2% Fetch Latency # 6.0% Fetch Bandwidth (89.20%) 354,250,103,398 cpu_core/topdown-mem-bound/ # 21.5% Memory Bound # 23.7% Core Bound (89.20%) 548.195368256 seconds time elapsed 44.493218000 seconds user 21.315124000 seconds sys second restore: 67,908.11 msec task-clock # 0.297 CPUs utilized 856,402 context-switches # 12.611 K/sec 46,539 cpu-migrations # 685.323 /sec 942,002 page-faults # 13.872 K/sec 300,757,558,837 cpu_core/cycles/ # 4.429 G/sec (75.93%) 234,595,451,063 cpu_atom/cycles/ # 3.455 G/sec (24.07%) 511,747,593,432 cpu_core/instructions/ # 7.536 G/sec (75.93%) 289,348,171,298 cpu_atom/instructions/ # 4.261 G/sec (24.07%) 49,993,266,992 cpu_core/branches/ # 736.190 M/sec (75.93%) 29,624,743,216 cpu_atom/branches/ # 436.248 M/sec (24.07%) 911,770,988 cpu_core/branch-misses/ # 13.427 M/sec (75.93%) 811,321,806 cpu_atom/branch-misses/ # 11.947 M/sec (24.07%) 1,788,660,631,633 cpu_core/slots/ # 26.339 G/sec (75.93%) 569,029,214,725 cpu_core/topdown-retiring/ # 31.4% Retiring (75.93%) 125,815,987,213 cpu_core/topdown-bad-spec/ # 6.9% Bad Speculation (75.93%) 234,249,755,030 cpu_core/topdown-fe-bound/ # 12.9% Frontend Bound (75.93%) 885,539,445,254 cpu_core/topdown-be-bound/ # 48.8% Backend Bound (75.93%) 86,825,030,719 cpu_core/topdown-heavy-ops/ # 4.8% Heavy Operations # 26.6% Light Operations (75.93%) 116,566,866,551 cpu_core/topdown-br-mispredict/ # 6.4% Branch Mispredict # 0.5% Machine Clears (75.93%) 135,276,276,904 cpu_core/topdown-fetch-lat/ # 7.5% Fetch Latency # 5.5% Fetch Bandwidth (75.93%) 409,898,741,185 cpu_core/topdown-mem-bound/ # 22.6% Memory Bound # 26.2% Core Bound (75.93%) 228.528573197 seconds time elapsed 48.379229000 seconds user 21.779166000 seconds sys so the overhead for the additional futures was ~8% in cycles, ~6% in instructions which does not seem too bad > > > Btw. 
I am not on the PBS list, so to avoid getting stuck in a queue > there I am posting only to PVE devel. > > Adam _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
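The spawn_blocking/block_in_place remark earlier in this mail refers to the fact that the write callbacks can block on slow target storage while running on an async worker thread. A minimal guard, with the opaque C callback reduced to a plain Rust closure for illustration, could look like this (block_in_place is only valid on the multi-threaded runtime, which the restore code already uses):

use anyhow::{bail, Error};

// block_in_place tells tokio that this worker thread is about to block,
// so other tasks get shifted to the remaining workers instead of stalling
fn write_chunk_blocking(
    offset: u64,
    data: &[u8],
    write_data_callback: &dyn Fn(u64, &[u8]) -> i32,
) -> Result<(), Error> {
    let res = tokio::task::block_in_place(|| write_data_callback(offset, data));
    if res < 0 {
        bail!("write_data_callback failed ({})", res);
    }
    Ok(())
}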
* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel 2025-07-10 12:48 ` Dominik Csapak @ 2025-07-11 8:21 ` Dominik Csapak 0 siblings, 0 replies; 7+ messages in thread From: Dominik Csapak @ 2025-07-11 8:21 UTC (permalink / raw) To: pve-devel On 7/10/25 14:48, Dominik Csapak wrote: [snip] > > Just for the record i also benchmarked a slower system here: > 6x16 TiB spinners in raid-10 with nvme special devices > over a 2.5 g link: > > current approach is ~61 MiB/s restore speed > with my patch it's ~160MiB/s restore speed with not much increase > in cpu time (both were under 30% of a single core) > > Also did perf stat for those to compare how much overhead the additional futures/async/await > brings: > > > first restore: > > 62,871.24 msec task-clock # 0.115 CPUs utilized > 878,151 context-switches # 13.967 K/sec > 28,205 cpu-migrations # 448.615 /sec > 519,396 page-faults # 8.261 K/sec > 277,239,999,474 cpu_core/cycles/ # 4.410 G/sec (89.20%) > 190,782,860,504 cpu_atom/cycles/ # 3.035 G/sec (10.80%) > 482,534,267,606 cpu_core/instructions/ # 7.675 G/sec (89.20%) > 188,659,352,613 cpu_atom/instructions/ # 3.001 G/sec (10.80%) > 46,913,925,346 cpu_core/branches/ # 746.191 M/sec (89.20%) > 19,251,496,445 cpu_atom/branches/ # 306.205 M/sec (10.80%) > 904,032,529 cpu_core/branch-misses/ # 14.379 M/sec (89.20%) > 621,228,739 cpu_atom/branch-misses/ # 9.881 M/sec (10.80%) > 1,633,142,624,469 cpu_core/slots/ # 25.976 G/sec (89.20%) > 489,311,603,992 cpu_core/topdown-retiring/ # 29.7% Retiring (89.20%) > 97,617,585,755 cpu_core/topdown-bad-spec/ # 5.9% Bad Speculation (89.20%) > 317,074,236,582 cpu_core/topdown-fe-bound/ # 19.2% Frontend Bound (89.20%) > 745,485,954,022 cpu_core/topdown-be-bound/ # 45.2% Backend Bound (89.20%) > 57,463,995,650 cpu_core/topdown-heavy-ops/ # 3.5% Heavy Operations # 26.2% > Light Operations (89.20%) > 88,333,173,745 cpu_core/topdown-br-mispredict/ # 5.4% Branch Mispredict # 0.6% > Machine Clears (89.20%) > 217,424,427,912 cpu_core/topdown-fetch-lat/ # 13.2% Fetch Latency # 6.0% > Fetch Bandwidth (89.20%) > 354,250,103,398 cpu_core/topdown-mem-bound/ # 21.5% Memory Bound # 23.7% > Core Bound (89.20%) > > > 548.195368256 seconds time elapsed > > > 44.493218000 seconds user > 21.315124000 seconds sys > > second restore: > > 67,908.11 msec task-clock # 0.297 CPUs utilized > 856,402 context-switches # 12.611 K/sec > 46,539 cpu-migrations # 685.323 /sec > 942,002 page-faults # 13.872 K/sec > 300,757,558,837 cpu_core/cycles/ # 4.429 G/sec (75.93%) > 234,595,451,063 cpu_atom/cycles/ # 3.455 G/sec (24.07%) > 511,747,593,432 cpu_core/instructions/ # 7.536 G/sec (75.93%) > 289,348,171,298 cpu_atom/instructions/ # 4.261 G/sec (24.07%) > 49,993,266,992 cpu_core/branches/ # 736.190 M/sec (75.93%) > 29,624,743,216 cpu_atom/branches/ # 436.248 M/sec (24.07%) > 911,770,988 cpu_core/branch-misses/ # 13.427 M/sec (75.93%) > 811,321,806 cpu_atom/branch-misses/ # 11.947 M/sec (24.07%) > 1,788,660,631,633 cpu_core/slots/ # 26.339 G/sec (75.93%) > 569,029,214,725 cpu_core/topdown-retiring/ # 31.4% Retiring (75.93%) > 125,815,987,213 cpu_core/topdown-bad-spec/ # 6.9% Bad Speculation (75.93%) > 234,249,755,030 cpu_core/topdown-fe-bound/ # 12.9% Frontend Bound (75.93%) > 885,539,445,254 cpu_core/topdown-be-bound/ # 48.8% Backend Bound (75.93%) > 86,825,030,719 cpu_core/topdown-heavy-ops/ # 4.8% Heavy Operations # 26.6% > Light Operations (75.93%) > 116,566,866,551 cpu_core/topdown-br-mispredict/ # 6.4% Branch Mispredict # 0.5% > Machine Clears (75.93%) 
> 135,276,276,904 cpu_core/topdown-fetch-lat/ # 7.5% Fetch Latency # 5.5% > Fetch Bandwidth (75.93%) > 409,898,741,185 cpu_core/topdown-mem-bound/ # 22.6% Memory Bound # 26.2% > Core Bound (75.93%) > > > 228.528573197 seconds time elapsed > > > 48.379229000 seconds user > 21.779166000 seconds sys > > > so the overhead for the additional futures was ~8% in cycles, ~6% in instructions > which does not seem too bad > addendum: the tests above did sadly run into a network limit of ~600MBit/s (still trying to figure out where the bottleneck in the network is...) tested again from a different machine that has a 10G link to the pbs mentioned above. This time i restored to the 'null-co' driver from qemu since the target storage was too slow.... anyways, the results are: current code: restore ~75MiB/s 16 way parallel: ~528MiB/s (7x !) cpu usage went up from <50% of one core to ~350% (like in my initial tests with a different setup) perf stat output below: current: 183,534.85 msec task-clock # 0.409 CPUs utilized 117,267 context-switches # 638.936 /sec 700 cpu-migrations # 3.814 /sec 462,432 page-faults # 2.520 K/sec 468,609,612,840 cycles # 2.553 GHz 1,286,188,699,253 instructions # 2.74 insn per cycle 41,342,312,275 branches # 225.256 M/sec 846,432,249 branch-misses # 2.05% of all branches 448.965517535 seconds time elapsed 152.007611000 seconds user 32.189942000 seconds sys 16 way parallel: 228,583.26 msec task-clock # 3.545 CPUs utilized 114,575 context-switches # 501.240 /sec 6,028 cpu-migrations # 26.371 /sec 1,561,179 page-faults # 6.830 K/sec 510,861,534,387 cycles # 2.235 GHz 1,296,819,542,686 instructions # 2.54 insn per cycle 43,202,234,699 branches # 189.000 M/sec 828,196,795 branch-misses # 1.92% of all branches 64.482868654 seconds time elapsed 184.172759000 seconds user 44.560342000 seconds sys so still about ~8% more cycles, about the same amount of instructions but in much less time _______________________________________________ pve-devel mailing list pve-devel@lists.proxmox.com https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads: [~2025-07-11 8:21 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20250708084900.1068146-1-d.csapak@proxmox.com>
2025-07-08  8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak
2025-07-08  9:03   ` Thomas Lamprecht
2025-07-08 10:04   ` [pve-devel] " Adam Kalisz via pve-devel
     [not found]     ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
2025-07-08 10:58       ` Dominik Csapak
2025-07-08 15:08         ` Adam Kalisz via pve-devel
     [not found]           ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>
2025-07-10 12:48             ` Dominik Csapak
2025-07-11  8:21               ` Dominik Csapak