public inbox for pve-devel@lists.proxmox.com
* Re: [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
       [not found] <20250708084900.1068146-1-d.csapak@proxmox.com>
@ 2025-07-08  8:52 ` Dominik Csapak
  2025-07-08  9:03   ` Thomas Lamprecht
  2025-07-08 10:04 ` [pve-devel] " Adam Kalisz via pve-devel
       [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
  2 siblings, 1 reply; 7+ messages in thread
From: Dominik Csapak @ 2025-07-08  8:52 UTC (permalink / raw)
  To: pbs-devel, Proxmox VE development discussion

Sorry, I sent the patch to the wrong list; it was originally meant for
pve-devel instead. Should I resend it there?

On 7/8/25 10:49, Dominik Csapak wrote:
> by using async futures to load chunks and stream::buffer_unordered to
> buffer up to 16 of them, depending on write/load speed. Use tokio's task
> spawn to make sure they continue to run in the background, since
> buffer_unordered starts them, but does not poll them to completion
> unless we're awaiting.
> 
> With this, we don't need to increase the number of threads in the
> runtime to trigger parallel reads and network traffic to us. This way
> it's only limited by CPU if decoding and/or decrypting is the
> bottleneck.
> 
> I measured restoring a VM backup with a 60GiB disk (filled with ~42GiB
> data) and fast storage over a local network link (from PBS VM to the
> host). I let it run 3 times, but the variance was not that big, so here's some
> representative log output with various MAX_BUFFERED_FUTURES values.
> 
> benchmark   duration        speed   cpu percentage
> current      107.18s   573.25MB/s           < 100%
> 4:            44.74s  1373.34MB/s           ~ 180%
> 8:            32.30s  1902.42MB/s           ~ 290%
> 16:           25.75s  2386.44MB/s           ~ 360%
> 
> I saw an increase in CPU usage proportional to the speed increase, so
> while in the current version it uses less than a single core total,
> using 16 parallel futures resulted in 3-4 of the available threads of
> the tokio runtime being utilized.
> 
> In general I'd like to limit the buffering somehow, but I don't think
> there is a good automatic metric we can use, and giving the admin a knob
> whose actual ramifications are hard to explain is also not good, so I
> settled for a value that showed improvement but does not seem too high.
> 
> In any case, if the target and/or source storage is too slow, there will
> be back/forward pressure, and this change should only matter for storage
> systems where IO depth plays a role and that are fast enough.
> 
> The way we count the finished chunks also changes a bit: since they
> can complete out of order, we can't rely on the index position to calculate
> the percentage.
> 
> This patch is loosely based on the patch from Adam Kalisz[0], but removes
> the need to increase the blocking threads and uses the (actually always
> used) underlying async implementation for reading remote chunks.
> 
> 0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com>
> ---
> changes from RFC v1:
> * uses tokio task spawn to actually run the fetching in the background
> * redo the counting for the task output (pos was unordered so we got
>    weird ordering sometimes)
> 
> When actually running the fetching in the background the speed increase
> is much higher than just using buffer_unordered for the fetching
> futures, which is nice (although the CPU usage is much higher now).
> 
> Since the benchmark was much faster with higher values, I used a
> different, bigger VM this time around so the timings are more consistent
> and it makes sure the disk does not fit in the PBS's memory.
> 
> The question of what count we should use remains, though...
> 
>   src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 47 insertions(+), 16 deletions(-)
> 
> diff --git a/src/restore.rs b/src/restore.rs
> index 5a5a398..4e6c538 100644
> --- a/src/restore.rs
> +++ b/src/restore.rs
> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>   use std::sync::{Arc, Mutex};
>   
>   use anyhow::{bail, format_err, Error};
> +use futures::StreamExt;
>   use once_cell::sync::OnceCell;
>   use tokio::runtime::Runtime;
>   
> @@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
>   use pbs_datastore::data_blob::DataChunkBuilder;
>   use pbs_datastore::fixed_index::FixedIndexReader;
>   use pbs_datastore::index::IndexFile;
> -use pbs_datastore::read_chunk::ReadChunk;
> +use pbs_datastore::read_chunk::AsyncReadChunk;
>   use pbs_datastore::BackupManifest;
>   use pbs_key_config::load_and_decrypt_key;
>   use pbs_tools::crypt_config::CryptConfig;
> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>       archive_size: u64,
>   }
>   
> +// use this many buffered futures to make loading of chunks more concurrent
> +const MAX_BUFFERED_FUTURES: usize = 16;
> +
>   pub(crate) struct RestoreTask {
>       setup: BackupSetup,
>       runtime: Arc<Runtime>,
> @@ -165,26 +169,53 @@ impl RestoreTask {
>   
>           let start_time = std::time::Instant::now();
>   
> -        for pos in 0..index.index_count() {
> -            let digest = index.index_digest(pos).unwrap();
> +        let read_queue = (0..index.index_count()).map(|pos| {
> +            let digest = *index.index_digest(pos).unwrap();
>               let offset = (pos * index.chunk_size) as u64;
> -            if digest == &zero_chunk_digest {
> -                let res = write_zero_callback(offset, index.chunk_size as u64);
> -                if res < 0 {
> -                    bail!("write_zero_callback failed ({})", res);
> +            let chunk_reader = chunk_reader.clone();
> +            async move {
> +                let chunk = if digest == zero_chunk_digest {
> +                    None
> +                } else {
> +                    let raw_data = tokio::task::spawn(async move {
> +                        AsyncReadChunk::read_chunk(&chunk_reader, &digest).await
> +                    })
> +                    .await??;
> +                    Some(raw_data)
> +                };
> +
> +                Ok::<_, Error>((chunk, offset))
> +            }
> +        });
> +
> +        // this buffers futures and pre-fetches some chunks for us
> +        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
> +
> +        let mut count = 0;
> +        while let Some(res) = stream.next().await {
> +            let res = res?;
> +            match res {
> +                (None, offset) => {
> +                    let res = write_zero_callback(offset, index.chunk_size as u64);
> +                    if res < 0 {
> +                        bail!("write_zero_callback failed ({})", res);
> +                    }
> +                    bytes += index.chunk_size;
> +                    zeroes += index.chunk_size;
>                   }
> -                bytes += index.chunk_size;
> -                zeroes += index.chunk_size;
> -            } else {
> -                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
> -                let res = write_data_callback(offset, &raw_data);
> -                if res < 0 {
> -                    bail!("write_data_callback failed ({})", res);
> +                (Some(raw_data), offset) => {
> +                    let res = write_data_callback(offset, &raw_data);
> +                    if res < 0 {
> +                        bail!("write_data_callback failed ({})", res);
> +                    }
> +                    bytes += raw_data.len();
>                   }
> -                bytes += raw_data.len();
>               }
> +
> +            count += 1;
> +
>               if verbose {
> -                let next_per = ((pos + 1) * 100) / index.index_count();
> +                let next_per = (count * 100) / index.index_count();
>                   if per != next_per {
>                       eprintln!(
>                           "progress {}% (read {} bytes, zeroes = {}% ({} bytes), duration {} sec)",



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
  2025-07-08  8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak
@ 2025-07-08  9:03   ` Thomas Lamprecht
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Lamprecht @ 2025-07-08  9:03 UTC (permalink / raw)
  To: Proxmox Backup Server development discussion,
	Proxmox VE development discussion, Dominik Csapak

Am 08.07.25 um 10:52 schrieb Dominik Csapak:
> Sorry, i sent the patch to the wrong list, this was originally meant for
> pve-devel instead. Should i resend it there?

You can do both, as this is relevant for both projects; lists can
be cumulative, not just either/or.


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
       [not found] <20250708084900.1068146-1-d.csapak@proxmox.com>
  2025-07-08  8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak
@ 2025-07-08 10:04 ` Adam Kalisz via pve-devel
       [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
  2 siblings, 0 replies; 7+ messages in thread
From: Adam Kalisz via pve-devel @ 2025-07-08 10:04 UTC (permalink / raw)
  To: Dominik Csapak, Proxmox VE development discussion; +Cc: Adam Kalisz

[-- Attachment #1: Type: message/rfc822, Size: 23941 bytes --]

From: Adam Kalisz <adam.kalisz@notnullmakers.com>
To: Dominik Csapak <d.csapak@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Date: Tue, 08 Jul 2025 12:04:56 +0200
Message-ID: <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>

Hi Dominik,

this is a big improvement. I have done some performance measurements
again:

Ryzen:
4 worker threads:
restore image complete (bytes=53687091200, duration=52.06s,
speed=983.47MB/s)
8 worker threads:
restore image complete (bytes=53687091200, duration=50.12s,
speed=1021.56MB/s)

4 worker threads, 4 max-blocking:
restore image complete (bytes=53687091200, duration=54.00s,
speed=948.22MB/s)
8 worker threads, 4 max-blocking:
restore image complete (bytes=53687091200, duration=50.43s,
speed=1015.25MB/s)
8 worker threads, 4 max-blocking, 32 buffered futures:
restore image complete (bytes=53687091200, duration=52.11s,
speed=982.53MB/s)

Xeon:
4 worker threads:
restore image complete (bytes=10737418240, duration=3.06s,
speed=3345.97MB/s)
restore image complete (bytes=107374182400, duration=139.80s,
speed=732.47MB/s)
restore image complete (bytes=107374182400, duration=136.67s,
speed=749.23MB/s)
8 worker threads:
restore image complete (bytes=10737418240, duration=2.50s,
speed=4095.30MB/s)
restore image complete (bytes=107374182400, duration=127.14s,
speed=805.42MB/s)
restore image complete (bytes=107374182400, duration=121.39s,
speed=843.59MB/s)

Much better but it would need to be 25% faster on this older system to
hit the numbers I have already seen with my solution:

For comparison, with my solution on the same Xeon system I was hitting:
With 8-way concurrency, 16 max-blocking threads:
restore image complete (bytes=10737418240, avg fetch time=16.7572ms,
avg time per nonzero write=1.9310ms, storage nonzero total write
time=1.580s, duration=2.25s, speed=4551.25MB/s)
restore image complete (bytes=107374182400, avg fetch time=29.1714ms,
avg time per nonzero write=2.2216ms, storage nonzero total write
time=55.739s, duration=106.17s, speed=964.52MB/s)
restore image complete (bytes=107374182400, avg fetch time=28.2543ms,
avg time per nonzero write=2.1473ms, storage nonzero total write
time=54.139s, duration=103.52s, speed=989.18MB/s)

With 16-way concurrency, 32 max-blocking threads:
restore image complete (bytes=10737418240, avg fetch time=25.3444ms,
avg time per nonzero write=2.0709ms, storage nonzero total write
time=1.694s, duration=2.02s, speed=5074.13MB/s)
restore image complete (bytes=107374182400, avg fetch time=53.3046ms,
avg time per nonzero write=2.6692ms, storage nonzero total write
time=66.969s, duration=106.65s, speed=960.13MB/s)
restore image complete (bytes=107374182400, avg fetch time=47.3909ms,
avg time per nonzero write=2.6352ms, storage nonzero total write
time=66.440s, duration=98.09s, speed=1043.95MB/s)
-> this seemed to be the best setting for this system

On the Ryzen system I was hitting:
With 8-way concurrency, 16 max-blocking threads:
restore image complete (bytes=53687091200, avg fetch time=24.7342ms,
avg time per nonzero write=1.6474ms, storage nonzero total write
time=19.996s, duration=45.83s, speed=1117.15MB/s)
-> this seemed to be the best setting for this system

It seems the counting of zeroes works in some kind of steps (seen on
the Xeon system with mostly incompressible data):

download and verify backup index
progress 1% (read 1073741824 bytes, zeroes = 0% (0 bytes), duration 1
sec)
progress 2% (read 2147483648 bytes, zeroes = 0% (0 bytes), duration 2
sec)
progress 3% (read 3221225472 bytes, zeroes = 0% (0 bytes), duration 3
sec)
progress 4% (read 4294967296 bytes, zeroes = 0% (0 bytes), duration 5
sec)
progress 5% (read 5368709120 bytes, zeroes = 0% (0 bytes), duration 6
sec)
progress 6% (read 6442450944 bytes, zeroes = 0% (0 bytes), duration 7
sec)
progress 7% (read 7516192768 bytes, zeroes = 0% (0 bytes), duration 8
sec)
progress 8% (read 8589934592 bytes, zeroes = 0% (0 bytes), duration 10
sec)
progress 9% (read 9663676416 bytes, zeroes = 0% (0 bytes), duration 11
sec)
progress 10% (read 10737418240 bytes, zeroes = 0% (0 bytes), duration
12 sec)
progress 11% (read 11811160064 bytes, zeroes = 0% (0 bytes), duration
14 sec)
progress 12% (read 12884901888 bytes, zeroes = 0% (0 bytes), duration
15 sec)
progress 13% (read 13958643712 bytes, zeroes = 0% (0 bytes), duration
16 sec)
progress 14% (read 15032385536 bytes, zeroes = 0% (0 bytes), duration
18 sec)
progress 15% (read 16106127360 bytes, zeroes = 0% (0 bytes), duration
19 sec)
progress 16% (read 17179869184 bytes, zeroes = 0% (0 bytes), duration
20 sec)
progress 17% (read 18253611008 bytes, zeroes = 0% (0 bytes), duration
21 sec)
progress 18% (read 19327352832 bytes, zeroes = 0% (0 bytes), duration
23 sec)
progress 19% (read 20401094656 bytes, zeroes = 0% (0 bytes), duration
24 sec)
progress 20% (read 21474836480 bytes, zeroes = 0% (0 bytes), duration
25 sec)
progress 21% (read 22548578304 bytes, zeroes = 0% (0 bytes), duration
27 sec)
progress 22% (read 23622320128 bytes, zeroes = 0% (0 bytes), duration
28 sec)
progress 23% (read 24696061952 bytes, zeroes = 0% (0 bytes), duration
29 sec)
progress 24% (read 25769803776 bytes, zeroes = 0% (0 bytes), duration
31 sec)
progress 25% (read 26843545600 bytes, zeroes = 1% (515899392 bytes),
duration 31 sec)
progress 26% (read 27917287424 bytes, zeroes = 1% (515899392 bytes),
duration 33 sec)

Especially during a restore the speed is quite important if you need to
hit Restore Time Objectives under SLAs. That's why we were targeting 1
GBps for incompressible data.

Thank you
Adam

On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote:
> by using async futures to load chunks and stream::buffer_unordered to
> buffer up to 16 of them, depending on write/load speed. Use tokio's
> task
> spawn to make sure they continue to run in the background, since
> buffer_unordered starts them, but does not poll them to completion
> unless we're awaiting.
> 
> With this, we don't need to increase the number of threads in the
> runtime to trigger parallel reads and network traffic to us. This way
> it's only limited by CPU if decoding and/or decrypting is the
> bottleneck.
> 
> I measured restoring a VM backup with a 60GiB disk (filled with
> ~42GiB
> data) and fast storage over a local network link (from PBS VM to the
> host). I let it run 3 times, but the variance was not that big, so here's
> some
> representative log output with various MAX_BUFFERED_FUTURES values.
> 
> benchmark   duration        speed   cpu percentage
> current      107.18s   573.25MB/s           < 100%
> 4:            44.74s  1373.34MB/s           ~ 180%
> 8:            32.30s  1902.42MB/s           ~ 290%
> 16:           25.75s  2386.44MB/s           ~ 360%
> 
> I saw an increase in CPU usage proportional to the speed increase, so
> while in the current version it uses less than a single core total,
> using 16 parallel futures resulted in 3-4 available threads of the
> tokio runtime to be utilized.
> 
> In general I'd like to limit the buffering somehow, but I don't think
> there is a good automatic metric we can use, and giving the admin a
> knob
> that is hard to explain what the actual ramifications about it are is
> also not good, so I settled for a value that showed improvement but
> does
> not seem too high.
> 
> In any case, if the target and/or source storage is too slow, there
> will
> be back/forward pressure, and this change should only matter for
> storage
> systems where IO depth plays a role and that are fast enough.
> 
> The way we count the finished chunks also changes a bit, since they
> can come unordered, so we can't rely on the index position to
> calculate
> the percentage.
> 
> This patch is loosely based on the patch from Adam Kalisz[0], but
> removes
> the need to increase the blocking threads and uses the (actually
> always
> used) underlying async implementation for reading remote chunks.
> 
> 0:
> https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
> 
> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com>
> ---
> changes from RFC v1:
> * uses tokio task spawn to actually run the fetching in the
> background
> * redo the counting for the task output (pos was unordered so we got
>   weird ordering sometimes)
> 
> When actually running the fetching in the background the speed
> increase
> is much higher than just using buffer_unordered for the fetching
> futures, which is nice (although the CPU usage is much higher now).
> 
> Since the benchmark was much faster with higher values, I used a
> different bigger VM this time around so the timings are more
> consistent
> and it makes sure the disk does not fit in the PBS's memory.
> 
> The question what count we should use remains though...
> 
>  src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++-----------
> --
>  1 file changed, 47 insertions(+), 16 deletions(-)
> 
> diff --git a/src/restore.rs b/src/restore.rs
> index 5a5a398..4e6c538 100644
> --- a/src/restore.rs
> +++ b/src/restore.rs
> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>  use std::sync::{Arc, Mutex};
>  
>  use anyhow::{bail, format_err, Error};
> +use futures::StreamExt;
>  use once_cell::sync::OnceCell;
>  use tokio::runtime::Runtime;
>  
> @@ -13,7 +14,7 @@ use
> pbs_datastore::cached_chunk_reader::CachedChunkReader;
>  use pbs_datastore::data_blob::DataChunkBuilder;
>  use pbs_datastore::fixed_index::FixedIndexReader;
>  use pbs_datastore::index::IndexFile;
> -use pbs_datastore::read_chunk::ReadChunk;
> +use pbs_datastore::read_chunk::AsyncReadChunk;
>  use pbs_datastore::BackupManifest;
>  use pbs_key_config::load_and_decrypt_key;
>  use pbs_tools::crypt_config::CryptConfig;
> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>      archive_size: u64,
>  }
>  
> +// use this many buffered futures to make loading of chunks more
> concurrent
> +const MAX_BUFFERED_FUTURES: usize = 16;
> +
>  pub(crate) struct RestoreTask {
>      setup: BackupSetup,
>      runtime: Arc<Runtime>,
> @@ -165,26 +169,53 @@ impl RestoreTask {
>  
>          let start_time = std::time::Instant::now();
>  
> -        for pos in 0..index.index_count() {
> -            let digest = index.index_digest(pos).unwrap();
> +        let read_queue = (0..index.index_count()).map(|pos| {
> +            let digest = *index.index_digest(pos).unwrap();
>              let offset = (pos * index.chunk_size) as u64;
> -            if digest == &zero_chunk_digest {
> -                let res = write_zero_callback(offset,
> index.chunk_size as u64);
> -                if res < 0 {
> -                    bail!("write_zero_callback failed ({})", res);
> +            let chunk_reader = chunk_reader.clone();
> +            async move {
> +                let chunk = if digest == zero_chunk_digest {
> +                    None
> +                } else {
> +                    let raw_data = tokio::task::spawn(async move {
> +                        AsyncReadChunk::read_chunk(&chunk_reader,
> &digest).await
> +                    })
> +                    .await??;
> +                    Some(raw_data)
> +                };
> +
> +                Ok::<_, Error>((chunk, offset))
> +            }
> +        });
> +
> +        // this buffers futures and pre-fetches some chunks for us
> +        let mut stream =
> futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTUR
> ES);
> +
> +        let mut count = 0;
> +        while let Some(res) = stream.next().await {
> +            let res = res?;
> +            match res {
> +                (None, offset) => {
> +                    let res = write_zero_callback(offset,
> index.chunk_size as u64);
> +                    if res < 0 {
> +                        bail!("write_zero_callback failed ({})",
> res);
> +                    }
> +                    bytes += index.chunk_size;
> +                    zeroes += index.chunk_size;
>                  }
> -                bytes += index.chunk_size;
> -                zeroes += index.chunk_size;
> -            } else {
> -                let raw_data = ReadChunk::read_chunk(&chunk_reader,
> digest)?;
> -                let res = write_data_callback(offset, &raw_data);
> -                if res < 0 {
> -                    bail!("write_data_callback failed ({})", res);
> +                (Some(raw_data), offset) => {
> +                    let res = write_data_callback(offset,
> &raw_data);
> +                    if res < 0 {
> +                        bail!("write_data_callback failed ({})",
> res);
> +                    }
> +                    bytes += raw_data.len();
>                  }
> -                bytes += raw_data.len();
>              }
> +
> +            count += 1;
> +
>              if verbose {
> -                let next_per = ((pos + 1) * 100) /
> index.index_count();
> +                let next_per = (count * 100) / index.index_count();
>                  if per != next_per {
>                      eprintln!(
>                          "progress {}% (read {} bytes, zeroes = {}%
> ({} bytes), duration {} sec)",


[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
       [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
@ 2025-07-08 10:58   ` Dominik Csapak
  2025-07-08 15:08     ` Adam Kalisz via pve-devel
       [not found]     ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>
  0 siblings, 2 replies; 7+ messages in thread
From: Dominik Csapak @ 2025-07-08 10:58 UTC (permalink / raw)
  To: Adam Kalisz, Proxmox VE development discussion,
	Proxmox Backup Server development discussion

On 7/8/25 12:04, Adam Kalisz wrote:
> Hi Dominik,
> 

Hi,

> this is a big improvement, I have done some performance measurements
> again:
> 
> Ryzen:
> 4 worker threads:
> restore image complete (bytes=53687091200, duration=52.06s,
> speed=983.47MB/s)
> 8 worker threads:
> restore image complete (bytes=53687091200, duration=50.12s,
> speed=1021.56MB/s)
> 
> 4 worker threads, 4 max-blocking:
> restore image complete (bytes=53687091200, duration=54.00s,
> speed=948.22MB/s)
> 8 worker threads, 4 max-blocking:
> restore image complete (bytes=53687091200, duration=50.43s,
> speed=1015.25MB/s)
> 8 worker threads, 4 max-blocking, 32 buffered futures:
> restore image complete (bytes=53687091200, duration=52.11s,
> speed=982.53MB/s)
> 
> Xeon:
> 4 worker threads:
> restore image complete (bytes=10737418240, duration=3.06s,
> speed=3345.97MB/s)
> restore image complete (bytes=107374182400, duration=139.80s,
> speed=732.47MB/s)
> restore image complete (bytes=107374182400, duration=136.67s,
> speed=749.23MB/s)
> 8 worker threads:
> restore image complete (bytes=10737418240, duration=2.50s,
> speed=4095.30MB/s)
> restore image complete (bytes=107374182400, duration=127.14s,
> speed=805.42MB/s)
> restore image complete (bytes=107374182400, duration=121.39s,
> speed=843.59MB/s)

just for my understanding: you left the parallel futures at 16 and changed
the threads in the tokio runtime?

The biggest issue here is that we probably don't want to increase that
number by default by much since, e.g. on a running system, this will
have an impact on other running VMs. Adjusting such a number
(especially in a way where it's now actually used in contrast to
before) will come as a surprise for many.

That's IMHO the biggest challenge here; that's why I did not touch
the tokio runtime thread settings, so as not to increase the load too much.

Did you by any chance observe the CPU usage during your tests?
As I wrote in my commit message, the CPU usage quadrupled (in proportion
to the additional chunks we could put through) when using 16 fetching
tasks.

Just an additional note: with my solution, the blocking threads should
not have any impact at all, since the fetching should be purely
async (so no blocking code anywhere) and the writing is done in the
main thread/task in sequence so no blocking threads will be used
except one.

> 
> Much better but it would need to be 25% faster on this older system to
> hit the numbers I have already seen with my solution:
> 
> For comparison, with my solution on the same Xeon system I was hitting:
> With 8-way concurrency, 16 max-blocking threads:
> restore image complete (bytes=10737418240, avg fetch time=16.7572ms,
> avg time per nonzero write=1.9310ms, storage nonzero total write
> time=1.580s, duration=2.25s, speed=4551.25MB/s)
> restore image complete (bytes=107374182400, avg fetch time=29.1714ms,
> avg time per nonzero write=2.2216ms, storage nonzero total write
> time=55.739s, duration=106.17s, speed=964.52MB/s)
> restore image complete (bytes=107374182400, avg fetch time=28.2543ms,
> avg time per nonzero write=2.1473ms, storage nonzero total write
> time=54.139s, duration=103.52s, speed=989.18MB/s)
> 
> With 16-way concurrency, 32 max-blocking threads:
> restore image complete (bytes=10737418240, avg fetch time=25.3444ms,
> avg time per nonzero write=2.0709ms, storage nonzero total write
> time=1.694s, duration=2.02s, speed=5074.13MB/s)
> restore image complete (bytes=107374182400, avg fetch time=53.3046ms,
> avg time per nonzero write=2.6692ms, storage nonzero total write
> time=66.969s, duration=106.65s, speed=960.13MB/s)
> restore image complete (bytes=107374182400, avg fetch time=47.3909ms,
> avg time per nonzero write=2.6352ms, storage nonzero total write
> time=66.440s, duration=98.09s, speed=1043.95MB/s)
> -> this seemed to be the best setting for this system
> 
> On the Ryzen system I was hitting:
> With 8-way concurrency, 16 max-blocking threads:
> restore image complete (bytes=53687091200, avg fetch time=24.7342ms,
> avg time per nonzero write=1.6474ms, storage nonzero total write
> time=19.996s, duration=45.83s, speed=1117.15MB/s)
> -> this seemed to be the best setting for this system
> 
> It seems the counting of zeroes works in some kind of steps (seen on
> the Xeon system with mostly incompressible data):
> 

yes, only whole zero chunks will be counted.

[snip]
> 
> Especially during a restore the speed is quite important if you need to
> hit Restore Time Objectives under SLAs. That's why we were targeting 1
> GBps for incompressible data.

I get that, but this will always be a tradeoff between CPU load and
throughput and we have to find a good middle ground here.

IMO with my current patch, we have a very good improvement already,
without increasing the (theoretical) load to the system.

It could be OK from my POV to make the number of threads of the runtime
configurable e.g. via vzdump.conf. (That's a thing that's easily
explainable in the docs for admins)

If someone else (@Fabian?) wants to chime in to this discussion,
I'd be glad.

Also feedback on my code in general would be nice ;)
(There are probably better ways to make this concurrent in an
async context, e.g. maybe using 'async-channel' + fixed number
of tasks ?)
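
For illustration, here is a minimal, self-contained sketch of what that
'async-channel' + fixed-number-of-tasks variant could look like.
fetch_chunk and NUM_FETCH_TASKS are placeholders, not anything from
restore.rs, and results still arrive in completion order, just like with
buffer_unordered:

use anyhow::Error;

const NUM_FETCH_TASKS: usize = 4;

async fn fetch_chunk(offset: u64) -> Result<Vec<u8>, Error> {
    // stand-in for AsyncReadChunk::read_chunk(&chunk_reader, &digest)
    Ok(vec![0u8; 4 * 1024 * 1024])
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    // work queue with offsets, result queue with fetched data
    let (work_tx, work_rx) = async_channel::bounded::<u64>(NUM_FETCH_TASKS * 2);
    let (done_tx, done_rx) =
        async_channel::bounded::<(u64, Result<Vec<u8>, Error>)>(NUM_FETCH_TASKS * 2);

    // fixed pool of fetch tasks, each pulling offsets from the shared channel
    for _ in 0..NUM_FETCH_TASKS {
        let work_rx = work_rx.clone();
        let done_tx = done_tx.clone();
        tokio::spawn(async move {
            while let Ok(offset) = work_rx.recv().await {
                let res = fetch_chunk(offset).await;
                if done_tx.send((offset, res)).await.is_err() {
                    break; // writer side went away
                }
            }
        });
    }
    drop(done_tx); // keep only the clones held by the workers

    // feed the work queue from its own task so the bounded channel can't deadlock
    tokio::spawn(async move {
        for pos in 0..16u64 {
            if work_tx.send(pos * 4 * 1024 * 1024).await.is_err() {
                break;
            }
        }
        // dropping work_tx closes the channel and lets the workers exit
    });

    // single writer loop; chunks arrive as they complete
    while let Ok((offset, res)) = done_rx.recv().await {
        let data = res?;
        println!("would write {} bytes at offset {}", data.len(), offset);
    }
    Ok(())
}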

Thanks
Dominik

> 
> Thank you
> Adam
> 
> On Tue, 2025-07-08 at 10:49 +0200, Dominik Csapak wrote:
>> by using async futures to load chunks and stream::buffer_unordered to
>> buffer up to 16 of them, depending on write/load speed. Use tokio's
>> task
>> spawn to make sure they continue to run in the background, since
>> buffer_unordered starts them, but does not poll them to completion
>> unless we're awaiting.
>>
>> With this, we don't need to increase the number of threads in the
>> runtime to trigger parallel reads and network traffic to us. This way
>> it's only limited by CPU if decoding and/or decrypting is the
>> bottleneck.
>>
>> I measured restoring a VM backup with a 60GiB disk (filled with
>> ~42GiB
>> data) and fast storage over a local network link (from PBS VM to the
>> host). I let it run 3 times, but the variance was not that big, so here's
>> some
>> representative log output with various MAX_BUFFERED_FUTURES values.
>>
>> benchmark   duration        speed   cpu percentage
>> current      107.18s   573.25MB/s           < 100%
>> 4:            44.74s  1373.34MB/s           ~ 180%
>> 8:            32.30s  1902.42MB/s           ~ 290%
>> 16:           25.75s  2386.44MB/s           ~ 360%
>>
>> I saw an increase in CPU usage proportional to the speed increase, so
>> while in the current version it uses less than a single core total,
>> using 16 parallel futures resulted in 3-4 available threads of the
>> tokio runtime to be utilized.
>>
>> In general I'd like to limit the buffering somehow, but I don't think
>> there is a good automatic metric we can use, and giving the admin a
>> knob
>> that is hard to explain what the actual ramifications about it are is
>> also not good, so I settled for a value that showed improvement but
>> does
>> not seem too high.
>>
>> In any case, if the target and/or source storage is too slow, there
>> will
>> be back/forward pressure, and this change should only matter for
>> storage
>> systems where IO depth plays a role and that are fast enough.
>>
>> The way we count the finished chunks also changes a bit, since they
>> can come unordered, so we can't rely on the index position to
>> calculate
>> the percentage.
>>
>> This patch is loosely based on the patch from Adam Kalisz[0], but
>> removes
>> the need to increase the blocking threads and uses the (actually
>> always
>> used) underlying async implementation for reading remote chunks.
>>
>> 0:
>> https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-devel@lists.proxmox.com/
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> Based-on-patch-by: Adam Kalisz <adam.kalisz@notnullmakers.com>
>> ---
>> changes from RFC v1:
>> * uses tokio task spawn to actually run the fetching in the
>> background
>> * redo the counting for the task output (pos was unordered so we got
>>    weird ordering sometimes)
>>
>> When actually running the fetching in the background the speed
>> increase
>> is much higher than just using buffer_unordered for the fetching
>> futures, which is nice (although the CPU usage is much higher now).
>>
>> Since the benchmark was much faster with higher values, I used a
>> different bigger VM this time around so the timings are more
>> consistent
>> and it makes sure the disk does not fit in the PBS's memory.
>>
>> The question what count we should use remains though...
>>
>>   src/restore.rs | 63 +++++++++++++++++++++++++++++++++++++-----------
>> --
>>   1 file changed, 47 insertions(+), 16 deletions(-)
>>
>> diff --git a/src/restore.rs b/src/restore.rs
>> index 5a5a398..4e6c538 100644
>> --- a/src/restore.rs
>> +++ b/src/restore.rs
>> @@ -2,6 +2,7 @@ use std::convert::TryInto;
>>   use std::sync::{Arc, Mutex};
>>   
>>   use anyhow::{bail, format_err, Error};
>> +use futures::StreamExt;
>>   use once_cell::sync::OnceCell;
>>   use tokio::runtime::Runtime;
>>   
>> @@ -13,7 +14,7 @@ use
>> pbs_datastore::cached_chunk_reader::CachedChunkReader;
>>   use pbs_datastore::data_blob::DataChunkBuilder;
>>   use pbs_datastore::fixed_index::FixedIndexReader;
>>   use pbs_datastore::index::IndexFile;
>> -use pbs_datastore::read_chunk::ReadChunk;
>> +use pbs_datastore::read_chunk::AsyncReadChunk;
>>   use pbs_datastore::BackupManifest;
>>   use pbs_key_config::load_and_decrypt_key;
>>   use pbs_tools::crypt_config::CryptConfig;
>> @@ -29,6 +30,9 @@ struct ImageAccessInfo {
>>       archive_size: u64,
>>   }
>>   
>> +// use this many buffered futures to make loading of chunks more
>> concurrent
>> +const MAX_BUFFERED_FUTURES: usize = 16;
>> +
>>   pub(crate) struct RestoreTask {
>>       setup: BackupSetup,
>>       runtime: Arc<Runtime>,
>> @@ -165,26 +169,53 @@ impl RestoreTask {
>>   
>>           let start_time = std::time::Instant::now();
>>   
>> -        for pos in 0..index.index_count() {
>> -            let digest = index.index_digest(pos).unwrap();
>> +        let read_queue = (0..index.index_count()).map(|pos| {
>> +            let digest = *index.index_digest(pos).unwrap();
>>               let offset = (pos * index.chunk_size) as u64;
>> -            if digest == &zero_chunk_digest {
>> -                let res = write_zero_callback(offset,
>> index.chunk_size as u64);
>> -                if res < 0 {
>> -                    bail!("write_zero_callback failed ({})", res);
>> +            let chunk_reader = chunk_reader.clone();
>> +            async move {
>> +                let chunk = if digest == zero_chunk_digest {
>> +                    None
>> +                } else {
>> +                    let raw_data = tokio::task::spawn(async move {
>> +                        AsyncReadChunk::read_chunk(&chunk_reader,
>> &digest).await
>> +                    })
>> +                    .await??;
>> +                    Some(raw_data)
>> +                };
>> +
>> +                Ok::<_, Error>((chunk, offset))
>> +            }
>> +        });
>> +
>> +        // this buffers futures and pre-fetches some chunks for us
>> +        let mut stream =
>> futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTUR
>> ES);
>> +
>> +        let mut count = 0;
>> +        while let Some(res) = stream.next().await {
>> +            let res = res?;
>> +            match res {
>> +                (None, offset) => {
>> +                    let res = write_zero_callback(offset,
>> index.chunk_size as u64);
>> +                    if res < 0 {
>> +                        bail!("write_zero_callback failed ({})",
>> res);
>> +                    }
>> +                    bytes += index.chunk_size;
>> +                    zeroes += index.chunk_size;
>>                   }
>> -                bytes += index.chunk_size;
>> -                zeroes += index.chunk_size;
>> -            } else {
>> -                let raw_data = ReadChunk::read_chunk(&chunk_reader,
>> digest)?;
>> -                let res = write_data_callback(offset, &raw_data);
>> -                if res < 0 {
>> -                    bail!("write_data_callback failed ({})", res);
>> +                (Some(raw_data), offset) => {
>> +                    let res = write_data_callback(offset,
>> &raw_data);
>> +                    if res < 0 {
>> +                        bail!("write_data_callback failed ({})",
>> res);
>> +                    }
>> +                    bytes += raw_data.len();
>>                   }
>> -                bytes += raw_data.len();
>>               }
>> +
>> +            count += 1;
>> +
>>               if verbose {
>> -                let next_per = ((pos + 1) * 100) /
>> index.index_count();
>> +                let next_per = (count * 100) / index.index_count();
>>                   if per != next_per {
>>                       eprintln!(
>>                           "progress {}% (read {} bytes, zeroes = {}%
>> ({} bytes), duration {} sec)",
> 



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
  2025-07-08 10:58   ` Dominik Csapak
@ 2025-07-08 15:08     ` Adam Kalisz via pve-devel
       [not found]     ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>
  1 sibling, 0 replies; 7+ messages in thread
From: Adam Kalisz via pve-devel @ 2025-07-08 15:08 UTC (permalink / raw)
  To: Dominik Csapak, Proxmox VE development discussion; +Cc: Adam Kalisz

[-- Attachment #1: Type: message/rfc822, Size: 12998 bytes --]

From: Adam Kalisz <adam.kalisz@notnullmakers.com>
To: Dominik Csapak <d.csapak@proxmox.com>, Proxmox VE development discussion <pve-devel@lists.proxmox.com>
Subject: Re: [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
Date: Tue, 08 Jul 2025 17:08:11 +0200
Message-ID: <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>

On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
> On 7/8/25 12:04, Adam Kalisz wrote:
> > Hi Dominik,
> > 
> 
> Hi,
> 
> > this is a big improvement, I have done some performance
> > measurements
> > again:
> > 
> > Ryzen:
> > 4 worker threads:
> > restore image complete (bytes=53687091200, duration=52.06s,
> > speed=983.47MB/s)
> > 8 worker threads:
> > restore image complete (bytes=53687091200, duration=50.12s,
> > speed=1021.56MB/s)
> > 
> > 4 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=54.00s,
> > speed=948.22MB/s)
> > 8 worker threads, 4 max-blocking:
> > restore image complete (bytes=53687091200, duration=50.43s,
> > speed=1015.25MB/s)
> > 8 worker threads, 4 max-blocking, 32 buffered futures:
> > restore image complete (bytes=53687091200, duration=52.11s,
> > speed=982.53MB/s)
> > 
> > Xeon:
> > 4 worker threads:
> > restore image complete (bytes=10737418240, duration=3.06s,
> > speed=3345.97MB/s)
> > restore image complete (bytes=107374182400, duration=139.80s,
> > speed=732.47MB/s)
> > restore image complete (bytes=107374182400, duration=136.67s,
> > speed=749.23MB/s)
> > 8 worker threads:
> > restore image complete (bytes=10737418240, duration=2.50s,
> > speed=4095.30MB/s)
> > restore image complete (bytes=107374182400, duration=127.14s,
> > speed=805.42MB/s)
> > restore image complete (bytes=107374182400, duration=121.39s,
> > speed=843.59MB/s)
> 
> just for my understanding: you left the parallel futures at 16 and
> changed the threads in the tokio runtime?

Yes, that's correct.

> The biggest issue here is that we probably don't want to increase
> that number by default by much, since on e.g. a running system this
> will have an impact on other running VMs. Adjusting such a number
> (especially in a way where it's now actually used in contrast to
> before) will come as a surprise for many.
> 
> That's IMHO the biggest challenge here, that's why I did not touch
> the tokio runtime thread settings, to not increase the load too much.
> 
> Did you by any chance observe the CPU usage during your tests?
> As I wrote in my commit message, the cpu usage quadrupled
> (proportional to the more chunks we could put through) when using 16
> fetching tasks.

Yes, please see the mpstat attachments. The one with yesterday's date
is the first patch; the two from today are today's patch, the first without
changes and the second with 8 worker threads. All of them use 16
buffered futures.

> Just an additional note: With my solution,the blocking threads should
> not have any impact at all, since the fetching should be purely
> async (so no blocking code anywhere) and the writing is done in the
> main thread/task in sequence so no blocking threads will be used
> except one.

I actually see a slight negative impact, but that's based on very few
runs.

> > On the Ryzen system I was hitting:
> > With 8-way concurrency, 16 max-blocking threads:
> > restore image complete (bytes=53687091200, avg fetch
> > time=24.7342ms,
> > avg time per nonzero write=1.6474ms, storage nonzero total write
> > time=19.996s, duration=45.83s, speed=1117.15MB/s)
> > -> this seemed to be the best setting for this system
> > 
> > It seems the counting of zeroes works in some kind of steps (seen
> > on the Xeon system with mostly incompressible data):
> > 
> 
> yes, only whole zero chunks will be counted.
> 
> [snip]
> > 
> > Especially during a restore the speed is quite important if you
> > need to hit Restore Time Objectives under SLAs. That's why we were
> > targeting 1 GBps for incompressible data.
> 
> I get that, but this will always be a tradeoff between CPU load and
> throughput and we have to find a good middle ground here.

Sure, I am not disputing that for the default configuration.

> IMO with my current patch, we have a very good improvement already,
> without increasing the (theoretical) load to the system.
> 
> It could be OK from my POV to make the number of threads of the
> runtime configurable e.g. via vzdump.conf. (That's a thing that's
> easily explainable in the docs for admins)

That would be great, because some users have other priorities depending
on the operational situation. The nice thing about the ENV override in
my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8
qmrestore ... they can change the priority of the restore ad hoc: if
they really need to restore as quickly as possible, they can throw
more threads at the problem within some reasonable bounds. In some
cases they do a restore of a VM that isn't running and the resources it
would be using on the system are available for use.

Again, I agree the defaults should be conservative.

> If someone else (@Fabian?) wants to chime in to this discussion,
> I'd be glad.
> 
> Also feedback on my code in general would be nice ;)
> (There are probably better ways to make this concurrent in an
> async context, e.g. maybe using 'async-channel' + fixed number
> of tasks ?)

Having more timing information about how long the fetches and writes of
nonzero chunks take would be great for making informed estimates about
the effect of any performance settings. It would also
help with the benchmarking right now to see where we are saturated.

Do I understand correctly that, when writing a chunk to storage with
write_data_callback or write_zero_callback, we are blocking a worker
from fetching chunks?

From what I gathered, we have to have a single thread that writes the
VM image because otherwise we would have problems with concurrent
access.
We need to feed this thread as well as possible with a mix of zero
chunks and non-zero chunks. The zero chunks are cheap because we
generate them based on information we already have in memory. The non-
zero chunks we have to fetch over the network (or from cache, which is
still more expensive than zero chunks). If I understand correctly, if
we fill the read queue with non-zero chunks and wait for them to become
available we will not be writing any possible zero chunks that come
after it to storage and our bottleneck storage writer thread will be
idle and hungry for more chunks.

My original solution basically wrote all the zero chunks first and then
started working through the non-zero chunks. This split seems to mostly
avoid the cost difference between zero and non-zero chunks keeping
futures slots occupied. However I have not considered the memory
consumption of the chunk_futures vector which might grow very big for
multi TB VM backups. Still, the idea of knowing about the cheap filler
zero chunks we might always write if we don't have any non-zero chunks
available is perhaps not so bad, especially for NVMe systems. For
hard drives I imagine the linear strategy might be faster because it
should avoid some expensive seeks.
Would it make sense to have a reasonable buffer of zero chunks ready
for writing while we fetch non-zero chunks over the network?

Is this thought process correct?


Btw. I am not on the PBS list, so to avoid getting stuck in a queue
there I am posting only to PVE devel.

Adam

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
       [not found]     ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>
@ 2025-07-10 12:48       ` Dominik Csapak
  2025-07-11  8:21         ` Dominik Csapak
  0 siblings, 1 reply; 7+ messages in thread
From: Dominik Csapak @ 2025-07-10 12:48 UTC (permalink / raw)
  To: Adam Kalisz, Proxmox VE development discussion

On 7/8/25 17:08, Adam Kalisz wrote:
> On Tue, 2025-07-08 at 12:58 +0200, Dominik Csapak wrote:
>> On 7/8/25 12:04, Adam Kalisz wrote:
[snip]
>>>
>>> Especially during a restore the speed is quite important if you
>>> need to hit Restore Time Objectives under SLAs. That's why we were
>>> targeting 1 GBps for incompressible data.
>>
>> I get that, but this will always be a tradeoff between CPU load and
>> throughput and we have to find a good middle ground here.
> 
> Sure, I am not disputing that for the default configuration.
> 
>> IMO with my current patch, we have a very good improvement already,
>> without increasing the (theoretical) load to the system.
>>
>> It could be OK from my POV to make the number of threads of the
>> runtime configurable e.g. via vzdump.conf. (That's a thing that's
>> easily explainable in the docs for admins)
> 
> That would be great, because some users have other priorities depending
> on the operational situation. The nice thing about the ENV override in
> my submission is that if somebody runs PBS_RESTORE_CONCURRENCY=8
> qmrestore ... they can change the priority of the restore ad-hoc e.g.
> if they really need to restore as quickly as possibly they can throw
> more threads at the problem within some reasonable bounds. In some
> cases they do a restore of a VM that isn't running and the resources it
> would be using on the system are available for use.
> 
> Again, I agree the defaults should be conservative.

Ok, I'll probably send a new version soon that will make the threads
and futures configurable via env variables and leave the default at
16 futures with 4 threads (which seems to be a good start).
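
A rough sketch of how such an env override could look; the variable
names PBS_RESTORE_WORKER_THREADS and PBS_RESTORE_FETCH_TASKS are only
placeholders here, not necessarily what a final patch would use:

use tokio::runtime::{Builder, Runtime};

fn env_usize(name: &str, default: usize) -> usize {
    std::env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn build_restore_runtime() -> std::io::Result<Runtime> {
    // default of 4 worker threads, overridable at restore time
    let worker_threads = env_usize("PBS_RESTORE_WORKER_THREADS", 4);
    Builder::new_multi_thread()
        .worker_threads(worker_threads)
        .enable_all()
        .build()
}

// and where the chunk stream is built:
// let buffered = env_usize("PBS_RESTORE_FETCH_TASKS", 16);
// let mut stream = futures::stream::iter(read_queue).buffer_unordered(buffered);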

> 
>> If someone else (@Fabian?) wants to chime in to this discussion,
>> I'd be glad.
>>
>> Also feedback on my code in general would be nice ;)
>> (There are probably better ways to make this concurrent in an
>> async context, e.g. maybe using 'async-channel' + fixed number
>> of tasks ?)
> 
> Having more timing information about how long the fetches and writes of
> nonzero chunks take would be great for doing informed estimates about
> the effect of performance settings should there be any. It would also
> help with the benchmarking right now to see where we are saturated.

yes, some tracing information would be really nice. I'm currently a bit
pressed for time, but will look into that.

> 
> Do I understand correctly that we are blocking a worker when writing a
> chunk to storage with the write_data_callback or write_zero_callback
> from fetching chunks?

yes. Technically we should call something like 'spawn_blocking' or
'block_in_place' so the executor won't get stuck, but in most cases it
will work out just fine.
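
For example, the write in the loop could be wrapped roughly like this (a
sketch only, reusing the names already present in restore.rs; it assumes
the multi-threaded runtime, since block_in_place panics on a
current-thread runtime):

// inside the `while let Some(res) = stream.next().await` loop:
let res = tokio::task::block_in_place(|| write_data_callback(offset, &raw_data));
if res < 0 {
    bail!("write_data_callback failed ({})", res);
}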

> 
>  From what I gathered, we have to have a single thread that writes the
> VM image because otherwise we would have problems with concurrent
> access.

Currently it's that way because the default behavior for Rust is
to prevent void pointers from being used across threads (for good reason!)

Though I think in this case, the callback is static, so we should be able
to call it from multiple threads (assuming the qemu layer can handle that)

For that I have to rewrite some parts of the qemu-side integration too,
so that it does not modify the callback data in the C callback itself
(which is not a problem)

Then I can test if writing in multiple threads makes a difference.

> We need to feed this thread as well as possible with a mix of zero
> chunks and non-zero chunks. The zero chunks are cheap because we
> generate them based on information we already have in memory. The non-
> zero chunks we have to fetch over the network (or from cache, which is
> still more expensive than zero chunks). If I understand correctly, if
> we fill the read queue with non-zero chunks and wait for them to become
> available we will not be writing any possible zero chunks that come
> after it to storage and our bottleneck storage writer thread will be
> idle and hungry for more chunks.
> 
> My original solution basically wrote all the zero chunks first and then
> started working through the non-zero chunks. This split seems to mostly
> avoid the cost difference between zero and non-zero chunks keeping
> futures slots occupied. However I have not considered the memory
> consumption of the chunk_futures vector which might grow very big for
> multi TB VM backups. However the idea of knowing about the cheap filler
> zero chunks we might always write if we don't have any non-zero chunks
> available is perhaps not so bad, especially for NVMe systems. For
> harddrives I imagine the linear strategy might be faster because it
> avoids should in my imagination avoid some expensive seeks.
> Would it make sense to have a reasonable buffer of zero chunks ready
> for writing while we fetch non-zero chunks over the network?
> 
> Is this thought process correct?

I think so. The issue with such optimizations is IMO that no matter
which case we optimize for, some other cases will suffer for it.
Since the systems out there are very diverse, I'd like to not optimize
that at all, and simply write the chunks in the order we get them from the index.

There are also other settings that play a role here, e.g. the target storage.
If that is configured to be able to skip zeros, the zero write callback
will actually do nothing at all (e.g. when the target is a file
storage with qcow2 files)

Just for the record, I also benchmarked a slower system here:
6x16 TiB spinners in RAID-10 with NVMe special devices
over a 2.5G link:

current approach is ~61 MiB/s restore speed;
with my patch it's ~160 MiB/s restore speed with not much increase
in CPU time (both were under 30% of a single core).

Also did perf stat for those to compare how much overhead the additional futures/async/await
brings:


first restore:

         62,871.24 msec task-clock                       #    0.115 CPUs utilized
           878,151      context-switches                 #   13.967 K/sec
            28,205      cpu-migrations                   #  448.615 /sec
           519,396      page-faults                      #    8.261 K/sec
   277,239,999,474      cpu_core/cycles/                 #    4.410 G/sec                    (89.20%)
   190,782,860,504      cpu_atom/cycles/                 #    3.035 G/sec                    (10.80%)
   482,534,267,606      cpu_core/instructions/           #    7.675 G/sec                    (89.20%)
   188,659,352,613      cpu_atom/instructions/           #    3.001 G/sec                    (10.80%)
    46,913,925,346      cpu_core/branches/               #  746.191 M/sec                    (89.20%)
    19,251,496,445      cpu_atom/branches/               #  306.205 M/sec                    (10.80%)
       904,032,529      cpu_core/branch-misses/          #   14.379 M/sec                    (89.20%)
       621,228,739      cpu_atom/branch-misses/          #    9.881 M/sec                    (10.80%)
 1,633,142,624,469      cpu_core/slots/                  #   25.976 G/sec                    (89.20%)
   489,311,603,992      cpu_core/topdown-retiring/       #     29.7% Retiring                (89.20%)
    97,617,585,755      cpu_core/topdown-bad-spec/       #      5.9% Bad Speculation         (89.20%)
   317,074,236,582      cpu_core/topdown-fe-bound/       #     19.2% Frontend Bound          (89.20%)
   745,485,954,022      cpu_core/topdown-be-bound/       #     45.2% Backend Bound           (89.20%)
    57,463,995,650      cpu_core/topdown-heavy-ops/      #      3.5% Heavy Operations       # 26.2% Light Operations        (89.20%)
    88,333,173,745      cpu_core/topdown-br-mispredict/  #      5.4% Branch Mispredict      # 0.6% Machine Clears          (89.20%)
   217,424,427,912      cpu_core/topdown-fetch-lat/      #     13.2% Fetch Latency          # 6.0% Fetch Bandwidth         (89.20%)
   354,250,103,398      cpu_core/topdown-mem-bound/      #     21.5% Memory Bound           # 23.7% Core Bound              (89.20%)

     548.195368256 seconds time elapsed

      44.493218000 seconds user
      21.315124000 seconds sys


second restore:

         67,908.11 msec task-clock                       #    0.297 CPUs utilized
           856,402      context-switches                 #   12.611 K/sec
            46,539      cpu-migrations                   #  685.323 /sec
           942,002      page-faults                      #   13.872 K/sec
   300,757,558,837      cpu_core/cycles/                 #    4.429 G/sec                    (75.93%)
   234,595,451,063      cpu_atom/cycles/                 #    3.455 G/sec                    (24.07%)
   511,747,593,432      cpu_core/instructions/           #    7.536 G/sec                    (75.93%)
   289,348,171,298      cpu_atom/instructions/           #    4.261 G/sec                    (24.07%)
    49,993,266,992      cpu_core/branches/               #  736.190 M/sec                    (75.93%)
    29,624,743,216      cpu_atom/branches/               #  436.248 M/sec                    (24.07%)
       911,770,988      cpu_core/branch-misses/          #   13.427 M/sec                    (75.93%)
       811,321,806      cpu_atom/branch-misses/          #   11.947 M/sec                    (24.07%)
 1,788,660,631,633      cpu_core/slots/                  #   26.339 G/sec                    (75.93%)
   569,029,214,725      cpu_core/topdown-retiring/       #     31.4% Retiring                (75.93%)
   125,815,987,213      cpu_core/topdown-bad-spec/       #      6.9% Bad Speculation         (75.93%)
   234,249,755,030      cpu_core/topdown-fe-bound/       #     12.9% Frontend Bound          (75.93%)
   885,539,445,254      cpu_core/topdown-be-bound/       #     48.8% Backend Bound           (75.93%)
    86,825,030,719      cpu_core/topdown-heavy-ops/      #      4.8% Heavy Operations       # 26.6% Light Operations        (75.93%)
   116,566,866,551      cpu_core/topdown-br-mispredict/  #      6.4% Branch Mispredict      # 0.5% Machine Clears          (75.93%)
   135,276,276,904      cpu_core/topdown-fetch-lat/      #      7.5% Fetch Latency          # 5.5% Fetch Bandwidth         (75.93%)
   409,898,741,185      cpu_core/topdown-mem-bound/      #     22.6% Memory Bound           # 26.2% Core Bound              (75.93%)

     228.528573197 seconds time elapsed

      48.379229000 seconds user
      21.779166000 seconds sys



so the overhead for the additional futures was ~8% in cycles and ~6% in instructions,
which does not seem too bad.

> 
> 
> Btw. I am not on the PBS list, so to avoid getting stuck in a queue
> there I am posting only to PVE devel.
> 
> Adam



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [pve-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel
  2025-07-10 12:48       ` Dominik Csapak
@ 2025-07-11  8:21         ` Dominik Csapak
  0 siblings, 0 replies; 7+ messages in thread
From: Dominik Csapak @ 2025-07-11  8:21 UTC (permalink / raw)
  To: pve-devel

On 7/10/25 14:48, Dominik Csapak wrote:
[snip]
> 
> Just for the record i also benchmarked a slower system here:
> 6x16 TiB spinners in raid-10 with nvme special devices
> over a 2.5 g link:
> 
> current approach is ~61 MiB/s restore speed
> with my patch it's ~160MiB/s restore speed with not much increase
> in cpu time (both were under 30% of a single core)
> 
> Also did perf stat for those to compare how much overhead the additional futures/async/await
> brings:
> 
> 
> first restore:
> 
>          62,871.24 msec task-clock                       #    0.115 CPUs utilized
>            878,151      context-switches                 #   13.967 K/sec
>             28,205      cpu-migrations                   #  448.615 /sec
>            519,396      page-faults                      #    8.261 K/sec
>    277,239,999,474      cpu_core/cycles/                 #    4.410 G/sec (89.20%)
>    190,782,860,504      cpu_atom/cycles/                 #    3.035 G/sec (10.80%)
>    482,534,267,606      cpu_core/instructions/           #    7.675 G/sec (89.20%)
>    188,659,352,613      cpu_atom/instructions/           #    3.001 G/sec (10.80%)
>     46,913,925,346      cpu_core/branches/               #  746.191 M/sec (89.20%)
>     19,251,496,445      cpu_atom/branches/               #  306.205 M/sec (10.80%)
>        904,032,529      cpu_core/branch-misses/          #   14.379 M/sec (89.20%)
>        621,228,739      cpu_atom/branch-misses/          #    9.881 M/sec (10.80%)
> 1,633,142,624,469      cpu_core/slots/                  #   25.976 G/sec                    (89.20%)
>    489,311,603,992      cpu_core/topdown-retiring/       #     29.7% Retiring (89.20%)
>     97,617,585,755      cpu_core/topdown-bad-spec/       #      5.9% Bad Speculation (89.20%)
>    317,074,236,582      cpu_core/topdown-fe-bound/       #     19.2% Frontend Bound (89.20%)
>    745,485,954,022      cpu_core/topdown-be-bound/       #     45.2% Backend Bound (89.20%)
>     57,463,995,650      cpu_core/topdown-heavy-ops/      #      3.5% Heavy Operations       # 26.2% Light Operations        (89.20%)
>     88,333,173,745      cpu_core/topdown-br-mispredict/  #      5.4% Branch Mispredict      # 0.6% Machine Clears          (89.20%)
>    217,424,427,912      cpu_core/topdown-fetch-lat/      #     13.2% Fetch Latency          # 6.0% Fetch Bandwidth         (89.20%)
>    354,250,103,398      cpu_core/topdown-mem-bound/      #     21.5% Memory Bound           # 23.7% Core Bound              (89.20%)
> 
> 
>      548.195368256 seconds time elapsed
> 
> 
>       44.493218000 seconds user
>       21.315124000 seconds sys
> 
> second restore:
> 
>          67,908.11 msec task-clock                       #    0.297 CPUs utilized
>            856,402      context-switches                 #   12.611 K/sec
>             46,539      cpu-migrations                   #  685.323 /sec
>            942,002      page-faults                      #   13.872 K/sec
>    300,757,558,837      cpu_core/cycles/                 #    4.429 G/sec (75.93%)
>    234,595,451,063      cpu_atom/cycles/                 #    3.455 G/sec (24.07%)
>    511,747,593,432      cpu_core/instructions/           #    7.536 G/sec (75.93%)
>    289,348,171,298      cpu_atom/instructions/           #    4.261 G/sec (24.07%)
>     49,993,266,992      cpu_core/branches/               #  736.190 M/sec (75.93%)
>     29,624,743,216      cpu_atom/branches/               #  436.248 M/sec (24.07%)
>        911,770,988      cpu_core/branch-misses/          #   13.427 M/sec (75.93%)
>        811,321,806      cpu_atom/branch-misses/          #   11.947 M/sec (24.07%)
> 1,788,660,631,633      cpu_core/slots/                  #   26.339 G/sec                    (75.93%)
>    569,029,214,725      cpu_core/topdown-retiring/       #     31.4% Retiring (75.93%)
>    125,815,987,213      cpu_core/topdown-bad-spec/       #      6.9% Bad Speculation (75.93%)
>    234,249,755,030      cpu_core/topdown-fe-bound/       #     12.9% Frontend Bound (75.93%)
>    885,539,445,254      cpu_core/topdown-be-bound/       #     48.8% Backend Bound (75.93%)
>     86,825,030,719      cpu_core/topdown-heavy-ops/      #      4.8% Heavy Operations       # 26.6% Light Operations        (75.93%)
>    116,566,866,551      cpu_core/topdown-br-mispredict/  #      6.4% Branch Mispredict      # 0.5% Machine Clears          (75.93%)
>    135,276,276,904      cpu_core/topdown-fetch-lat/      #      7.5% Fetch Latency          # 5.5% Fetch Bandwidth         (75.93%)
>    409,898,741,185      cpu_core/topdown-mem-bound/      #     22.6% Memory Bound           # 26.2% Core Bound              (75.93%)
> 
> 
>      228.528573197 seconds time elapsed
> 
> 
>       48.379229000 seconds user
>       21.779166000 seconds sys
> 
> 
> So the overhead for the additional futures was ~8% in cycles and ~6% in instructions,
> which does not seem too bad.
> 

addendum:

The tests above sadly ran into a network limit of ~600 MBit/s (still
trying to figure out where the bottleneck in the network is...).

Tested again from a different machine that has a 10G link to the PBS mentioned above.
This time I restored to the 'null-co' driver from qemu, since the target storage was too slow.
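(null-co is qemu's built-in block driver that simply discards writes and completes them
immediately, so the restore speed is not limited by any real target storage.)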

anyways, the results are:

current code: restore ~75MiB/s
16 way parallel: ~528MiB/s (7x !)

CPU usage went up from <50% of one core to ~350% (like in my initial tests with a different setup).

perf stat output below:

current:

        183,534.85 msec task-clock                       #    0.409 CPUs utilized
           117,267      context-switches                 #  638.936 /sec
               700      cpu-migrations                   #    3.814 /sec
           462,432      page-faults                      #    2.520 K/sec
   468,609,612,840      cycles                           #    2.553 GHz
1,286,188,699,253      instructions                     #    2.74  insn per cycle
    41,342,312,275      branches                         #  225.256 M/sec
       846,432,249      branch-misses                    #    2.05% of all branches

     448.965517535 seconds time elapsed

     152.007611000 seconds user
      32.189942000 seconds sys

16 way parallel:

        228,583.26 msec task-clock                       #    3.545 CPUs utilized
           114,575      context-switches                 #  501.240 /sec
             6,028      cpu-migrations                   #   26.371 /sec
         1,561,179      page-faults                      #    6.830 K/sec
   510,861,534,387      cycles                           #    2.235 GHz
1,296,819,542,686      instructions                     #    2.54  insn per cycle
    43,202,234,699      branches                         #  189.000 M/sec
       828,196,795      branch-misses                    #    1.92% of all branches

      64.482868654 seconds time elapsed

     184.172759000 seconds user
      44.560342000 seconds sys

So still about ~8% more cycles and about the same number of instructions, but in much less time.
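
For anyone following along without the patch at hand, the "16 way parallel" case boils
down to roughly the following pattern (just a minimal sketch, not the actual patch code;
fetch_chunk/write_chunk are placeholders for the real chunk read/write):

// needs the tokio, futures and anyhow crates
use futures::stream::{self, StreamExt};

const MAX_BUFFERED_FUTURES: usize = 16;

async fn fetch_chunk(digest: [u8; 32]) -> Result<Vec<u8>, anyhow::Error> {
    let _ = digest; // placeholder for the remote chunk read
    Ok(Vec::new())
}

async fn write_chunk(_data: Vec<u8>) -> Result<(), anyhow::Error> {
    Ok(()) // placeholder for writing into the target image
}

async fn restore_chunks(digests: Vec<[u8; 32]>) -> Result<(), anyhow::Error> {
    let total = digests.len();
    let mut done = 0;

    // spawn each fetch as a tokio task so it keeps running in the background;
    // buffer_unordered keeps at most MAX_BUFFERED_FUTURES of them in flight
    let mut chunks = stream::iter(
        digests.into_iter().map(|d| tokio::spawn(fetch_chunk(d))),
    )
    .buffer_unordered(MAX_BUFFERED_FUTURES);

    while let Some(res) = chunks.next().await {
        write_chunk(res??).await?; // JoinError first, then the fetch error
        done += 1; // chunks finish out of order, so count completions instead of positions
        eprintln!("restored {done}/{total} chunks");
    }

    Ok(())
}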


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-11  8:21 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20250708084900.1068146-1-d.csapak@proxmox.com>
2025-07-08  8:52 ` [pve-devel] [pbs-devel] [RFC PATCH v2 proxmox-backup-qemu] restore: make chunk loading more parallel Dominik Csapak
2025-07-08  9:03   ` Thomas Lamprecht
2025-07-08 10:04 ` [pve-devel] " Adam Kalisz via pve-devel
     [not found] ` <9dc2c099169ee1ed64c274d64cc0a1c19f9f6c92.camel@notnullmakers.com>
2025-07-08 10:58   ` Dominik Csapak
2025-07-08 15:08     ` Adam Kalisz via pve-devel
     [not found]     ` <a148f2105e2e3e453e5503be61cf6ae0f28d0eba.camel@notnullmakers.com>
2025-07-10 12:48       ` Dominik Csapak
2025-07-11  8:21         ` Dominik Csapak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal