public inbox for pbs-devel@lists.proxmox.com
From: Dominik Csapak <d.csapak@proxmox.com>
To: Proxmox Backup Server development discussion
	<pbs-devel@lists.proxmox.com>,
	Max Carrara <m.carrara@proxmox.com>
Subject: Re: [pbs-devel] [PATCH proxmox-backup v2 1/3] tape: introduce a tape backup read thread tuning option
Date: Wed, 8 May 2024 08:56:32 +0200	[thread overview]
Message-ID: <2d13fe7c-25be-4eff-9cc6-85b65e829c36@proxmox.com> (raw)
In-Reply-To: <D13IAVJ7MM88.ZZT9M6HM142C@proxmox.com>

On 5/7/24 17:10, Max Carrara wrote:
> On Tue May 7, 2024 at 9:29 AM CEST, Dominik Csapak wrote:
>> using a single thread for reading is not optimal in some cases, e.g.
>> when the underlying storage can handle more reads in parallel than with
>> a single thread.
>>
> 
> Got only a few tiny notes that are otherwise unremarkable:
> 
> Would prefer to reword this as:
> 
> Using a single thread for reading is not optimal in some cases, e.g.
> when the underlying storage can handle reads from multiple threads in
> parallel.
> 
>> We use the ParallelHandler to handle the actual reads. Make the
>> sync_channel buffer size depending on the number of threads so we have
>> space for two chunks per thread.
> 
> Would s/depending/depend/ for the above paragraph.
> 

OK to both comments, I will update in a v3 (if necessary),
or, if preferred, whoever commits this could fix up the commit message
when applying.

>>
>> How this impacts the backup speed largely depends on the underlying
>> storage and how the backup is laid out on it.
>>
>> I benchmarked the following setups:
>>
>> * Setup A: relatively spread out backup on a virtualized pbs on single HDDs
>> * Setup B: mostly sequential chunks on a virtualized pbs on single HDDs
>> * Setup C: backup on virtualized pbs on a fast NVME
>> * Setup D: backup on bare metal pbs with ZFS in a RAID10 with 6 HDDs
>>    and 2 fast special devices in a mirror
>>
>> (values are reported in MB/s as seen in the task log, caches were
>> cleared between runs, backups were bigger than the memory available)
>>
>> setup  1 thread  2 threads  4 threads  8 threads
>> A      55        70         80         95
>> B      110       89         100        108
>> C      294       294        294        294
>> D      118       180        300        300
>>
>> So there are cases where multiple read threads speed up the tape backup
>> (dramatically). On the other hand there are situations where reading
>> from a single thread is actually faster, probably because we can read
>> from the HDD sequentially.
>>
>> I used a new default value of '4' here since that gave good performance
>> on all setups (not necessarily the best) and we also have a default
>> value of '4' for verification threads.
>>
>> We use a struct here for the datastore since we want to introduce other
>> thread tuning options too.
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
> 
> The patch is pretty straightforward; there's one question inline, but
> otherwise this looks fine to me. Solid work!
> 
> Wasn't able to test this yet as setting up a VTL on Bookworm failed for
> me, unfortunately. Will try to test this tomorrow if possible.

It's probably obvious, but just to be sure:

make sure the target storage for the VTL is faster than your
source storage, otherwise you'll always be limited by that...

> 
> For now:
> 
> Reviewed-by: Max Carrara <m.carrara@proxmox.com>
> 
[snip]
>> diff --git a/src/tape/pool_writer/new_chunks_iterator.rs b/src/tape/pool_writer/new_chunks_iterator.rs
>> index 1454b33d2..a2a8091f6 100644
>> --- a/src/tape/pool_writer/new_chunks_iterator.rs
>> +++ b/src/tape/pool_writer/new_chunks_iterator.rs
>> @@ -6,8 +6,9 @@ use anyhow::{format_err, Error};
>>   use pbs_datastore::{DataBlob, DataStore, SnapshotReader};
>>   
>>   use crate::tape::CatalogSet;
>> +use crate::tools::parallel_handler::ParallelHandler;
>>   
>> -/// Chunk iterator which use a separate thread to read chunks
>> +/// Chunk iterator which uses separate threads to read chunks
>>   ///
>>   /// The iterator skips duplicate chunks and chunks already in the
>>   /// catalog.
>> @@ -25,7 +26,11 @@ impl NewChunksIterator {
>>           snapshot_reader: Arc<Mutex<SnapshotReader>>,
>>           catalog_set: Arc<Mutex<CatalogSet>>,
>>       ) -> Result<(std::thread::JoinHandle<()>, Self), Error> {
>> -        let (tx, rx) = std::sync::mpsc::sync_channel(3);
>> +        let read_threads = datastore
>> +            .get_thread_configuration()
>> +            .tape_backup_read_threads;
>> +
>> +        let (tx, rx) = std::sync::mpsc::sync_channel(read_threads * 2);
> 
> Is there any reason you're using `* 2` here? For example, is the
> throughput unaffected if you use a larger value, like `* 8`?
> 
> If the constant has an effect like that it should IMO be documented, but
> if not, then it can just stay like it is.

I did not benchmark that much here (it would have added another
dimension to the testing, with many possible values), but my thought
process was this:

If the source storage is much slower than what we can write to tape,
that buffer will be empty most of the time.

If the source storage is much faster than what we can write to tape, it
will be full most of the time.

In both cases the buffer size won't matter much (as long as there is a
bit of buffer).

The sweet spot is when the storage is about as fast as the tape writes.
In that case you want a bit more buffer than there are threads reading
from the storage, so that there is always a read going on and the
buffer never runs empty (an empty buffer slows down the tape write).

So 1x the number of threads seemed too low, but I think anything bigger
than 2x is just a waste of memory.

Does that make sense?

If we really want, I can run additional benchmarks, but my guess is
that it won't make much difference in practice.
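
To give a feeling for the setup I have in mind, here is a rough
standalone sketch (plain std only, not the actual PBS code; the chunk
sizes, sleep times and the reader/tape-writer stand-ins are made up
purely for illustration):

use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn main() {
    let read_threads: usize = 4;
    // two chunks of buffer per reader thread: enough that the writer
    // rarely finds the channel empty, without holding many full
    // chunks in memory
    let (tx, rx) = sync_channel::<Vec<u8>>(read_threads * 2);

    let mut readers = Vec::new();
    for id in 0..read_threads {
        let tx = tx.clone();
        readers.push(thread::spawn(move || {
            for _ in 0..16 {
                // stand-in for reading a chunk from the datastore
                thread::sleep(Duration::from_millis(5));
                let chunk = vec![id as u8; 4096];
                // blocks as soon as read_threads * 2 chunks are queued
                if tx.send(chunk).is_err() {
                    break;
                }
            }
        }));
    }
    drop(tx); // the readers hold the remaining senders

    // stand-in for the tape writer draining the buffer
    let mut written = 0;
    for chunk in rx {
        thread::sleep(Duration::from_millis(2));
        written += chunk.len();
    }
    println!("wrote {written} bytes");

    for r in readers {
        r.join().unwrap();
    }
}

With a capacity of read_threads * 2, send() in the readers only starts
to block once the writer falls behind, which is roughly the behaviour
I was aiming for with the patch.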


_______________________________________________
pbs-devel mailing list
pbs-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pbs-devel


