From: Dominik Csapak <d.csapak@proxmox.com>
To: "Proxmox Backup Server development discussion"
	<pbs-devel@lists.proxmox.com>,
	"Fabian Grünbichler" <f.gruenbichler@proxmox.com>
Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>,
	Thomas Lamprecht <t.lamprecht@proxmox.com>,
	Dietmar Maurer <dietmar@proxmox.com>
Subject: Re: [pbs-devel] [RFC PATCH proxmox-backup] datastore: implement consistency tuning for datastores
Date: Thu, 19 May 2022 15:49:50 +0200	[thread overview]
Message-ID: <d7e3b34e-e633-a5db-af1a-2fc3f319f96e@proxmox.com> (raw)
In-Reply-To: <1652950355.dp00h7vxx4.astroid@nora.none>

On 5/19/22 15:35, Fabian Grünbichler wrote:
> On May 18, 2022 1:24 pm, Dominik Csapak wrote:
>> currently, we don't (f)sync on chunk insertion (or at any point after
>> that), which can lead to broken chunks in case of e.g. an unexpected
>> power loss. To fix that, offer a tuning option for datastores that
>> controls the level of syncing it does:
>>
>> * None (old default): same as the current state, no (f)syncs done at any point
>> * Filesystem (new default): at the end of a backup, the datastore issues
>>    a syncfs(2) on the filesystem of the datastore
>> * File: issues an fsync on each chunk as it gets inserted
>>    (using our 'replace_file' helper)
>>
>> a small benchmark showed the following (times in mm:ss):
>> setup: virtual pbs, 4 cores, 8GiB memory, ext4 on spinner
>>
>> size                 none    filesystem  file
>> 2GiB (fits in RAM)   00:13   00:41       01:00
>> 33GiB                05:21   05:31       13:45
>>
>> so if the backup fits in memory, there is a large difference between all
>> of the modes (expected), but as soon as it exceeds the memory size,
>> the difference between not syncing and syncing the fs at the end becomes
>> much smaller.
>>
>> I also tested on an NVMe, but there the syncs basically made no difference
>>
>> Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
>> ---
>> it would be nice if anybody else could try to recreate the benchmarks on
>> different setups, to verify (or disprove) my findings
> 
> FWIW:
> 
> randfile on tmpfs as source, backed up as fidx
> randfile regenerated for every run, PBS restarted for every run
> 
> PBS in VM (8GB ram, disks on zvols on spinner + special + log), datastore on ext4:
> 
> SIZE: 4096 MODE: none Duration: 22.51s
> SIZE: 4096 MODE: filesystem Duration: 28.11s
> SIZE: 4096 MODE: file Duration: 54.47s
> 
> SIZE: 16384 MODE: none Duration: 202.42s
> SIZE: 16384 MODE: filesystem Duration: 275.36s
> SIZE: 16384 MODE: file Duration: 311.97s
> 
> same VM, datastore on single-disk ZFS pool:
> 
> SIZE: 1024 MODE: none Duration: 5.03s
> SIZE: 1024 MODE: file Duration: 22.91s
> SIZE: 1024 MODE: filesystem Duration: 15.57s
> 
> SIZE: 4096 MODE: none Duration: 41.02s
> SIZE: 4096 MODE: file Duration: 135.94s
> SIZE: 4096 MODE: filesystem Duration: 146.88s
> 
> SIZE: 16384 MODE: none Duration: 336.10s
> the rest ended in tears because of the restricted resources in the VM
> 
> PBS baremetal same as source, datastore on ZFS on spinner+special+log:
> 
> SIZE: 1024 MODE: none Duration: 4.90s
> SIZE: 1024 MODE: file Duration: 4.92s
> SIZE: 1024 MODE: filesystem Duration: 4.94s
> 
> SIZE: 4096 MODE: none Duration: 19.56s
> SIZE: 4096 MODE: file Duration: 31.67s
> SIZE: 4096 MODE: filesystem Duration: 38.54s
> 
> SIZE: 16384 MODE: none Duration: 189.77s
> SIZE: 16384 MODE: file Duration: 178.81s
> SIZE: 16384 MODE: filesystem Duration: 159.26s
> 
> ^^ this is rather unexpected, I suspect something messed with the 'none'
> case here, so I re-ran it:
> 
> SIZE: 1024 MODE: none Duration: 4.90s
> SIZE: 1024 MODE: file Duration: 4.92s
> SIZE: 1024 MODE: filesystem Duration: 4.98s
> SIZE: 4096 MODE: none Duration: 19.77s
> SIZE: 4096 MODE: file Duration: 19.68s
> SIZE: 4096 MODE: filesystem Duration: 19.61s
> SIZE: 16384 MODE: none Duration: 133.93s
> SIZE: 16384 MODE: file Duration: 146.88s
> SIZE: 16384 MODE: filesystem Duration: 152.94s
> 
> and once more with ~30GB (ARC is just 16G):
> 
> SIZE: 30000 MODE: none Duration: 368.58s
> SIZE: 30000 MODE: file Duration: 292.05s  (!!!)
> SIZE: 30000 MODE: filesystem Duration: 431.73s
> 
> repeated once more:
> 
> SIZE: 30000 MODE: none Duration: 419.75s
> SIZE: 30000 MODE: file Duration: 302.73s
> SIZE: 30000 MODE: filesystem Duration: 409.07s
> 
> so... rather weird? possibly noisy measurements though, as this is on my
> workstation ;)
> 
> PBS baremetal same as source, datastore on ZFS on NVME (no surprises
> there):
> 
> SIZE: 1024 MODE: file Duration: 4.92s
> SIZE: 1024 MODE: filesystem Duration: 4.95s
> SIZE: 1024 MODE: none Duration: 4.96s
> 
> SIZE: 4096 MODE: file Duration: 19.69s
> SIZE: 4096 MODE: filesystem Duration: 19.78s
> SIZE: 4096 MODE: none Duration: 19.67s
> 
> SIZE: 16384 MODE: file Duration: 81.39s
> SIZE: 16384 MODE: filesystem Duration: 78.86s
> SIZE: 16384 MODE: none Duration: 78.38s
> 
> SIZE: 30000 MODE: none Duration: 142.65s
> SIZE: 30000 MODE: file Duration: 143.43s
> SIZE: 30000 MODE: filesystem Duration: 143.15s
> 

so what I gather from these benchmarks is that on ext4, many fsyncs are more expensive than
a single syncfs, but on ZFS the two are very close together, leaning toward many fsyncs
being faster? (aside from the case where fsyncing was faster than not syncing at all ???)
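
for reference, a rough sketch of what the two strategies boil down to at the
syscall level (not the actual patch code: the real insert goes through our
'replace_file' helper, and this sketch assumes the libc crate for syncfs(2)
and skips the parent-directory fsync you would want for full durability):

    use std::fs::{self, File};
    use std::io::Write;
    use std::os::unix::io::AsRawFd;
    use std::path::Path;

    // "file" mode: write the chunk and fsync it before the insert counts as done
    fn insert_chunk_fsync(chunk_path: &Path, data: &[u8]) -> std::io::Result<()> {
        let tmp_path = chunk_path.with_extension("tmp");
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(data)?;
        tmp.sync_all()?; // fsync(2): one write barrier per chunk
        fs::rename(&tmp_path, chunk_path)?; // publish the finished chunk
        Ok(())
    }

    // "filesystem" mode: no per-chunk sync, one syncfs(2) when the backup finishes
    fn sync_datastore_fs(datastore_root: &Path) -> std::io::Result<()> {
        let dir = File::open(datastore_root)?; // opening a dir read-only works on Linux
        if unsafe { libc::syncfs(dir.as_raw_fd()) } == -1 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }

so 'file' pays one write barrier per chunk, while 'filesystem' defers everything
to a single flush of the whole filesystem at the end, which matches the ext4
numbers above.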

in any case, doing some kind of syncing *will* slow down backups one way or another
(leaving the weird ZFS case aside for the moment), so the question is whether we
make one of the new modes the default or not...

I'd put the modes in even if we leave the default as it is, so the admin can decide how much
crash consistency their PBS has.
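
just as an illustration (the names here are made up, not what the patch
actually uses), the knob itself could be as small as an enum with the chosen
default baked in, parsed from the datastore tuning string:

    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub enum SyncLevel {
        /// old behavior: no explicit syncing at all
        None,
        /// one syncfs(2) on the datastore filesystem when the backup finishes
        Filesystem,
        /// fsync(2) on every chunk as it is inserted
        File,
    }

    impl Default for SyncLevel {
        // whatever we decide the default should be lives here
        fn default() -> Self {
            SyncLevel::Filesystem
        }
    }

    impl SyncLevel {
        /// parse the value an admin would put into the tuning option
        pub fn parse(value: &str) -> Option<Self> {
            match value {
                "none" => Some(SyncLevel::None),
                "filesystem" => Some(SyncLevel::Filesystem),
                "file" => Some(SyncLevel::File),
                _ => None,
            }
        }
    }

    fn main() {
        // e.g. from a hypothetical tuning string like "sync-level=filesystem"
        let level = SyncLevel::parse("filesystem").unwrap_or_default();
        println!("configured sync level: {:?}", level);
    }

that way, whichever default we pick lives in exactly one place and the admin
can still override it per datastore.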

any other input? @wolfgang, @thomas, @dietmar?




Thread overview: 4+ messages
2022-05-18 11:24 Dominik Csapak
2022-05-19 13:35 ` Fabian Grünbichler
2022-05-19 13:49   ` Dominik Csapak [this message]
2022-05-20  7:07 ` Thomas Lamprecht
