public inbox for pve-devel@lists.proxmox.com
 help / color / mirror / Atom feed
* [pve-devel] Improve container backup speed dramatically (factor 100-1000)
@ 2020-11-19 19:29 Carsten Härle
  2020-11-20  4:59 ` Dietmar Maurer
  0 siblings, 1 reply; 5+ messages in thread
From: Carsten Härle @ 2020-11-19 19:29 UTC (permalink / raw)
  To: pve-devel

Container backup is very slow compared to VM backup. I have a 500 GB container (sftp server) with minimally changing files, but even the incremental backups take 2 hours with heavy disk activity. Almost nothing is transferred to the backup server. It seems that it reads the whole container every time, without any optimization. Before, I did backups with zfs send, and there every differential backup took only a couple of seconds or minutes. 
 
See discussion here: https://forum.proxmox.com/threads/no-differantial-container-backup-with-big-containers.75676/#post-338868 
 
PBS is not storage agnostic but uses the underlying snapshot feature, according to the documentation: for containers, the underlying snapshot feature of the file system is used, so it already uses a ZFS feature. 
https://pve.proxmox.com/wiki/Backup_and_Restore 
 
For zfs file systems the set of changed files between snapshots can easily be listed with "zfs diff", so PBS should use this feature to speed up large container backups dramatically. In my case it would be about a factor of 1000! Alternatively, the snapshot data can be accessed via the hidden .zfs directories, so PBS knows which files changed and has access to both the old and the new data.

 

https://bugzilla.proxmox.com/show_bug.cgi?id=3138

 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000)
  2020-11-19 19:29 [pve-devel] Improve container backup speed dramatically (factor 100-1000) Carsten Härle
@ 2020-11-20  4:59 ` Dietmar Maurer
  2020-11-20  7:18   ` Carsten Härle
  0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Maurer @ 2020-11-20  4:59 UTC (permalink / raw)
  To: Proxmox VE development discussion, Carsten Härle


> Container backup is very slow compared to VM backup. I have a 500 GB container (sftp server) with minimally changing files, but even the incremental backups take 2 hours with heavy disk activity. Almost nothing is transferred to the backup server. It seems that it reads the whole container every time, without any optimization. Before, I did backups with zfs send, and there every differential backup took only a couple of seconds or minutes. 

Yes, that is how the current variable sized chunking algorithm works.
  
> See discussion here: https://forum.proxmox.com/threads/no-differantial-container-backup-with-big-containers.75676/#post-338868 
>  
> PBS is not storage agnostic but uses the underlying snapshot feature, according to the documentation: for containers, the underlying snapshot feature of the file system is used, so it already uses a ZFS feature. 
> https://pve.proxmox.com/wiki/Backup_and_Restore 

Yes, we use the snapshot feature. But the backup code is totally storage agnostic.

> For zfs file systems the set of changed files between snapshots can easily be listed with "zfs diff", so PBS should use this feature to speed up large container backups dramatically.

"zfs diff" does not provide the information needed for our deduplication 
algorithm, so we cannot use that. But if you have ideas how to make that work, 
please share them here.




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000)
  2020-11-20  4:59 ` Dietmar Maurer
@ 2020-11-20  7:18   ` Carsten Härle
  2020-11-20  8:27     ` Dominik Csapak
  0 siblings, 1 reply; 5+ messages in thread
From: Carsten Härle @ 2020-11-20  7:18 UTC (permalink / raw)
  To: Dietmar Maurer, Proxmox VE development discussion

>>Yes, that is how the current variable sized chunking algorithm works.
...  
"zfs diff" does not provide the information needed for our deduplication 
algorithm, so we cannot use that. 
<<

1) Can you please outline the algorithm?
2) Why do you think it is not possible to use the change information of the file system?
3) Why does differential backup work with VMs?





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000)
  2020-11-20  7:18   ` Carsten Härle
@ 2020-11-20  8:27     ` Dominik Csapak
  2020-11-20  8:29       ` Dominik Csapak
  0 siblings, 1 reply; 5+ messages in thread
From: Dominik Csapak @ 2020-11-20  8:27 UTC (permalink / raw)
  To: pve-devel

hi,

it seems there are some misunderstandings about how the backup actually 
works, i'll try to clear that up

On 11/20/20 8:18 AM, Carsten Härle wrote:
>>> Yes, that is how the current variable sized chunking algorithm works.
> ...
> "zfs diff" does not provide the information needed for our deduplication
> algorithm, so we cannot use that.
> <<
> 
> 1) Can you please outline the algorithm?

we have 2 different chunking methods:

* fixed-size chunks
* dynamic-sized chunks

fixed-size chunks, as the name implies, have a predefined, fixed size 
(e.g. 4M). in vm backups we can split the disk image into such blocks 
and calculate the hash of each one

this works well in that case, since fs on disk tend to not
move data around, meaning if you change a byte in a file,
that one chunk will be different, but the rest will be the same
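
as a toy illustration of that property (this is not the actual PBS code,
which is written in Rust; the 8-byte chunk size is made up here so the
effect stays visible):

```python
import hashlib

CHUNK = 8  # fixed chunk size; PBS uses e.g. 4 MiB, shrunk for illustration

def fixed_chunks(image: bytes):
    """Split a disk image into fixed-size blocks and hash each block;
    a hash already known to the server deduplicates to the old chunk."""
    return [hashlib.sha256(image[i:i + CHUNK]).hexdigest()
            for i in range(0, len(image), CHUNK)]

before = fixed_chunks(b"AAAAAAAABBBBBBBBCCCCCCCC")
after  = fixed_chunks(b"AAAAAAAABXBBBBBBCCCCCCCC")  # one byte changed
# only the block containing the changed byte gets a new hash
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
```

only the middle chunk hash differs, so the first and last chunks
deduplicate against the previous backup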

for dynamic sized chunks, we calculate what is called a 'rolling hash'[0]
over a window on the data and under certain circumstances, a chunk 
boundary is triggered, generating a chunk

neither of those chunking methods has any awareness or reference
to files

we use this for container backups in the following way

while iterating over the filesystem/directories, we generate
a so-called 'pxar' archive which is a streaming format
that contains metadata+data for a directory structure

while generating this data-stream we use the dynamic chunk
algorithm to generate chunks on that stream

this works well here, since if you modify/add a byte in a file,
all remaining data gets shifted, but the rolling hash will,
with a high degree of probability, find the same boundaries
it found before, and the remaining chunks will be the same
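
a toy sketch of such content-defined chunking (PBS uses a buzhash-style
rolling hash; the window size, mask and seeds below are invented for
illustration, and there is no minimum/maximum chunk size as in the real
implementation):

```python
import random

WINDOW = 16   # rolling-hash window in bytes (toy value)
MASK = 0x3F   # cut where the low 6 hash bits are zero -> ~64-byte chunks

# random per-byte table, deterministically seeded for the example
rng = random.Random(0)
T = [rng.getrandbits(32) for _ in range(256)]

def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def dynamic_chunks(data: bytes):
    """Cut a chunk wherever the rolling hash over the last WINDOW bytes
    matches the mask, so boundaries depend only on local content, not on
    absolute offsets in the stream."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl(h, 1) ^ T[b]                      # add newest byte
        if i >= WINDOW:
            h ^= rotl(T[data[i - WINDOW]], WINDOW)  # drop oldest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data_rng = random.Random(1)
orig = bytes(data_rng.getrandbits(8) for _ in range(4096))
edit = orig[:100] + b"X" + orig[100:]  # insert one byte near the start
c1, c2 = dynamic_chunks(orig), dynamic_chunks(edit)
```

inserting one byte shifts everything behind it, yet the cut decision only
looks at the window contents, so the boundaries (and therefore the
chunks) resynchronize shortly after the insertion and almost all chunks
stay identical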


> 2) Why do you think it is not possible to use the change information of the file system?

1. we would like to avoid making backup features storage-dependent

2. even if we had that data, we'd have to completely read the 
stream of the previous backup to insert the changes at the right
positions and generate a pxar stream that can be chunked.

but then we would have read the whole tree again, this time from
the backup server over the network (probably slower than the local fs)
and possibly had to decrypt it (not necessary when reading from the 
local fs)

so with the current pxar+dynamic chunking, this is really not
feasible

what could be possible (but is much work) is
to create a new archive+chunking method, where
the relation files<->chunks is a bit more relevant,
but i'd guess this would blow up our indexing file size
(if you have a million small files, you'd now have a million
more chunks to reference, whereas before there would be
fewer but bigger chunks that combined that data)

> 3) Why does differential backup work with VMs?

in vms we can have a 'dirty bitmap' which
tracks which fixed-size blocks were written to

since we split the disk image in the same chunk size
for the backup, there is a 1-to-1 mapping
of blocks written to, and blocks we have to backup
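
a minimal model of that 1-to-1 mapping (hypothetical names, 4-byte
blocks instead of 4 MiB, and the bitmap updated by hand rather than by
qemu):

```python
BLOCK = 4  # fixed block size; tiny here for illustration

def full_backup(disk: bytes):
    """Initial backup: store every fixed-size block, start with a clean bitmap."""
    blocks = [disk[i:i + BLOCK] for i in range(0, len(disk), BLOCK)]
    dirty = [False] * len(blocks)
    return blocks, dirty

def guest_write(disk: bytearray, dirty, offset: int, data: bytes):
    """A guest write marks every touched block dirty in the bitmap."""
    disk[offset:offset + len(data)] = data
    for blk in range(offset // BLOCK, (offset + len(data) - 1) // BLOCK + 1):
        dirty[blk] = True

def incremental_backup(disk: bytes, blocks, dirty):
    """Re-read and upload only the blocks marked dirty, then clear the bitmap."""
    uploaded = []
    for blk, is_dirty in enumerate(dirty):
        if is_dirty:
            blocks[blk] = disk[blk * BLOCK:(blk + 1) * BLOCK]
            uploaded.append(blk)
            dirty[blk] = False
    return uploaded

disk = bytearray(b"AAAABBBBCCCCDDDD")
blocks, dirty = full_backup(bytes(disk))
guest_write(disk, dirty, 5, b"XY")  # touches only block 1
uploaded = incremental_backup(bytes(disk), blocks, dirty)
```

since backup blocks and bitmap blocks use the same fixed size, the
incremental pass never has to read the untouched blocks at all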

i hope this makes it clearer, if you have any questions, ideas,
etc., feel free to ask

later today/next week, i'll take the time to put
what i have written above into the documentation,
so that we have a single point of reference we can point to in
the future


kind regards
Dominik




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000)
  2020-11-20  8:27     ` Dominik Csapak
@ 2020-11-20  8:29       ` Dominik Csapak
  0 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2020-11-20  8:29 UTC (permalink / raw)
  To: pve-devel

arg, of course i forgot the reference to the rolling hash info...

here it is:
0: https://en.wikipedia.org/wiki/Rolling_hash




^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-11-20  8:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-19 19:29 [pve-devel] Improve container backup speed dramatically (factor 100-1000) Carsten Härle
2020-11-20  4:59 ` Dietmar Maurer
2020-11-20  7:18   ` Carsten Härle
2020-11-20  8:27     ` Dominik Csapak
2020-11-20  8:29       ` Dominik Csapak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox
Service provided by Proxmox Server Solutions GmbH | Privacy | Legal