* [pve-devel] Improve container backup speed dramatically (factor 100-1000) @ 2020-11-19 19:29 Carsten Härle 2020-11-20 4:59 ` Dietmar Maurer 0 siblings, 1 reply; 5+ messages in thread
From: Carsten Härle @ 2020-11-19 19:29 UTC (permalink / raw)
To: pve-devel

Container backup is very slow compared to VM backup. I have a 500 GB container (sftp server) with minimally changing files, but even the incremental backups take 2 hours with heavy disk activity. Almost nothing is transferred to the backup server. It seems that it reads the whole container every time, without any optimization. Before, when I did backups with zfs send, every differential backup took only a couple of seconds or minutes.

See discussion here: https://forum.proxmox.com/threads/no-differantial-container-backup-with-big-containers.75676/#post-338868

PBS is not storage agnostic but uses the underlying snapshot feature according to the documentation: for containers, the underlying snapshot feature of the file system is used, so it already uses a ZFS feature.
https://pve.proxmox.com/wiki/Backup_and_Restore

For ZFS file systems the set of files changed between snapshots can easily be displayed with "zfs diff", so PBS should use this feature to speed up large container backups dramatically. In my case it would be about factor 1000!

Alternatively, the snapshot data can be accessed via the hidden .zfs directories, so PBS knows which files changed and has access to both the old and the new data.

https://bugzilla.proxmox.com/show_bug.cgi?id=3138

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000) 2020-11-19 19:29 [pve-devel] Improve container backup speed dramatically (factor 100-1000) Carsten Härle @ 2020-11-20 4:59 ` Dietmar Maurer 2020-11-20 7:18 ` Carsten Härle 0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Maurer @ 2020-11-20 4:59 UTC (permalink / raw)
To: Proxmox VE development discussion, Carsten Härle

> Container backup is very slow compared to VM backup. I have a 500 GB container (sftp server) with minimally changing files, but even the incremental backups take 2 hours with heavy disk activity. Almost nothing is transferred to the backup server. It seems that it reads the whole container every time, without any optimization. Before, when I did backups with zfs send, every differential backup took only a couple of seconds or minutes.

Yes, that is how the current variable sized chunking algorithm works.

> See discussion here: https://forum.proxmox.com/threads/no-differantial-container-backup-with-big-containers.75676/#post-338868
>
> PBS is not storage agnostic but uses the underlying snapshot feature according to the documentation: for containers, the underlying snapshot feature of the file system is used, so it already uses a ZFS feature.
> https://pve.proxmox.com/wiki/Backup_and_Restore

Yes, we use the snapshot feature. But the backup code is totally storage agnostic.

> For ZFS file systems the set of files changed between snapshots can easily be displayed with "zfs diff", so PBS should use this feature to speed up large container backups dramatically.

"zfs diff" does not provide the information needed for our deduplication algorithm, so we cannot use that. But if you have ideas how to make that work, please share them here.

^ permalink raw reply	[flat|nested] 5+ messages in thread
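[Editorial note: to make the chunking trade-off discussed here concrete, below is a minimal, illustrative sketch in Python. It is not the actual PBS implementation (which is written in Rust and uses MiB-sized chunks); it only shows why fixed-size chunking deduplicates in-place block writes well but breaks down for a stream where an insertion shifts all subsequent data.]

```python
import hashlib

CHUNK_SIZE = 4  # tiny, for illustration only; real backups use MiB-sized chunks

def fixed_chunks(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

old = b"AAAABBBBCCCCDDDD"

# In-place overwrite (typical for a VM disk image): only the touched
# chunk hashes differently, so only one chunk must be uploaded.
overwritten = b"AAAABXBBCCCCDDDD"
diff = sum(a != b for a, b in zip(fixed_chunks(old), fixed_chunks(overwritten)))
print(diff)  # 1

# Insertion (typical for a file-archive stream): all later data shifts,
# so every following chunk hashes differently and deduplication is lost.
inserted = b"AAAAXBBBBCCCCDDDD"
diff = sum(a != b for a, b in zip(fixed_chunks(old), fixed_chunks(inserted)))
print(diff)  # 3 of the 4 compared chunks differ
```

Note that even when every chunk deduplicates, the client still has to read and hash the entire dataset to discover that, which is exactly the heavy local disk activity with little network transfer described in the original report.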
* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000) 2020-11-20 4:59 ` Dietmar Maurer @ 2020-11-20 7:18 ` Carsten Härle 2020-11-20 8:27 ` Dominik Csapak 0 siblings, 1 reply; 5+ messages in thread
From: Carsten Härle @ 2020-11-20 7:18 UTC (permalink / raw)
To: Dietmar Maurer, Proxmox VE development discussion

>> Yes, that is how the current variable sized chunking algorithm works.
>> ...
>> "zfs diff" does not provide the information needed for our deduplication algorithm, so we cannot use that. <<

1) Can you please outline the algorithm?
2) Why do you think it is not possible to use the changed information of the file system?
3) Why does differential backup work with VMs?

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000) 2020-11-20 7:18 ` Carsten Härle @ 2020-11-20 8:27 ` Dominik Csapak 2020-11-20 8:29 ` Dominik Csapak 0 siblings, 1 reply; 5+ messages in thread
From: Dominik Csapak @ 2020-11-20 8:27 UTC (permalink / raw)
To: pve-devel

hi,

it seems there are some misunderstandings about how the backup actually works, i'll try to clear that up

On 11/20/20 8:18 AM, Carsten Härle wrote:
>>> Yes, that is how the current variable sized chunking algorithm works.
> ...
> "zfs diff" does not provide the information needed for our deduplication
> algorithm, so we cannot use that.
> <<
>
> 1) Can you please outline the algorithm?

we have 2 different chunking methods:
* fixed-size chunks
* dynamic-size chunks

fixed-size chunks, as the name implies, have a predefined, fixed size (e.g. 4M). in vm backups we can split the disk image into such blocks and calculate the hash of each. this works well in that case, since filesystems on disk tend not to move data around, meaning if you change a byte in a file, that one chunk will be different, but the rest will be the same

for dynamic-size chunks, we calculate what is called a 'rolling hash'[0] over a window on the data, and under certain circumstances a chunk boundary is triggered, generating a chunk. neither of those chunking methods has any awareness of or reference to files

we use this for container backups in the following way: while iterating over the filesystem/directories, we generate a so-called 'pxar' archive, which is a streaming format that contains metadata+data for a directory structure. while generating this data stream we use the dynamic chunk algorithm to generate chunks on that stream

this works well here, since if you modify/add a byte in a file, all remaining data gets shifted over. the rolling hash will, with a high degree of probability, find a boundary again that it had before, and the remaining chunks will be the same

> 2) Why do you think it is not possible to use the changed
information of the file system?

1. we would like to avoid making features of the backup storage-dependent

2. even if we had that data, we'd have to completely read the stream of the previous backup to insert the changes in the right position and generate a pxar stream that can be chunked. but then we have read the whole tree again, this time from the backup server over the network (probably slower than the local fs), and possibly had to decrypt it (not necessary when reading again from the local fs)

so with the current pxar+dynamic chunking, this is really not feasible

what could be possible (but is much work) is to create a new archive+chunking method where the relation files<->chunks is a bit more relevant, but i'd guess this would blow up our indexing file size (if you have a million small files, you'd now have a million more chunks to reference, whereas before there would be fewer but bigger chunks that combined that data)

> 3) Why does differential backup work with VMs?

in vms we can have a 'dirty bitmap' which tracks which fixed-size blocks were written to. since we split the disk image into the same chunk size for the backup, there is a 1-to-1 mapping between blocks written to and blocks we have to back up

i hope this makes it clearer. if you have any questions, ideas, etc., feel free to ask

later today/next week, i'll take the time to write what i have written above into the documentation, so that we have a single point of reference we can point to in the future

kind regards
Dominik

^ permalink raw reply	[flat|nested] 5+ messages in thread
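[Editorial note: the "rolling hash finds a boundary again" behavior Dominik describes can be sketched as below. This is illustrative Python only, not the actual PBS chunker: the real implementation is in Rust, uses a proper incrementally updated rolling hash over a much larger window, and enforces minimum/maximum chunk sizes. The window size, boundary mask, and test data here are made up for the demo.]

```python
import hashlib
import random

WINDOW = 8            # sliding-window size (real chunkers use larger windows)
BOUNDARY_MASK = 0x1F  # ~32-byte average chunks here; PBS targets MiB-sized chunks

def window_hash(win: bytes) -> int:
    # O(WINDOW) per position for clarity; a real rolling hash is updated
    # incrementally in O(1) as the window slides over the stream.
    h = 0
    for b in win:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def cdc_chunks(data: bytes) -> list[bytes]:
    """Content-defined chunking: cut wherever the hash of the last WINDOW
    bytes matches a boundary pattern. Cut points depend only on local
    content, so an insertion perturbs nearby chunks but later cut points
    re-synchronize with the previous backup's chunks."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        if (window_hash(data[i - WINDOW + 1:i + 1]) & BOUNDARY_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)
old = bytes(random.randrange(256) for _ in range(2000))
new = old[:500] + b"XINSERTED" + old[500:]  # 9 bytes inserted mid-stream

old_hashes = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(old)}
new_chunks = cdc_chunks(new)
reused = sum(hashlib.sha256(c).hexdigest() in old_hashes for c in new_chunks)
print(f"{reused} of {len(new_chunks)} chunks already exist on the server")
```

Running this shows that nearly all chunks of the modified stream are already known; only the handful of chunks around the insertion point have to be uploaded. Note that this still requires reading and hashing the entire stream on the client, which is why dedup saves network traffic but not local read I/O.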
* Re: [pve-devel] Improve container backup speed dramatically (factor 100-1000) 2020-11-20 8:27 ` Dominik Csapak @ 2020-11-20 8:29 ` Dominik Csapak 0 siblings, 0 replies; 5+ messages in thread
From: Dominik Csapak @ 2020-11-20 8:29 UTC (permalink / raw)
To: pve-devel

arg, of course i forgot the reference to the rolling hash info... here it is:

0: https://en.wikipedia.org/wiki/Rolling_hash

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-11-20 8:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-19 19:29 [pve-devel] Improve container backup speed dramatically (factor 100-1000) Carsten Härle
2020-11-20  4:59 ` Dietmar Maurer
2020-11-20  7:18   ` Carsten Härle
2020-11-20  8:27     ` Dominik Csapak
2020-11-20  8:29       ` Dominik Csapak