Thought on large files

Brendan Grieve brendan at worldguard.com.au
Wed Jan 23 04:38:56 GMT 2008


Hi There,

I've been toying around with the code of rsync on and off for a while, 
and I had a thought that I would like some comments on. It's to do with 
very large files and disk space.

One of the common uses of rsync is to use it as a backup program. A 
client connects to the rsync server, and sends over any changed files. 
If the client has very large files that have changed marginally, then 
rsync efficiently only sends the changed bits.

On the server side, one may have it set up to create 'snapshots' of the 
existing data there by periodically hard-linking that data into another 
directory. There's plenty of documentation on the web about how to do 
this, so I won't go into it further.

This is very effective and uses quite little disk space since a file 
that does not change effectively doesn't take up any more disk space 
(not much more anyway), even if it exists now in many snapshots.
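As an aside, here is a minimal sketch of that hard-link snapshot scheme 
(in Python purely for illustration; the usual recipes use cp -al or 
rsync --link-dest, and the size/mtime quick-check here is an assumption 
borrowed from rsync's default behaviour):

```python
import os
import shutil

def snapshot(current, prev_snap, new_snap):
    """Create new_snap as a copy of the 'current' directory, hard-linking
    each file against prev_snap when it appears unchanged (compared here
    by size and whole-second mtime, like rsync's default quick check)."""
    os.makedirs(new_snap, exist_ok=True)
    for name in os.listdir(current):
        src = os.path.join(current, name)
        old = os.path.join(prev_snap, name)
        dst = os.path.join(new_snap, name)
        if (os.path.exists(old)
                and os.path.getsize(old) == os.path.getsize(src)
                and int(os.path.getmtime(old)) == int(os.path.getmtime(src))):
            os.link(old, dst)       # unchanged: costs only a directory entry
        else:
            shutil.copy2(src, dst)  # changed: a full new copy on disk
```

Every unchanged file is just another name for the same inode, which is 
why many snapshots of mostly-static data stay cheap.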

One place where this falls down is if the file is very large. Let's say 
the file, whatever it is, is a 10GB file, and that some small amount of 
data changes in it. This is efficiently sent across by rsync, BUT the 
rsync server side will correctly break the hard-link and create a new 
file with the changed bits. This means that if even 1 byte of that 10GB 
file changes, you now have to store the whole file again.
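To put some (purely illustrative, assumed) numbers on that, compare 30 
daily snapshots of one 10GB file under whole-file hard links versus a 
hypothetical 1MB-chunked store:

```python
# Illustrative arithmetic only: figures are assumptions, not measurements.
file_size_gb = 10
snapshots = 30
changed_per_day_mb = 1   # assume ~1 MB of the file really changes each day

# Whole-file hard links: any change breaks the link, so every snapshot
# stores the full 10 GB again.
whole_file_cost_gb = snapshots * file_size_gb

# Hypothetical 1 MB chunk store: each snapshot only re-stores the
# chunks that actually changed.
chunked_cost_gb = file_size_gb + snapshots * changed_per_day_mb / 1024

print(whole_file_cost_gb)           # 300
print(round(chunked_cost_gb, 2))    # 10.03
```

Three hundred gigabytes versus roughly ten, for the same logical data.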

I won't get into the whole issue of why one would have big files etc... 
I see it all the time, especially in the Microsoft world, with Outlook 
PST files, and Microsoft Exchange Database files.

My thought was that if the server could transparently break a 
large file into chunks and store them that way, then one can still make 
use of hard-links efficiently.

For example, going back to a 10GB Exchange database file, it's likely not 
going to change too much during use. So if the server stored the huge 
clumsy 'priv1.edb' as:
  .priv1.edb._somemagicstring_.1
  .priv1.edb._somemagicstring_.2
etc...

and intelligently only broke the 'hard-links' of the bits that actually 
change, then it all works well. One could have an option to enable this 
for files bigger than a certain size, and break them into specific sized 
chunks.
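A rough sketch of what that server-side chunk store could look like 
(Python for illustration only; the 1MB chunk size, the naming scheme and 
the hash comparison are all assumptions, and a real implementation would 
live in rsync's C receiver code):

```python
import hashlib
import os

CHUNK = 1 << 20  # 1 MB chunks -- an assumed, tunable size

def _digest(path):
    """SHA-1 of a file's contents, read in blocks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def store_chunked(src_path, snap_dir, prev_dir, magic="_somemagicstring_"):
    """Hypothetical server-side store: split src_path into fixed-size
    chunk files named .<name>.<magic>.<n>, hard-linking any chunk whose
    content matches the same-numbered chunk in the previous snapshot."""
    name = os.path.basename(src_path)
    os.makedirs(snap_dir, exist_ok=True)
    n = 1
    with open(src_path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            chunk_name = ".%s.%s.%d" % (name, magic, n)
            new_chunk = os.path.join(snap_dir, chunk_name)
            old_chunk = os.path.join(prev_dir, chunk_name)
            if (os.path.exists(old_chunk)
                    and _digest(old_chunk) == hashlib.sha1(data).hexdigest()):
                os.link(old_chunk, new_chunk)  # unchanged chunk: share blocks
            else:
                with open(new_chunk, "wb") as out:
                    out.write(data)            # changed chunk: new copy
            n += 1
```

With this, a 1-byte change to a 10GB file costs one new 1MB chunk in the 
new snapshot; every other chunk is just another hard link.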

One could quite rightly argue that this changes rsync from a tool that 
synchronizes data between places to a dedicated backup tool (as the two 
sides will now have physically different data), however I could see it 
being useful, especially since it wouldn't need changes on the client 
side as the server still presents it as just one file.

What are your comments? Good idea? Stupid idea? Been done before? Does 
anyone have some hints about where in the code I should look to make 
these changes so I can test it out?



Brendan Grieve

