Thought on large files
brendan at worldguard.com.au
Wed Jan 23 04:38:56 GMT 2008
I've been toying around with the code of rsync on and off for a while,
and I had a thought that I would like some comments on. It's to do with
very large files and disk space.
One of the common uses of rsync is to use it as a backup program. A
client connects to the rsync server, and sends over any changed files.
If the client has very large files that have changed marginally, then
rsync efficiently only sends the changed bits.
On the server side, one may have it set up to create 'snapshots' of the
existing data by periodically hard-linking that data into another
directory. There's plenty of documentation on the web about how to do
this, so I won't go into it further.
This is very effective and uses very little disk space, since a file
that does not change takes up effectively no extra disk space (not much
more, anyway), even if it now exists in many snapshots.
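To make the inode sharing explicit, here is a minimal sketch of that snapshot trick in Python (the function name `snapshot` is my own; this is an illustration of the hard-link scheme, not rsync code):

```python
import os

def snapshot(current, snapdir):
    # Hypothetical sketch of the hard-link snapshot trick: mirror the
    # `current` tree into `snapdir`, hard-linking every file so that
    # unchanged data takes up no extra disk space.
    for root, dirs, files in os.walk(current):
        rel = os.path.relpath(root, current)
        dest = os.path.join(snapdir, rel)
        os.makedirs(dest, exist_ok=True)
        for name in files:
            # Both paths point at the same inode: no data is copied.
            os.link(os.path.join(root, name), os.path.join(dest, name))
```

In practice people usually do this with `cp -al` or rsync's own --link-dest option; the sketch just shows where the space saving comes from.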
One place where this falls down is when the file is very large. Let's
say the file, whatever it is, is a 10GB file, and that some small amount
of data changes in it. The change is efficiently sent across by rsync,
BUT the server side will (correctly) break the hard link and create a
new file with the changed bits. This means that if even 1 byte of that
10GB file changes, you now have to store the whole file again.
I won't get into the whole issue of why one would have big files etc...
I see it all the time, especially in the Microsoft world, with Outlook
PST files, and Microsoft Exchange Database files.
What my thoughts were is that if the server could transparently break a
large file into chunks and store them that way, then one can still make
use of hard-links efficiently.
For example, going back to a 10GB Exchange Database file, it's likely
not going to change too much during use. So if the server stored the
huge, clumsy 'priv1.edb' as a series of chunk files, and intelligently
only broke the hard links of the chunks that actually change, then it
all works well. One could have an option to enable this for files bigger
than a certain size, and break them into chunks of a specific size.
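The per-chunk idea could be sketched roughly like this (everything here is a hypothetical illustration, not rsync code: `store_chunked`, the `chunk.NNNNNN` naming, and the 4MB chunk size are all my own inventions):

```python
import os

CHUNK = 4 * 1024 * 1024  # hypothetical chunk size, made configurable below

def store_chunked(src_path, new_dir, prev_dir=None, chunk_size=CHUNK):
    # Store `src_path` as numbered chunk files in `new_dir`.  A chunk
    # that is byte-identical to the previous snapshot's chunk is
    # hard-linked; only chunks that actually changed get new storage.
    os.makedirs(new_dir, exist_ok=True)
    with open(src_path, 'rb') as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            name = 'chunk.%06d' % index
            new_chunk = os.path.join(new_dir, name)
            prev_chunk = os.path.join(prev_dir, name) if prev_dir else None
            if prev_chunk and os.path.exists(prev_chunk):
                with open(prev_chunk, 'rb') as p:
                    if p.read() == data:
                        # Unchanged chunk: share the inode with the
                        # previous snapshot, costing no extra space.
                        os.link(prev_chunk, new_chunk)
                        index += 1
                        continue
            with open(new_chunk, 'wb') as out:
                # Changed (or brand-new) chunk: write fresh data,
                # which is the only place the hard link is "broken".
                out.write(data)
            index += 1
```

So a 1-byte change in a 10GB file would cost one chunk's worth of new storage instead of 10GB, while every other chunk stays shared between snapshots.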
One could quite rightly argue that this changes rsync from a tool that
synchronizes data between places to a dedicated backup tool (as the two
sides will now have physically different data), however I could see it
being useful, especially since it wouldn't need changes on the client
side as the server still presents it as just one file.
What are your comments? Good idea? Stupid idea? Been done before? Does
anyone have some hints about where in the code I should look to make
these changes so I can test it out?