performance suggestion: sparse files

Wed Aug 27 04:28:12 EST 2003

So I was transferring a 2GB virtual machine disk image over a slow
wireless link. Of course I used --sparse, to keep the image small on the
destination end as well as on the source end.

Much to my surprise, I noticed that the transfer took a long time even
when it got past the first 0.5GB of actually-populated file. A little
sleuthing with strace revealed that the source rsync was dutifully reading
block after block of zeros, sending them to ssh, which encrypted them and
sent them across the wire(less), where another rsync got the zero blocks,
realized that they were sparse, and just bode its time until it could do
one big seek to the next non-sparse block. ("bode its time"? Who writes
like that?)

Of course, it never survived to see that moment; a cruel SIGINT arrived
and dispatched both rsyncs.

It seems like the right thing would be for the local end to skim past the
zero blocks and send some metainformation, to avoid encrypting and
transferring many GB of zeros.

I worked around the problem by adding -z to compress the stream first
(blocks of zeros compress remarkably well), and that made the virtual disk
image transfer go much faster. Of course, all of the .tgzs and .tbzs in
the same transfer got slower waiting on the source CPU to compress the
incompressible.

The obvious solution is to <music type=organ register=bass>change the
protocol</music>, but that seems like a scary thing to do for a
performance tweak. What about an option for "really-crappy-compression"?
Something really cheesy (RLE) that can decide in a hurry whether to
compress away a string of zeros, and if not, just send them raw. That way,
performance on compressed files stays I/O bound even on systems with pokey
CPUs, but sparse files are disk-bound on the source system (as they should
be). (And, of course, --sparse would automatically promote the compression
level to "really-crappy" if it was at "none" before.)
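
To make the "really-crappy-compression" idea concrete, here is a minimal
sketch in Python (not rsync's C, and not anything rsync actually
implements). The record format, the names encode/decode, and the 4096-byte
block size are all made up for illustration. Each block is checked once: a
run of all-zero blocks collapses into one small record, and anything else
is sent raw with a 9-byte header.

```python
import struct

# Hypothetical record format: 1-byte tag, 8-byte big-endian length.
HEADER = struct.Struct(">BQ")
RAW, ZEROS = 0, 1
BLOCK = 4096  # assumed block size, chosen arbitrarily for this sketch


def encode(data, block=BLOCK):
    """Collapse runs of all-zero blocks; pass every other block through raw."""
    out = bytearray()
    zero_run = 0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        if chunk.count(0) == len(chunk):
            # All zeros: just extend the pending run, emit nothing yet.
            zero_run += len(chunk)
            continue
        if zero_run:
            out += HEADER.pack(ZEROS, zero_run)
            zero_run = 0
        out += HEADER.pack(RAW, len(chunk)) + chunk
    if zero_run:
        out += HEADER.pack(ZEROS, zero_run)
    return bytes(out)


def decode(stream):
    """Rebuild the original bytes from the records produced by encode()."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        tag, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        if tag == ZEROS:
            out += bytes(length)
        else:
            out += stream[pos:pos + length]
            pos += length
    return bytes(out)
```

A real receiver running with --sparse would presumably seek forward on a
ZEROS record instead of materializing the zeros as decode() does here. The
point is only that the per-block decision is one cheap scan, so
already-compressed data costs a few header bytes per block rather than a
trip through a real compressor.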

Well, okay, they shouldn't even be disk bound; the source system should be
able to discover the sparsity of the file without making 1.5GB-worth of
read calls. Does POSIX (or do specific OSes) offer a call that provides a
map of allocated regions in the file?
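
For illustration only: some systems (Linux and Solaris among them) let
lseek() take SEEK_DATA and SEEK_HOLE as whence values, and Python exposes
them as os.SEEK_DATA and os.SEEK_HOLE. Assuming the platform and the
filesystem support them, a map of the populated regions could be sketched
as below. data_extents is a hypothetical helper name, and rsync 2.5.x does
nothing of the sort. On a filesystem without hole support, the whole file
is typically reported as one data region, so the result is a conservative
map rather than a wrong one.

```python
import errno
import os


def data_extents(path):
    """Return (offset, length) pairs for regions that may contain data.

    Assumes os.SEEK_DATA / os.SEEK_HOLE are available on this platform.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte that is not inside a hole.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole between offset and EOF
                raise
            # Find where this data region ends (EOF counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            offset = end
    finally:
        os.close(fd)
    return extents
```

With a map like this, the sender would only need to read and transmit the
listed extents, plus the file's total size so the far end knows where the
trailing hole stops.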

Source rsync: 2.5.6
Destination rsync: 2.5.5
Diligence: I searched for 'sparse' in the faqomatic, the bug database, the
current issues page, the TODO document, and the mailing list archive, and
didn't find anything relevant; please don't flame if I missed an existing
comment.

Thanks!

    --Jon