performance suggestion: sparse files

Wed Aug 27 06:45:49 EST 2003

On Tue, Aug 26, 2003 at 11:28:12AM -0700, Jon Howell wrote:
> So I was transferring a 2GB virtual machine disk image over a slow
> wireless link. Of course I used --sparse, to keep the image small on the
> destination end as well as on the source end.
> 
> Much to my surprise, I noticed that the transfer took a long time even
> when it got past the first 0.5GB of actually-populated file. A little
> sleuthing with strace revealed that the source rsync was dutifully reading
> block after block of zeros, sending them to ssh, which compressed them and
> sent them across the wire(less), where another rsync got the zero blocks,
> realized that they were sparse, and just bode its time until it could do
> one big seek to the next non-sparse block. ("bode its time"? Who writes
> like that?)

Had you been updating an existing image file, the blocks of
zeroes would have had matches and not been sent.  A workaround
if you do this again in future would be to create an original
file full of zeros on the destination first ($block_size here
is the number of 1024-byte blocks to write):

	dd if=/dev/zero of=$dest bs=1024 count=$block_size

> 
> Of course, it never survived to see that moment; a cruel SIGINT arrived
> and dispatched both rsyncs.
> 
> It seems like the right thing would be for the local end to skim past the
> zero blocks and send some metainformation, to avoid encrypting and
> transferring many GB of zeros.
> 
> I worked around the problem by adding -z to compress the stream first
> (blocks of zeros compress remarkably well), and that made the virtual disk
> image transfer go much faster. Of course, all of the .tgzs and .tbzs in
> the same transfer got slower waiting on the source CPU to compress the
> incompressible.

That is what I would have recommended.

> The obvious solution is to <music type=organ register=bass>change the
> protocol</music>, but that seems like a scary thing to do for a
> performance tweak. What about an option for "really-crappy-compression"?
> Something really cheezy (RLE) that can decide in a hurry whether to
> compress away a string of zeros, and if not, just send them raw. That way,
> performance on compressed files stays I/O bound even on systems with pokey
> CPUs, but sparse files are disk-bound on the source system (as they should
> be). (And, of course, --sparse would automatically promote the compression
> level to "really-crappy" if it was at "none" before.)

This is really only an issue when rsync hits a new file.  I
agree an RLE of the stream _sounds_ like a good idea.  But
even better might be an extra phantom block that represents
all zeros.  That too would require a protocol bump.
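
To make the idea concrete, here is a minimal sketch of a
zeros-only RLE in Python.  It is not rsync's wire format or
anything rsync implements; the token tags (ZERO_RUN, LITERAL),
the function names and the 1024-byte block size are all
assumptions made up for illustration.  The encoder only checks
whether a whole block is zero, so incompressible data costs one
comparison per block and is passed through raw.

```python
ZERO_RUN = 0   # hypothetical tag: a run of zero bytes, stored as a length only
LITERAL = 1    # hypothetical tag: raw bytes, passed through unchanged


def encode(data: bytes, block: int = 1024):
    """Yield (tag, payload) tokens, collapsing all-zero blocks into runs."""
    zero_block = bytes(block)
    run = 0  # length in bytes of the zero run currently being accumulated
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        if chunk == zero_block[:len(chunk)]:
            run += len(chunk)          # extend the run, emit nothing yet
        else:
            if run:
                yield (ZERO_RUN, run)  # flush the pending run first
                run = 0
            yield (LITERAL, chunk)     # not all zeros: send the block raw
    if run:
        yield (ZERO_RUN, run)


def decode(tokens) -> bytes:
    """Rebuild the original byte string from encode()'s tokens."""
    out = bytearray()
    for tag, payload in tokens:
        if tag == ZERO_RUN:
            out.extend(bytes(payload))  # bytes(n) is n zero bytes
        else:
            out.extend(payload)
    return bytes(out)
```

A receiver running with --sparse could in principle turn each
ZERO_RUN token into a seek rather than a write, which is close
to the "phantom block" idea above.  Either way both ends have to
agree on the token format, hence the protocol bump.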

> Well, okay, they shouldn't even be disk bound; the source system should be
> able to discover the sparsity of the file without making 1.5GB-worth of
> read calls. Does POSIX (or do specific OSes) offer a call that provides a
> map of allocated regions in the file?

There is no way in user-mode to distinguish between a sparse file and a
file full of zeroed blocks.

> Source rsync: 2.5.6
> Destination rsync: 2.5.5
> Diligence: I searched for 'sparse' in the faqomatic, the bug database, the
> current issues page, the TODO document, and the mailing list archive, and
> didn't find anything relevant; please don't flame if I missed an existing
> comment.
> 
> Thanks!
> 
>     --Jon

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt