performance suggestion: sparse files

Tue Sep 9 13:48:16 EST 2003

On 26 Aug 2003 jw schultz <jw at pegasys.ws> wrote:

> On Tue, Aug 26, 2003 at 11:28:12AM -0700, Jon Howell wrote:
> > I worked around the problem by adding -z to compress the stream
> > first (blocks of zeros compress remarkably well), and that made the
> > virtual disk image transfer go much faster. Of course, all of the
> > .tgzs and .tbzs in the same transfer got slower waiting on the
> > source CPU to compress the incompressible.
> 
> That is what I would have recommended.
> 
> > The obvious solution is to <music type=organ register=bass>change
> > the protocol</music>, but that seems like a scary thing to do for a
> > performance tweak. What about an option for
> > "really-crappy-compression"? Something really cheezy (RLE) that can
> > decide in a hurry whether to compress away a string of zeros, and if
> > not, just send them raw. That way, performance on compressed files
> > stays I/O bound even on systems with pokey CPUs, but sparse files
> > are disk-bound on the source system (as they should be). (And, of
> > course, --sparse would automatically promote the compression level
> > to "really-crappy" if it was at "none" before.)
> 
> This is really only an issue when rsync hits a new file.  I
> agree an RLE of the stream _sounds_ like a good idea.  But
> even better might be an extra phantom block that represents
> all zeros.  That too would require a protocol bump.

I'd want to be convinced that this was really enough cheaper than -z1
to justify the complexity.

(For rdiff having cheap encoding of zeros would seem to make sense...)
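To make the "cheap encoding of zeros" idea concrete, here is a rough
sketch in Python.  It is not rsync's or rdiff's wire format; the token
layout, the function names and the 4096-byte block size are all made up
for illustration.  All-zero blocks collapse into a length, and anything
else is passed through raw without trying to compress it.

```python
BLOCK = 4096  # hypothetical block size, not taken from rsync or rdiff


def encode_zero_runs(data, block=BLOCK):
    """Return a list of tokens: ('Z', n) for n zero bytes, ('R', raw_bytes) otherwise."""
    tokens = []
    zero_block = bytes(block)
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        if chunk == zero_block[:len(chunk)]:
            # All-zero block: extend the previous zero run if there is one.
            if tokens and tokens[-1][0] == 'Z':
                tokens[-1] = ('Z', tokens[-1][1] + len(chunk))
            else:
                tokens.append(('Z', len(chunk)))
        else:
            # Anything else goes through untouched, so already-compressed
            # data costs one comparison per block and nothing more.
            tokens.append(('R', chunk))
    return tokens


def decode_zero_runs(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for kind, payload in tokens:
        if kind == 'Z':
            out.extend(bytes(payload))  # bytes(n) is n zero bytes
        else:
            out.extend(payload)
    return bytes(out)
```

Whether something like this is enough cheaper than -z1 to be worth it
would still have to be measured.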

> There is no way in user-mode to distinguish between a sparse file and
> a file full of zeroed blocks.

That is correct.

Actually, you can guess by comparing the allocated-blocks measure
(st_blocks from stat) with the file size, which suggests whether the
file is preallocated zeros or sparse.  That might be useful for
backups.  But there is no way around reading the blocks.
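
A minimal sketch of that guess in Python, assuming a POSIX system
where st_blocks is counted in 512-byte units.  The function names are
made up for illustration, and the answer is only a heuristic:
filesystems that compress data or store small files inline can report
fewer blocks for files that are not sparse at all.

```python
import os


def looks_sparse(size_bytes, blocks_512):
    """Guess sparseness: fewer bytes allocated than the file's apparent size."""
    return blocks_512 * 512 < size_bytes


def file_looks_sparse(path):
    """Apply the guess to a real file (st_blocks is POSIX-only)."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```

A file of preallocated zeros reports at least as many allocated bytes
as its size, so the guess comes out false for it; a file with holes
reports fewer.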

-- 
Martin