Question and feature requests for processor-bound systems

Evan Harris eharris at puremagic.com
Thu Aug 18 09:45:21 GMT 2005


Is there any way to disable the checksum block search in rsync, or to
somehow optimize it for systems that are processor-bound in addition to
being network-bound?

I'm using rsync on very low-power embedded systems to transfer files that
are sometimes comparatively large (a few hundred megs or more), and am
finding that just checksumming one such file on the sender takes tens of
minutes.

The systems in question have processors on the order of a Pentium 166, and
in tests I did the other day, syncing a single ~500meg file took between
15 and 20 minutes just for the checksum calculation.  When these systems
are potentially battery powered, keeping the system up for long periods at
full processor utilization is very expensive in power terms.

I couldn't find any such option, so I was trying to come up with a way to
reduce that cpu-bound problem without completely abandoning rsync.  Here
are some proposed solutions that I'm putting in as feature requests to
help avoid this issue.

Option 1: Add an option, maybe --optimize-append, that would optimize the
checksum search by telling rsync that it can assume files are probably
just appended to, like logfiles.  This would make rsync skip checksumming
the files entirely except for very rudimentary checking.  I would think a
good algorithm might be to checksum only the first and last block of an
existing file, and if those two blocks are the same, assume all
intervening data is also the same and just transfer the remaining data.
This is basically a hint that the file is only being appended to.  Then if
either of those blocks doesn't match, fall back to the full checksum
algorithm.
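To make the idea concrete, here's a rough sketch in Python of the check I
have in mind (the function name, block size, and use of md5 are just my
own illustration, not anything rsync actually does):

```python
import hashlib
import os

def appears_append_only(old_path, new_path, block_size=65536):
    """Hypothetical --optimize-append test: hash only the first and last
    block of the existing (old) file and compare them against the same
    offsets in the new file.  If both match, assume everything in between
    is unchanged and just send the appended tail; otherwise fall back to
    the normal full checksum search."""
    def block_digest(path, offset, size):
        with open(path, "rb") as f:
            f.seek(offset)
            return hashlib.md5(f.read(size)).digest()

    old_size = os.path.getsize(old_path)
    if old_size < 2 * block_size:
        return False  # too small to bother; use the normal algorithm

    first_ok = (block_digest(old_path, 0, block_size) ==
                block_digest(new_path, 0, block_size))
    last_off = old_size - block_size
    last_ok = (block_digest(old_path, last_off, block_size) ==
               block_digest(new_path, last_off, block_size))
    return first_ok and last_ok
```

So a ~500meg logfile that only grew would cost two block reads and two
hashes instead of a full pass over the file.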

Option 2: Add an option, maybe --checksum-block-skip=N, that would tell
rsync to only checksum every Nth block of the file.  This would still keep
most of the advantages of rsync, but would allow cpu-bound systems to
speed up the checksumming process at the expense of possibly not detecting
file differences if the differences fall between the blocks that are
checksummed.  This would basically be a hint that the only changes the
file should contain are insertions or deletions of data, but no updates of
blocks in-place.  This would also help systems that are disk-bound in
addition to being network- and cpu-bound, since rsync wouldn't have to
read every block of the file to send checksums.
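A sketch of the sender-side generation for this option (again my own
illustration, not rsync code; md5 stands in for rsync's block digests):

```python
import hashlib
import os

def sparse_block_digests(path, block_size=65536, skip=4):
    """Hypothetical --checksum-block-skip=N behaviour: digest only every
    Nth block, seeking past the rest.  CPU and disk reads both drop by
    roughly a factor of N; the trade-off is that a change confined
    entirely to a skipped block goes undetected."""
    size = os.path.getsize(path)
    nblocks = (size + block_size - 1) // block_size
    digests = {}
    with open(path, "rb") as f:
        for block_no in range(0, nblocks, skip):
            f.seek(block_no * block_size)  # skip unchecksummed blocks
            digests[block_no] = hashlib.md5(f.read(block_size)).digest()
    return digests
```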

Option 3: Add an option, maybe --checksum-block-bytes=N, that would tell
rsync to only checksum the first N bytes of every block.  This would
probably be used with a very large --block-size.  This would be a hint
that the file should have no insertions or deletions of data, but only
in-place updates of large blocks, or possibly appended additions.  This
also would help disk-bound systems.
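The same kind of sketch for this option (hypothetical names and md5 again;
note how an in-place change that leaves a block's prefix intact would slip
through, which is exactly the trade-off being proposed):

```python
import hashlib
import os

def prefix_block_digests(path, block_size=1 << 20, prefix_bytes=4096):
    """Hypothetical --checksum-block-bytes=N behaviour: with a large
    block size, hash only the first N bytes of each block.  Only a small
    fraction of the file is read and hashed; updates that don't touch a
    block's prefix go undetected."""
    size = os.path.getsize(path)
    digests = []
    with open(path, "rb") as f:
        for offset in range(0, size, block_size):
            f.seek(offset)
            digests.append(hashlib.md5(f.read(prefix_bytes)).digest())
    return digests
```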

Option 4: Add an option, maybe --optimize-cpu or --weak-checksums, that
would tell rsync to only use weak checksums up until the point in the file
where the weak checksums first differ, and then fall back to the normal
weak and strong checksums from there on.  This is a hint that the file was
most likely appended to, but it would still catch most occurrences where a
file was modified.
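As a rough sketch of the weak-only phase, using zlib's adler32 as a cheap
stand-in for rsync's rolling checksum (my own illustration; a real
implementation would resume the normal weak+strong search from the first
mismatch):

```python
import zlib

def common_prefix_blocks(old_path, new_path, block_size=65536):
    """Hypothetical --weak-checksums behaviour: compare cheap weak
    checksums block by block and report how many leading blocks match.
    For an append-only file this covers the whole old file; on the
    first mismatch the expensive weak+strong search would take over."""
    matched = 0
    with open(old_path, "rb") as fo, open(new_path, "rb") as fn:
        while True:
            a, b = fo.read(block_size), fn.read(block_size)
            if not a or not b:
                break  # reached the end of one file
            if zlib.adler32(a) != zlib.adler32(b):
                break  # first difference; fall back to full search here
            matched += 1
    return matched
```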

All of these options might also benefit from another option that says to
only apply these optimizations to files over a certain size, or where the
automatic block size is over a certain size.

Obviously, these optimizations would all be aimed at systems with
comparatively low cpu power, but as average file sizes continue to grow,
they would also benefit much faster systems when used on very large
(several gigabytes and up) files.

In the process of testing this, I also found out that the ten-minute
timeout I had set on the receiver side wasn't sufficient.  So I was also
wondering if it would be possible to add an option to make rsync, when
used in daemon mode and not over another shell transport, use some form of
TCP keepalives during long-running operations.  This would let me reduce
the timeout to a smaller value like 2 minutes, but still not let the rsync
connection die as long as the remote system still had a "live" connection,
even when one end was waiting on the other through a very long operation
(like this long-running checksum) and there was no other connection
traffic.
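In socket terms, what I'm asking for is roughly this (sketched in Python;
the function and tunable values are my own, and the TCP_KEEP* constants
are Linux-specific, so plain SO_KEEPALIVE is the portable core of it):

```python
import socket

def enable_keepalive(sock, idle=60, interval=30, count=3):
    """Hypothetical daemon-mode behaviour: enable TCP keepalives on the
    rsync connection so a short --timeout isn't tripped while the peer
    is busy with a long checksum pass.  SO_KEEPALIVE is portable; the
    idle/interval/count tunables only exist on some platforms."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, val in (("TCP_KEEPIDLE", idle),
                      ("TCP_KEEPINTVL", interval),
                      ("TCP_KEEPCNT", count)):
        if hasattr(socket, name):  # skip on platforms without it
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), val)
```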

Thoughts?

Evan
