Improve --inplace updates on pathological inputs

Michael Chapman mike at very.puzzling.org
Sun Jul 28 21:15:22 MDT 2013


Hi,

I recently came across a situation where "rsync --inplace" performs very poorly. If both the source and destination files contain long sequences of identical blocks, but not necessarily in the same location, the sender can spend an inordinate amount of CPU time finding matching blocks.

In my case, I came across this problem while backing up multi-hundred-gigabyte MySQL database files. There could be periods of *hours* where the sender was not reading anything from disk or writing anything over the network.

Please consider the attached patch. It alleviates this problem by discarding hash table entries on the sender that can't possibly be used during an --inplace update.

I've included some test results below demonstrating the problem and the improvement this patch provides.

Regards,
Michael

------------------------------------------------------------

All tests were conducted with files generated as follows:

  $ perl -e 'print "\x00" x (10*1024*1024); print "\xff" x (10*1024*1024)' >a
  $ perl -e 'print "\xff" x (10*1024*1024); print "\x00" x (10*1024*1024)' >b
  $ md5sum a b
  07c84be14041575befb779ca6dee16ab  a
  fc8fc3324a22639ff61d063e28385962  b

In other words, "a" contains 10 MiB of 0x00 bytes followed by 10 MiB of 0xff bytes, while "b" contains 10 MiB of 0xff bytes followed by 10 MiB of 0x00 bytes.

Prior to each test, the filesystem was synced and the kernel page cache cleared (echo 3 >/proc/sys/vm/drop_caches).

Running "rsync --inplace" between these takes about a minute and a half, nearly all of it user CPU time:

  $ time ./rsync-unpatched -vv --no-whole-file --checksum --inplace a b
  delta-transmission enabled
  a
  total: matches=2291  hash_hits=10483476  false_alarms=0 data=10487904
  
  sent 10,499,720 bytes  received 27,564 bytes  113,808.48 bytes/sec
  total size is 20,971,520  speedup is 1.99
  
  real    1m32.170s
  user    1m31.756s
  sys     0m0.137s
  $ md5sum a b
  07c84be14041575befb779ca6dee16ab  a
  07c84be14041575befb779ca6dee16ab  b

With the patch, the run time drops from about 92 seconds to well under one second, and hash_hits falls from 10483476 to 2292:

  $ time ./rsync-patched -vv --no-whole-file --checksum --inplace a b
  delta-transmission enabled
  a
  total: matches=2291  hash_hits=2292  false_alarms=0 data=10487904
  
  sent 10,499,720 bytes  received 27,564 bytes  7,018,189.33 bytes/sec
  total size is 20,971,520  speedup is 1.99
  
  real    0m0.628s
  user    0m0.389s
  sys     0m0.043s
  $ md5sum a b
  07c84be14041575befb779ca6dee16ab  a
  07c84be14041575befb779ca6dee16ab  b
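
The gap between the two hash_hits figures can be illustrated with a toy model. This is only a sketch, not the code in the patch: the function name scan, the 4-byte block length, and the use of a block's raw bytes in place of its weak checksum are all made up for illustration, and rolling checksums and strong-checksum verification are left out. It assumes that, during an in-place update, a destination block whose offset lies below the sender's current position cannot be used, because the receiver may already have overwritten it.

```python
def scan(src, basis, block_len, prune):
    """Toy in-place delta scan; returns (matches, chain_steps)."""
    # Toy "hash table": a block's raw bytes stand in for its weak checksum
    # (an assumption for illustration; rsync uses a rolling checksum).
    table = {}
    for off in range(0, len(basis) - block_len + 1, block_len):
        table.setdefault(basis[off:off + block_len], []).append(off)

    matches = steps = 0
    pos = 0
    while pos + block_len <= len(src):
        chain = table.get(src[pos:pos + block_len], [])
        hit = None
        for off in chain:
            steps += 1                # one chain entry examined
            if off >= pos:            # assumed usable: not yet overwritten
                hit = off
                break
        if prune:
            # pos only ever grows, so entries below it can never match again.
            chain[:] = [off for off in chain if off >= pos]
        if hit is not None:
            matches += 1
            pos += block_len          # skip past the matched block
        else:
            pos += 1                  # no match: slide the window one byte
    return matches, steps


# Scaled-down versions of "a" and "b": 64 bytes per half instead of 10 MiB.
src = b"\x00" * 64 + b"\xff" * 64
basis = b"\xff" * 64 + b"\x00" * 64
print(scan(src, basis, 4, prune=False))   # (16, 992)
print(scan(src, basis, 4, prune=True))    # (16, 32)
```

In this model both runs find the same 16 matches. Without pruning, every byte position in the second half walks the whole chain of stale 0xff entries; with pruning, that chain is walked once and then stays empty.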

The behaviour of rsync without --inplace is hardly affected. Unpatched:

  $ time ./rsync-unpatched -vv --no-whole-file --checksum a b
  delta-transmission enabled
  a
  total: matches=4582  hash_hits=4582  false_alarms=0 data=4288
  
  sent 22,716 bytes  received 27,564 bytes  33,520.00 bytes/sec
  total size is 20,971,520  speedup is 417.09
  
  real    0m0.579s
  user    0m0.297s
  sys     0m0.053s
  $ md5sum a b
  07c84be14041575befb779ca6dee16ab  a
  07c84be14041575befb779ca6dee16ab  b

Patched:

  $ time ./rsync-patched -vv --no-whole-file --checksum a b
  delta-transmission enabled
  a
  total: matches=4582  hash_hits=4582  false_alarms=0 data=4288
  
  sent 22,715 bytes  received 27,564 bytes  33,519.33 bytes/sec
  total size is 20,971,520  speedup is 417.10
  
  real    0m0.598s
  user    0m0.314s
  sys     0m0.043s
  $ md5sum a b
  07c84be14041575befb779ca6dee16ab  a
  07c84be14041575befb779ca6dee16ab  b
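
As a cross-check on the statistics above, the block length can be recovered from the "matches" and "data" figures. The sketch below assumes that every match covers exactly one full block and that "data" counts the literal bytes sent; the helper name implied_block_len is made up for illustration. Only numbers that appear in the output above are used.

```python
TOTAL_SIZE = 20_971_520   # "total size is 20,971,520"

# (matches, data) pairs copied from the "total:" lines above.
RUNS = {
    "--inplace": (2291, 10_487_904),
    "without --inplace": (4582, 4_288),
}


def implied_block_len(total, matches, data):
    """Bytes covered by matches divided by the number of matches."""
    matched_bytes = total - data
    if matched_bytes % matches != 0:
        raise ValueError("matched bytes are not a whole number of blocks")
    return matched_bytes // matches


for name, (matches, data) in RUNS.items():
    print(name, implied_block_len(TOTAL_SIZE, matches, data))
# Both runs print 4576.
```

Both runs imply the same 4576-byte block. With --inplace only the 2291 blocks of the first half of "a" are matched, and the remaining 10,487,904 bytes are sent as literal data.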

------------------------------------------------------------

Michael Chapman (1):
  Discard unusable hash table entries

 match.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

-- 
1.8.3.1


