DO NOT REPLY [Bug 5482] look into a hierarchical checksum algorithm
that would help to efficiently transmit really large files
samba-bugs at samba.org
Sun May 25 14:31:52 GMT 2008
wayned at samba.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
        Summary|apply the rsync comparison  |look into a hierarchical
               |algorithm specially to .mov |checksum algorithm that
               |and .mp4 files              |would help to efficiently
               |                            |transmit really large files
------- Comment #6 from wayned at samba.org 2008-05-25 09:31 CST -------
I also think coding a special transfer algorithm for media files is not
something that would be appropriate for rsync (though someone may want to code
their own custom version if this is important to them).
A general hierarchical checksum would be quite interesting, but we'd need to
balance the costs of disk I/O and memory use, and avoid round-trip latency.
Rsync deals with latency by pipelining the checksum generation and file
reconstruction in separate processes, but each one requires a separate read of
the basis file. So, we'd want to make rsync still able to process more than
one active transfer and not do a read-read for each sub-level of checksums.
If rsync were made to send a first-level set of checksums for each file,
computing a certain number of sub-level checksums into memory, that might work
for the generator at the expense of holding the hierarchical sums in memory for
N active files at once (perhaps topping out after a certain selectable memory
limit was reached).
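As a rough sketch of what the generator might hold per file, here is a two-level
hierarchy of block sums. The block sizes, the use of md5, and all names are
illustrative assumptions, not rsync's actual checksum sizes or wire format:

```python
import hashlib

# Hypothetical block sizes, chosen only for illustration.
TOP_BLOCK = 1 << 20      # 1 MiB first-level blocks
SUB_BLOCK = 1 << 16      # 64 KiB sub-level blocks

def hierarchical_sums(data):
    """Return (top_sums, sub_sums), where sub_sums[i] holds the
    sub-block checksums covering first-level block i."""
    top, sub = [], []
    for off in range(0, len(data), TOP_BLOCK):
        block = data[off:off + TOP_BLOCK]
        top.append(hashlib.md5(block).hexdigest())
        sub.append([hashlib.md5(block[o:o + SUB_BLOCK]).hexdigest()
                    for o in range(0, len(block), SUB_BLOCK)])
    return top, sub

top, sub = hierarchical_sums(b"x" * (3 * TOP_BLOCK + 100))
```

A real implementation would only materialize the sub-level lists for the N
currently active files, evicting or recomputing them once the selectable memory
limit was hit.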
The biggest unknown to me would be how to make the sender work efficiently for
such an algorithm. The sender is the part that must compute a new checksum on
each character boundary (using a rolling checksum) to find relocation matches
for each block. Computing a rolling checksum using the current method would
require holding a huge amount of the file in memory at once (but that could be
changed to use mem-mapped files, potentially triggering re-reads as the window
slides forward). We'd also need to move away from the current bifurcated
checksum algorithm to a single, stronger rolling checksum to avoid having to
recompute the strong checksum over a huge amount of data to double-check a
match. Also, each successive level of transfer would likely trigger a re-read
of the part of the file that didn't match anything as we divide it into smaller
blocks.
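For reference, the per-byte sliding described above looks roughly like this
(a simplified Adler-style weak sum in the spirit of rsync's; the modulus and
function names here are illustrative, not rsync's exact code):

```python
M = 1 << 16  # illustrative modulus for each 16-bit half of the weak sum

def weak_parts(block):
    """Compute the two halves (a, b) of the weak checksum over a block."""
    a = b = 0
    n = len(block)
    for i, c in enumerate(block):
        a = (a + c) % M
        b = (b + (n - i) * c) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte forward in O(1):
    drop out_byte from the front, add in_byte at the back."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b
```

The O(1) roll is what lets the sender test a match at every byte offset; a
stronger checksum would still need the same cheap-slide property to avoid
recomputing from scratch at each position.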
We'd also want to limit how hierarchical the checksums are based on file size,
as smaller files just need the current method, while really large files might
benefit from several levels.
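Choosing the depth could be as simple as growing the per-level span by a fanout
factor until it covers the file (thresholds and names here are hypothetical,
not anything rsync does today):

```python
def checksum_levels(file_size, base_block=700, fanout=16, max_levels=4):
    """Pick how many checksum hierarchy levels a file warrants.
    Small files get 1 level (the current method); each extra level
    covers fanout times more data, capped at max_levels."""
    levels = 1
    span = base_block * fanout
    while file_size > span and levels < max_levels:
        levels += 1
        span *= fanout
    return levels
```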
We may even want to consider reversing the jobs of the sender and generator
to lighten the load on the server. This would slow things down latency-wise,
but could make it possible for a server to deal with more clients, especially
if this makes the per-byte processing more intensive.
Sounds like an intriguing idea to investigate further and see how feasible it
would be. Comments? Jamie, do you have code to share?
Configure bugmail: https://bugzilla.samba.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug, or are watching the QA contact.