DO NOT REPLY [Bug 5124] Parallelize the rsync run using multiple threads and/or connections

samba-bugs at samba.org samba-bugs at samba.org
Wed Oct 28 08:34:31 MDT 2009


https://bugzilla.samba.org/show_bug.cgi?id=5124


matt at mattmccutchen.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Lessons to learn from other |Parallelize the rsync run
                   |tools, better use of        |using multiple threads
                   |resources, speed gains      |and/or connections




------- Comment #3 from matt at mattmccutchen.net  2009-10-28 09:34 CST -------
A stab at a more meaningful summary, and some thoughts:

My first reaction to the suggestion to use multiple connections is that it's a
gimmick to get a higher total bandwidth allocation from routers that allocate
bandwidth per connection; IMO, that would not be an appropriate goal.  But
there's another more fundamental benefit, even if the total bandwidth were to
remain the same: loss of a single packet won't stall the rsync run because the
other connections can continue (at least for a while) without that packet.

But why stop at several streams?  Rsync could use datagrams (UDP) and just act
on packets as they arrive, so that loss of a packet doesn't affect /any/ of the
other packets.  The only drawback is that we have really good tooling for
working with streams (pipes, nc, port forwarding, TLS, etc.), while the tooling
for datagrams is nonexistent or less mature (there is Datagram TLS, but I've
never tried it).
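
To make "act on packets as they arrive" concrete, here is a minimal sketch in
Python.  It is not rsync code and does not reflect rsync's wire protocol: the
header layout (file id plus byte offset) and the names HEADER, make_datagrams
and apply_datagram are all assumptions made up for illustration.  The point is
only that a self-describing datagram can be applied without waiting for any
other datagram.

```python
import struct

# Hypothetical header: 32-bit file id, 64-bit byte offset (network byte order).
HEADER = struct.Struct("!IQ")


def make_datagrams(file_id, data, chunk=4):
    """Split data into datagrams that each carry their own file id and offset."""
    return [HEADER.pack(file_id, off) + data[off:off + chunk]
            for off in range(0, len(data), chunk)]


def apply_datagram(files, datagram):
    """Write one datagram's payload at its offset, whatever the arrival order."""
    file_id, off = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:]
    buf = files.setdefault(file_id, bytearray())
    end = off + len(payload)
    if len(buf) < end:
        # Grow the buffer; gaps stay zero-filled until their datagram arrives.
        buf.extend(b"\0" * (end - len(buf)))
    buf[off:end] = payload
    return off, len(payload)
```

If one datagram is lost, every other datagram is still applied; only the
missing range has to be requested again.  A real implementation would also
need retransmission, congestion control and integrity checks, which this
sketch leaves out.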

Rather than implement the UDP stuff ad hoc for rsync, I would like to see it
adopt an application-level scheduler that maintains a list of active tasks
(scanning a directory, transferring a file, etc.) and handles the rudiments of
accepting a packet and calling the appropriate routine to take the next step on
that task.  If the scheduler supported asynchronous I/O, rsync could use it to
dramatically cut the time spent blocked on I/O by letting the OS decide the
order in which to fulfill requests based on the actual layout of the files on
disk.
Once rsync exposes a set of available tasks to the scheduler, it becomes
trivial to vary the number of OS threads in which the tasks run.  This would be
awesome but is probably better pursued in a successor to rsync.
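
A rough sketch of the scheduler idea, in Python and under heavy assumptions:
the Scheduler class, its add/deliver methods and the receive_file task are
hypothetical names, not anything that exists in rsync.  Each task is a
generator that yields while waiting for its next packet, and the scheduler
keeps the list of active tasks and routes each incoming packet to the task it
belongs to.

```python
class Scheduler:
    """Keeps a table of active tasks and advances one per incoming packet."""

    def __init__(self):
        self.tasks = {}  # task id -> generator waiting for its next packet
        self.done = []   # (task id, result) in completion order

    def add(self, task_id, gen):
        try:
            next(gen)  # run the task up to its first wait point
        except StopIteration as stop:
            self.done.append((task_id, stop.value))
            return
        self.tasks[task_id] = gen

    def deliver(self, task_id, packet):
        """Hand a packet to its task; return False if no such task is active."""
        gen = self.tasks.get(task_id)
        if gen is None:
            return False
        try:
            gen.send(packet)  # take the next step on that task
        except StopIteration as stop:
            del self.tasks[task_id]
            self.done.append((task_id, stop.value))
        return True


def receive_file(expected_chunks):
    """Example task: collect numbered chunks in any order, then join them."""
    chunks = {}
    while len(chunks) < expected_chunks:
        index, payload = yield
        chunks[index] = payload
    return b"".join(chunks[i] for i in range(expected_chunks))
```

For example, after s.add("a", receive_file(2)), calling
s.deliver("a", (1, b"lo")) and then s.deliver("a", (0, b"hel")) completes task
"a" with the result b"hello".  Because the tasks only interact with the
scheduler through deliver(), the same task table could in principle be served
by one thread or several, though this sketch does no locking and is
single-threaded as written.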



