rsync issue with lots of small files

David A. Soussan das at dascomputerconsultants.com
Thu Mar 25 19:41:47 MDT 2010


(Sorry if this is a repost - didn't see my earlier post, so thinking it
got lost somehow.)

Wayne, et al:

First posting, and I'm a n00b with rSync. I was having issues with rSync
from a PC (3.0.7 either via DeltaCopy or cwRsync) to a Freenas box
(3.0.6). I've since replicated and isolated the problem attempting to
sync PC to PC with various flavors of rSync on both sides.

The executive summary is "daemon side processes are hanging up with 100%
cpu utilization when sending many small files."

I can replicate this 100% of the time.

What I've done:

Created 2 directories with 20K files in each, 17 bytes long numbered
1.txt through 20000.txt. This is the source sync directory.

Here was one of many iterations of copy lines, this one happened to use
DeltaCopy's version of rsync, but I can repeat it with versions of
cwRsync as well:

C:\DeltaCopy>rsync -rld -t --stats -v --timeout=360
"/cygdrive/e/SyncTest" sassy::Sync

System named Sassy is the target machine.

Here is one of his lines when server side started by hand:

C:\Program Files\ICW\Bin>rsync.exe --config ..\rsyncd.conf -vvvvvv
--daemon --detach  -vvvvvvvv --log-file=e:\temp\rsync.txt

Yes, I'm typing all those -vvvvv lines - trying to get some debug /
verbosity out of the server side, which is where I believe the root of
the problem is, and I'm not getting much in the way of debugging clues
from the server end.

I did some analysis of the network data with Wireshark on the client
side.

Started @ 12 seconds into the capture
@ 13 seconds, started sending the file list.
@ 27 seconds, file data starts flowing
@ 41 seconds, I'm seeing the first TCP ZeroWindow on the sender, which
frees up and stops a few times between this and the next line as the
socket is full or empties on the server side.
@ 122 seconds, the zero window happens and this time it no longer frees
up.

The client's "-v" output screen shows it is sending file 13159.txt and
is frozen until the timeout or I hit ^C.

Meanwhile, back at the server side ...

I'm monitoring the rsync server with ProcessMonitor from
SysInternals.com (now Microsoft), so I can see all the file opens,
reads, writes, etc.

The last file written was 10819.txt, so he wrote 913 total files. Next
time I'll zero-fill my file names so the counts are accurate. Sorted in
order, they go 1.txt, 10.txt, 100.txt, 1000.txt, 10000.txt, 10001.txt,
..., but that is why 10819 really means 913 files.

The process on the server side is spinning and taking up as much CPU as
it can. I tried forcing affinity in case this was a deadlock type issue,
but that didn't fix the problem. I think one version of rSync had both
spinning, and the one I'm looking at right now had only one of the
processes spinning, but I can't say for sure.

I can replicate this failure anytime and believe the lab setup
replicates the failure in the production network. --bwlimit=<various
numbers> did not improve the situation. The lowest was --bwlimit=10 with
no change other than where the hang happens.

Lab has been tried at 100 Mb/s, 1 Gb/s, and I'm very confident I don't
have cable / network / switch issues as I'm seeing the same failure in
both locations with compeltely different hardware in each world.

I can get a stack trace from ProcessExplorer, and all I can tell is the
thread is "WaitForSingleObject" but not much else.

Stack trace (from Process Explorer) of the spinning thread:

ntkrnlpa.exe+0x8db2e
ntkrnlpa.exe+0x29a82
ntkrnlpa.exe+0x331a4
ntkrnlpa.exe+0x312a9
ntkrnlpa.exe+0x28c73
ntkrnlpa.exe+0x29c72
ntkrnlpa.exe+0x138d10
ntkrnlpa.exe+0x897cc
ntdll.dll!KiFastSystemCallRet
kernel32.dll!WaitForSingleObject+0x12
cygwin1.dll!cygwin_internal+0x25f3d

I'd get more debug out of the server side if the -vvvv options would
work, but they don't seem to on the server side nor can I find a way to
have the client pass more -vvvv onto the server and get more debug.

Timeout or ^C will eventually kill the client, and the server's child
processes do stop when I kill the client, but the logs on the server
aren't chatty:

Rsync.log on server:

[--snip--]
2010/03/25 13:03:59 [3912] connect from uber.dascc2.local
(192.168.2.148)
[I sopped the server with ^C]
2010/03/25 13:23:07 [5500] [Receiver] _exit_cleanup(code=20,
file=rsync.c, line=543): entered
2010/03/25 13:23:07 [5500] rsync error: received SIGINT, SIGTERM, or
SIGHUP (code 20) at rsync.c(543) [Receiver=3.0.7]
2010/03/25 13:23:07 [5500] [Receiver] _exit_cleanup(code=20,
file=rsync.c, line=543): about to call exit(20)

Rsyncd.log on server:
[--snip--]
2010/03/25 13:05:35 [3912] recv uber.dascc2.local [192.168.2.148] Sync
() SyncTest/11774.txt 17
2010/03/25 13:05:36 [3912] recv uber.dascc2.local [192.168.2.148] Sync
() SyncTest/11775.txt 17
2010/03/25 13:05:36 [3912] recv uber.dascc2.local [192.168.2.148] Sync
() SyncTest/11776.txt 17
[I stopped the client]
2010/03/25 13:22:39 [3912] rsync: writefd_unbuffered failed to write
4092 bytes to socket [generator]: Connection reset by peer (104)
2010/03/25 13:22:39 [3912] rsync error: error in rsync protocol data
stream (code 12) at io.c(1539) [generator=3.0.7]


I looked for this issue in the current bug list and found nothing.

Short of setting up a development environment and compiling all the
source to an exe so I can run the server side in the debugger, can you
tell me what I might do to get more debug out of the server side and
thus be able to give more details on this issue?

Thanks!
-----
David Soussan
das at dascomputerconsultants.com


More information about the rsync mailing list