Odd behavior

Erich Weiler weiler at soe.ucsc.edu
Thu Apr 22 16:30:36 MDT 2010


Well, I solved this problem myself, it seems.  It was not an rsync 
problem, per se, but it's interesting anyway on big filesystems like 
this, so I'll outline what went down:

Because my rsyncs were mostly just statting millions of files very 
quickly, RAM filled up with inode cache.  At a certain point, the kernel 
stopped allowing new cache entries to be added to the slab memory it had 
been using, and was slow to reclaim memory on old, clean inode cache 
entries.  So it basically slowed the machine's I/O to a crawl.

Slab memory can be checked by looking at the /proc/meminfo file.  If you 
see that slab memory is using up a fair portion of your total memory, 
run the 'slabtop' program to see the top offenders.  In my case, it was 
the filesystem that was screwing me (by way of the kernel).
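
For the curious, a quick way to eyeball this (slabtop ships with the 
procps package on CentOS; exact flags may vary elsewhere):

# grep Slab /proc/meminfo
# slabtop -o -s c | head -20

The first shows total slab usage; the second prints a one-shot snapshot 
of the slab caches sorted by size, biggest first.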

I was able to speed up the reclaiming of clean, unused inode cache 
entries by tweaking this in the kernel:

# sysctl -w vm.vfs_cache_pressure=10000

The default value is 100; higher values make the kernel release dentry 
and inode cache memory more aggressively.  It helped, but my rsyncs were 
still generating cache entries faster than the kernel could reclaim 
them, so in the end it didn't help that much.  What really fixed it was 
this:

# echo 3 > /proc/sys/vm/drop_caches

That immediately drops the page cache plus every clean, unused dentry 
and inode entry from slab memory.  When I did that, memory usage dropped 
from 35GB to 500MB, my rsyncs fired themselves up again magically, and 
the computer was responsive again.
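
For reference, the kernel's vm documentation describes the values as: 1 
drops the page cache, 2 drops the dentries and inodes, 3 drops both.  
Only clean, unused entries can be dropped, so running sync first flushes 
dirty pages and lets it reclaim as much as possible:

# sync
# echo 3 > /proc/sys/vm/drop_caches

Nothing dirty is thrown away, so it's non-destructive; the only cost is 
re-warming the caches afterwards.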

Slab memory began to fill up again of course, as the rsyncs were still 
going.  But slowly.  For this edge case, I'm just going to configure a 
cron job to flush the dentry/inode cache every five minutes or so.  But 
things look much better now!
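
Something along these lines in root's crontab should do it (just an 
illustration, tune the interval to taste):

*/5 * * * * sync; echo 3 > /proc/sys/vm/drop_caches

And if the vfs_cache_pressure tweak earns its keep, it can be made to 
survive reboots by adding "vm.vfs_cache_pressure = 10000" to 
/etc/sysctl.conf.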

A word of warning for folks rsyncing HUGE numbers of files under Linux.  ;)

As a side note, Solaris does not seem to have this problem, presumably 
because the kernel handles inode/dentry caching in a different way.

-erich

Erich Weiler wrote:
> Hi Y'all,
> 
> I'm seeing some interesting behavior that I was hoping someone could 
> shed some light on.  Basically I'm trying to rsync a lot of files, in a 
> series of about 60 rsyncs, from one server to another.  There are about 
> 160 million files.  I'm running 3 rsyncs concurrently to increase the 
> speed, and as each one finishes, another starts, until all 60 are done.
> 
> The machine I'm initiating the rsyncs on has 48GB RAM.  This is CentOS 
> linux 5.4, kernel revision 2.6.18-164.15.1.el5.  Rsync version 3.0.5 (on 
> both sides).
> 
> I was able to rsync all the data over to the new machine.  But, because 
> there was so much data, I need to run the rsyncs again to catch data 
> that changed during the last rsync run.  It sort of hangs midway through.
> 
> What happens is that as the rsyncs run, the memory usage on the machine 
> slowly creeps up, using quite a bit of RAM, which is odd because I 
> thought the rsyncs were counting files incrementally, to reduce RAM 
> impact.  But, looking at top, the rsync processes aren't using much RAM 
> at all:
> 
> top - 12:22:10 up 1 day, 27 min,  1 user,  load average: 46.85, 46.37, 44.97
> Tasks: 309 total,   8 running, 301 sleeping,   0 stopped,   0 zombie
> Cpu(s):  1.0%us, 13.8%sy,  0.0%ni, 84.9%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Mem:  49435196k total, 34842524k used, 14592672k free,   141748k buffers
> Swap: 10241428k total,        0k used, 10241428k free,    49428k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  7351 root      25   0 19892 9.8m  844 R 100.1  0.0 552:58.55 rsync
>  9084 root      16   0 13108 2904  820 R 100.1  0.0 299:24.59 rsync
>  4759 root       0 -20 1447m  94m  15m S 29.9  0.2 667:34.21 mmfsd
>  9539 root      16   0 30136  19m  820 R  6.3  0.0   6:29.28 rsync
>  9540 root      15   0  271m  46m  260 S  0.3  0.1   0:12.13 rsync
> 10047 root      15   0 10992 1212  768 R  0.3  0.0   0:00.01 top
>     1 root      15   0 10348  700  592 S  0.0  0.0   0:02.15 init
> ...etc...
> 
> But nevertheless, 34GB RAM is in use.  But what really kills things is 
> that at some point, each rsync all of a sudden ramps up to 100% CPU 
> usage, and all activity for that rsync essentially stops.  In the 
> above example, 2 of the 3 rsyncs are in that 100% CPU state, while the 
> third rsync is only at 6.3%, but that is the one actually doing 
> something.  In some cases all 3 rsyncs get to 100% and they all stall: 
> there is no network traffic on the NIC at all and they make no progress.
> 
> Now mostly what they are doing is counting files, since most of the 
> files are the same on both sides, but there are just so many files (160 
> million).  I don't seem to be out of memory, but I don't know why rsync 
> would go to 100% CPU and just stall.
> 
> I am rsyncing from an rsync server to my local server, with commands 
> similar to this:
> 
> rsync -a --delete rsync://encodek-0-4/data/genomes/ /hive/data/genomes/
> 
> Again, both sides at version 3.0.5.  Nothing fancy or special.  I have 
> confirmed that it does count the files incrementally by running a few 
> manually; it does report "getting incremental file list...".
> 
> Any ideas why the processes go to 100% CPU and then stall?  I should 
> also note that the initial run of rsyncs, where it was actually copying 
> a ton of data, did not seem to have this problem, but now that the data 
> is there and I'm rsyncing again, it seems to have this problem.  Is it 
> somehow related to the fact that it is mostly comparing a ton of files 
> very quickly but not actually copying many of them?
> 
> Thanks for any ideas!
> 
> -erich
> 

