rsync oldest files first

nate rsync at linuxpowered.net
Sun Feb 8 01:01:12 GMT 2009


Hello -

Running rsync v3.0.5 on a mixture of CentOS 4.6 and 5.1 systems,
using hpnssh as the transport mechanism.

I am using rsync to replicate roughly a TB worth of compressed log
data per day from a bunch of systems for processing.

Every hour the systems generate log files, compress them, and
then rsync pushes them out to a centralized set of redundant
hosts whose storage is a clustered NFS file store.

Every hour, back-end systems also pull these logs and process
them. In total there are roughly 210,000 files to be downloaded
per day at the moment.

It seems that once or twice a week I get a corrupted file (at
least for a while), which I would not expect, because as far
as I can tell rsync doesn't rename the file until the copy has
completed successfully. Eventually rsync does recover on its
own, but not before the back-end process tries to grab the
corrupted file and fails; usually the team of people that
watch this process copies the file by hand before rsync has a
chance to catch up.

But the main reason for my post is that rsync does not appear
to be obeying the file order I am specifying in my command.
In the event of a failed transfer I want rsync to transfer
the oldest files first so that the back-end systems don't
get backed up. But even though the file list I provide to
rsync has the files in a specific order, rsync seems to
rearrange them anyway into alphabetical order.

My rsync script kills any running copies of rsync before
running again, both to make sure it's the only one going and
to make sure that rsync doesn't get stuck for some reason.
The scripts run twice an hour on each system (half of this
data is being pushed from the east coast of the U.S. to the
west coast).
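
For clarity, the kill-and-restart part of the wrapper boils
down to something like this (a simplified sketch, not the
script verbatim):

    pkill -x rsync      # abort any rsync left over from the previous run
    sleep 2             # give it a moment to settle
    # ... rebuild the file list, then launch the rsync command shown below ...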

An example from today: at about 12:15 PM rsync was running
and copying a specific file when it was aborted by the newest
run of the rsync script, which then proceeded to copy files
that were *just* generated in the past 5 minutes instead of
the files generated in the previous hour. That script ran
until about 12:45, when it was killed again and a new copy of
the script started; it too copied files from the latest hour
rather than the older files first. That one was killed at
about 1:15, a new job kicked off, and it was finally able to
catch everything up by 1:30. It would have caught up sooner,
but we were having problems with an ISP that was causing
throughput to be much lower than normal.

I build the file list with a simple ls -ltr | awk '{print $9}'.
I keep quite a bit of logging/debug information, so I can
confirm that the file lists rsync was given were ordered
correctly, and that the files rsync actually transferred
were in alphabetical order, not in the order listed in the
file listing.
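
(For what it's worth, an equivalent oldest-first list can be
built without parsing ls output - assuming GNU find, and
assuming the compressed logs end in .gz - with:

    find /path/to/source -type f -name '*.gz' -printf '%T@ %P\n' \
        | sort -n | cut -d' ' -f2- \
        > rsync_log_file_list.$(date +%Y%m%d_%H%M%S)

%P yields paths relative to /path/to/source, which is what
--files-from expects. Either way, the list is ordered oldest
first before rsync ever sees it.)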

The problem is really twofold: there seems to be some sort
of issue (bug?) in rsync where under certain circumstances it
will rename one of its temporary "dot" files to the real
file name even though the file hasn't been successfully
copied; but more importantly, I'd like to transfer the
oldest files first, either via an include/files-from list or
some other means. Given the large number of files I would
prefer not to execute a separate rsync process for every file!

I've tested several times copying a file and aborting it
mid-copy, and rsync has never renamed it; the temporary
dot file is left behind (which is what I expect). I've been
using rsync for years, but this is by far the biggest
implementation I've done with it.

Here is a sample of the command I am using (line breaks
added for readability):

ssh 10.254.213.203 "mkdir -p /path/to/dest/" && \
rsync -ae "/usr/bin/hpnssh -v -o TcpRcvBufPoll=yes -o NoneEnabled=yes -o NoneSwitch=yes" \
    --timeout=600 --partial \
    --log-format="[%p] %t %o %f (%l/%b)" \
    --files-from=/home/logrsync/conf/rsync_log_file_list.20090207_124642 \
    /path/to/source 10.254.213.203:/path/to/dest \
    1>>/home/logrsync/logs/server_name_rsync_log_transfer_20090207_124642.log \
    2>&1
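
If rsync really is sorting the list internally, one middle
ground between a single rsync and one rsync per file would be
to split the oldest-first list into chunks and run one rsync
per chunk in sequence. A rough sketch (untested; 500 files
per chunk is an arbitrary choice):

    LIST=/home/logrsync/conf/rsync_log_file_list.20090207_124642
    split -l 500 "$LIST" "$LIST".chunk.      # 500 files per rsync run
    for chunk in "$LIST".chunk.*; do
        rsync -ae "/usr/bin/hpnssh -o TcpRcvBufPoll=yes -o NoneEnabled=yes -o NoneSwitch=yes" \
            --timeout=600 --partial --files-from="$chunk" \
            /path/to/source 10.254.213.203:/path/to/dest || break
    done

rsync could still reorder within a chunk, but across chunks
the oldest files would always go first.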

Side note - does anyone have numbers for running rsync over
ssh over a WAN? Even with several hundred megabits of
bandwidth available on each side, it seems each file copy
most often caps out at about 700 kB/s with hpnssh (lower with
normal ssh); latency is about 80-90 ms between the sites.
There are about 45 servers, so we still get good performance
in aggregate, but it'd be nice to get better per-server
performance as well if possible. I think hpnssh is the right
approach, with its auto-tuning and such, but my expectations
were for higher throughput than I ended up getting.
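
As a back-of-envelope check, a single stream can't go faster
than (window size / RTT), and

    700 kB/s * 0.085 s ~= 60 KB

which is suspiciously close to a 64 KB window - consistent
with some buffer in the path (TCP or the ssh channel) still
being pinned at 64 KB despite hpnssh. That's speculation on
my part, though.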

And it seems that even 5% packet loss absolutely kills
performance - throughput can drop 75-90%. That is the problem
one of my ISPs seems to be having.
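
For what it's worth, the usual rough model for TCP under loss
(Mathis et al.) is throughput ~= MSS / (RTT * sqrt(loss)), and
it lines up with what I'm seeing. With a 1460-byte MSS and
85 ms RTT:

    at 0.1% loss:  1460 / (0.085 * sqrt(0.001)) ~= 540 kB/s
    at 5% loss:    1460 / (0.085 * sqrt(0.05))  ~=  77 kB/s

i.e. roughly an 85% drop, squarely in that 75-90% range.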

thanks

nate


