vanilla rsync 3.0.9 hangs after transferring ~2000 files

Christian Iversen ci at meebox.net
Thu Nov 8 15:54:52 MST 2012


On 2012-11-02 19:32, Justin T Pryzby wrote:
> On Fri, Nov 02, 2012 at 02:33:13PM +0100, Christian Iversen wrote:
>> However, 1 server is giving me a lot of trouble. It has a directory
>> with (currently) 734088 files in it, and every time I try to backup
>> this dir, rsync hangs after transferring roughly 2000 files.
>> Sometimes it's around 1800, sometimes it's over 2100 (I think), but
>> it's in that ballbark.
>
> Is that using rsync:// or rsync/ssh ?  What is the commandline?

Wow, I can't believe I forgot the commandline :D

It's this:

/rsync -v -rlpt --specials --partial --chmod=Du+rwx --chmod=Fu+r 
--rsh='ssh -F /etc/spye/backup/main/ssh/config -i 
/etc/spye/backup/main/ssh/private.key -o 
UserKnownHostsFile=/etc/spye/backup/main/ssh/known_hosts' 
--files-from=/etc/spye/backup/main/include.list 
--exclude-from=/etc/spye/backup/main/exclude.list 
--link-dest=$LAST_BACKUP_DIR / $USER@$SERVER:$TARGET

> There are probably two rsync processes on the server.  Could you
> strace both?  How much vram are they using?  ps e -O vsz -C rsync

That was a bit tricky, since the whole process hangs shortly after being 
started (the files are not that big). But I managed to strace -f -p 
$PID1 -p $PID2 the whole thing.

And this is SUPER strange. I expected it to hang soon after starting 
(like it did the first couple of times), but now with strace, it seems 
to keep running. Slower, of course, but it seems to be running. Is there 
potential for a race condition here? Or does it simply take longer to 
get there, because of the tracing?

Anyway, it eventually DID hang, leaving this strace:

[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999994})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999994})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999995})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999997})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [1], [1], {60, 0}) = 1 (in [5], left {59, 
999996})
[pid 27310] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {59, 
999997})
[pid 27310] read(5, 
"\4\0\0kA*\0\0\4\0\0kB*\0\0\4\0\0kC*\0\0\4\0\0kD*\0\0"..., 8184) = 208
[pid 27310] select(6, [5], [1], [1], {60, 0}


That {59, 99999x} looks highly suspicious. rsync wouldn't happen to hold 
open 100000 file handles, would it? Or something to that effect?

stats with ps e:


27310 108504 T ?        00:00:18 /bin/rsync --server ...
27311 117472 S ?        00:00:03 /bin/rsync --server ...

(server has 6GB of available RAM more, so not pressed for memory exactly..)

> What filesystem is it?  Can you create a tarball of the directory?

Tarball could be problematic. It's 8.6GB of data spread over 734088 
files in a single dir. Yes, that many. Yes, I know it's insane. No, I 
didn't choose it :)

> time c /path/to/dir |wc

What's "c"?

I took a guess:

$ time find usrPictures | wc -l
734088
find usrPictures  0,36s user 0,62s system 91% cpu 1,057 total
wc -l  0,02s user 0,06s system 8% cpu 1,056 total

-- 
De bedste hilsner,

Christian Iversen
Systemadministrator, Meebox.net

-------
Denne e-mail kan indeholde fortrolige
oplysninger. Er du ikke den rette modtager,
bedes du returnere og slette denne e-mail.
-------


More information about the rsync mailing list