vanilla rsync 3.0.9 hangs after transferring ~2000 files

Christian Iversen ci at meebox.net
Thu Nov 8 16:42:08 MST 2012


On 2012-11-09 00:14, Justin T Pryzby wrote:
> On Thu, Nov 08, 2012 at 11:54:52PM +0100, Christian Iversen wrote:
>> On 2012-11-02 19:32, Justin T Pryzby wrote:
>>> On Fri, Nov 02, 2012 at 02:33:13PM +0100, Christian Iversen wrote:
>>>> However, 1 server is giving me a lot of trouble. It has a directory
>>>> with (currently) 734088 files in it, and every time I try to backup
>>>> this dir, rsync hangs after transferring roughly 2000 files.
>>>> Sometimes it's around 1800, sometimes it's over 2100 (I think), but
>>>> it's in that ballpark.
> 
>> That was a bit tricky, since the whole process hangs shortly after
>> being started (the files are not that big). But I managed to strace
>> -f -p $PID1 -p $PID2 the whole thing.
> Do you know you can strace something while starting it?
> strace -f rsync -avz /path/dir $dest 2>/strace/rsync

Yes, very much so. But you asked for server straces, not client ones, so... :-)
 
>> [pid 27310] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {59,
>> 999997})
>> [pid 27310] read(5,
>> "\4\0\0kA*\0\0\4\0\0kB*\0\0\4\0\0kC*\0\0\4\0\0kD*\0\0"..., 8184) =
>> 208
>> [pid 27310] select(6, [5], [1], [1], {60, 0}
>>
>>
>> That {59, 99999x} looks highly suspicious. rsync wouldn't happen to
>> hold open 100000 file handles, would it? Or something to that
>> effect?
> 
> No, that means FD#5 has data waiting; and, of the 60-second threshold
> given to "select", 59.999997 seconds remain (it returned almost
> immediately).

-ENOSLEEP. Of course.
 
> BTW, you can check file descriptors in /proc/27310/fd/ (and, in recent
> kernels, fdinfo/ has even more).  BTW, symlinks in /proc are
> "special".  Eg. if a process opens /file as FD#3, and then the user
> "unlinks" that file, and creates a new file with the same name (but
> different inode), referencing the pathname will access the new file.
> But, /proc/pid/fd/3 will access the old file.

I know about /proc/*/fd, but I didn't think to take a look at open files. Will do next time :)

> Could you strace the rsync client, too?

Yup (I added newlines for clarity):

[pid 24969] read(4, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\376\0;CREATOR:"..., 2427) = 2427
[pid 24969] select(6, NULL, [5], [5], {60, 0}) = 1 (out [5], left {59, 999997})
[pid 24969] write(5, "\374\17\0\7", 4)  = 4
[pid 24969] select(6, NULL, [5], [5], {60, 0}) = 1 (out [5], left {59, 999998})
[pid 24969] write(5, "GHIJSTUVWXYZcdefghijstuvwxyz\203\204\205\206"..., 4092) = 4092
[pid 24969] close(4)                    = 0
[pid 24969] open("data/www/virtual/beta.ageforce.dk/usrPictures/10365-90x80-OnlineModerator.jpg", O_RDONLY) = 4
[pid 24969] fstat(4, {st_mode=S_IFREG|0644, st_size=1702, ...}) = 0
[pid 24969] write(1, "data/www/virtual/beta.ageforce.d"..., 78data/www/virtual/beta.ageforce.dk/usrPictures/10365-90x80-OnlineModerator.jpg
) = 78


[pid 24969] read(4, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\376\0;CREATOR:"..., 1702) = 1702
[pid 24969] close(4)                    = 0
[pid 24969] open("data/www/virtual/beta.ageforce.dk/usrPictures/10365-90x80-OnlineModeratorImage.jpg", O_RDONLY) = 4
[pid 24969] fstat(4, {st_mode=S_IFREG|0644, st_size=2517, ...}) = 0
[pid 24969] write(1, "data/www/virtual/beta.ageforce.d"..., 83data/www/virtual/beta.ageforce.dk/usrPictures/10365-90x80-OnlineModeratorImage.jpg
) = 83


[pid 24969] read(4, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\376\0;CREATOR:"..., 2517) = 2517
[pid 24969] select(6, NULL, [5], [5], {60, 0}) = 1 (out [5], left {59, 999997})
[pid 24969] write(5, "\374\17\0\7", 4)  = 4
[pid 24969] select(6, NULL, [5], [5], {60, 0}) = 1 (out [5], left {59, 999998})
[pid 24969] write(5, "R\321\360$3br\202\t\n\26\27\30\31\32%&'()*456789:CDEF"..., 4092) = 4092
[pid 24969] close(4)                    = 0
[pid 24969] open("data/www/virtual/beta.ageforce.dk/usrPictures/10366-160x168-Offline.jpg", O_RDONLY) = 4
[pid 24969] fstat(4, {st_mode=S_IFREG|0644, st_size=2712, ...}) = 0
[pid 24969] write(1, "data/www/virtual/beta.ageforce.d"..., 72data/www/virtual/beta.ageforce.dk/usrPictures/10366-160x168-Offline.jpg
) = 72


[pid 24969] read(4, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\376\0;CREATOR:"..., 2712) = 2712
[pid 24969] select(6, NULL, [5], [5], {60, 0}) = 0 (Timeout)
[pid 24969] select(6, NULL, [5], [5], {60, 0}


Seems it just stops. Wonder what's going on here?

>>> What filesystem is it?  Can you create a tarball of the directory?
>>
>> Tarball could be problematic. It's 8.6GB of data spread over 734088
>> files in a single dir. Yes, that many. Yes, I know it's insane. No,
>> I didn't choose it :)
>>
>>> time c /path/to/dir |wc
>>
>> What's "c"?
> Oops, I meant:
> 
> time tar c /path/to/dir |wc
> 
> That creates a tarball piped to wc.  If a tarball is problematic, then
> rsync may be a problem, too.  But it seems you were able to run "find"
> on that dir just fine.

Yeah, and du. So I can read the inodes at least.

> What filesystem is that?  I believe "huge dirs" is something *some*
> filesystems specifically intend to handle gracefully, but some
> filesystems handle poorly.  Some programs may handle it poorly, too :)
> Could you also send the output of df -i for that partition?

It's vanilla ext4, with:

$ df -i /data

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/storage-data
                     16384000 1475262 14908738   10% /data

and:

$ file /dev/mapper/storage-data
/dev/mapper/storage-data: Linux rev 1.0 ext4 filesystem data, UUID=8b748eca-bdfe-4165-b7b3-debfb9463366 (needs journal recovery) (extents) (large files) (huge files)


-- 
Best regards,

Christian Iversen
Systemadministrator, Meebox.net

-------
This e-mail may contain confidential
information. If you are not the intended
recipient, please return and delete this e-mail.
------- 

