Best organizing hundreds of thousands of files for rsync and find

Cristian Bichis cristi at imagis.ro
Thu Mar 28 00:57:16 MDT 2013


Hi Kevin,

Thank you for your answer; some clarifications below:

The filesystem involved is ext4 on Debian 6.0.x, mounted with noatime.

My issue occurred when I made the following change at the last (deepest) 
level of the hierarchy:
1. Previously, each deepest-level folder held about 75,000 files => 
performance was OK.
2. We changed the deepest folder(s) so that each now contains about 
25,000 subfolders, each holding roughly 3 files (sketched below) => 
considerable performance degradation.
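
Roughly, the change looked like this (the folder and file names are 
made up for illustration):

    Before:  .../deepest/            (~75,000 files directly inside)
                 file_000001 ... file_075000

    After:   .../deepest/            (~25,000 subfolders)
                 sub_00001/  (~3 files)
                 sub_00002/  (~3 files)
                 ...
                 sub_25000/  (~3 files)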

Of course, I don't want to simply go back to the layout we used before; 
rather, I want to plan ahead for future growth.

Rsync v3 is on both ends (different machines for source and target) and 
is run with the -avp --delete options only.
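
For reference, the invocation is essentially the following (the paths 
and hostname are placeholders):

    rsync -avp --delete /data/files/ backup-host:/data/files/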

Cristian

> This isn't an easy question to answer.  Meaning I can't just give you
> numbers.
>
> The first question depends on what OS you are on, what filesystem is
> involved, and what filesystem options are in use.  For instance, if you
> were on an older Linux system using ext3, I would say that anything
> more than 10k files in a directory is going to have horrible
> performance consequences.  But a modern Linux system using ext4 could
> probably handle 100k files in a directory with less drag, especially
> if you have atime disabled.
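>
> For what it's worth, one thing to check on ext3/ext4 is that the
> hashed-directory feature (dir_index) is enabled; something like this
> should show it (the device name is just an example):
>
>     tune2fs -l /dev/sda1 | grep dir_index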
>
> But that is just filesystem talk.  We are talking about rsync here...
>
> As far as rsync is concerned, the main thing is to not require rsync to
> keep the entire tree in memory.  That means making sure you have rsync
> v3 on both ends.  It also means not using any of the options that the
> rsync man page (in the --recursive section) lists as disabling
> incremental recursion.
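>
> If I remember the man page correctly, the options in question include
> --delete-before, --delete-after, --prune-empty-dirs and --delay-updates.
> A plain --delete against a 3.x receiver defaults to --delete-during,
> which keeps the incremental scan, e.g. (paths are placeholders):
>
>     rsync -avp --delete /src/ host:/dst/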
>
> Aside from that, rsyncing 100 million files means 200 million calls to
> stat() (hopefully running in parallel on 2 systems).  This will take
> time.  Even if there is very little change it will take rsync time to
> determine that.  Plan accordingly.
>
> Unfortunately this need for lots of stat() calls is not limited to
> rsync.  Any file based utility will do at least that much work.  At
> least rsync is using stat to figure out what it doesn't have to
> actually copy.
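>
> A cheap way to get a feel for how long the scan alone takes is a dry
> run with the same options; it still walks (and stats) both trees but
> copies nothing, e.g. (paths are placeholders):
>
>     time rsync -avpn --delete /data/files/ backup-host:/data/files/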
>
> On 03/28/13 01:30, Cristian Bichis wrote:
>> Hi,
>>
>> I need to organize about 100 million small files (and the number
>> keeps growing) on a server that has to be copied to another server.
>>
>> I am wondering how many files it is recommended to keep in a folder
>> for optimal performance? Also, if I have a folder containing only
>> subfolders (no files), how many subfolders is it recommended to
>> have?
>>
>> The question also applies to the "find" command, not just to rsync,
>> as I am doing some cleanups using find (or a for loop combined with
>> find).
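>>
>> For the cleanups I mean something along the lines of the following
>> (the path and age threshold are just examples):
>>
>>     find /data/files -type f -mtime +30 -delete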
>>
>>
>> I made a mistake earlier and greatly increased the number of
>> subfolders (with just a few files in each of them), and rsync
>> performance decreased considerably. It was a mistake which I will
>> try to correct.
>>
>> So now, as the number of files is constantly increasing, I need to
>> find a long-term solution to correct the current issues.
>>
>> Cristian
>>
>>
> -- 
> ~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~
> 	Kevin Korb			Phone:    (407) 252-6853
> 	Systems Administrator		Internet:
> 	FutureQuest, Inc.		Kevin at FutureQuest.net  (work)
> 	Orlando, Florida		kmk at sanitarium.net (personal)
> 	Web page:			http://www.sanitarium.net/
> 	PGP public key available on web site.
> ~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~


