Best way of organizing hundreds of thousands of files for rsync and find

Kevin Korb kmk at sanitarium.net
Wed Mar 27 23:39:59 MDT 2013


This isn't an easy question to answer, meaning I can't just give you
numbers.

The first question depends on what OS you are on, what filesystem is
involved, and what filesystem options are in use.  For instance, if you
were on an older Linux system using ext3, I would say that anything
more than 10k files in a directory is going to have horrible
performance consequences.  But a modern Linux system using ext4 could
probably handle 100k files in a directory with much less of a slowdown,
especially if you have atime updates disabled (for example, by
mounting with noatime).
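
To see where an existing tree stands against thresholds like these,
one option is to count entries per directory.  The following is a
minimal Python sketch, not a tuned tool: the function name
crowded_dirs and the default limit of 10,000 (borrowed from the ext3
figure above) are illustrative assumptions, and the right limit
depends on your filesystem.

```python
import os


def crowded_dirs(root, limit=10_000):
    """Yield (path, entry_count) for every directory under root that
    holds more than `limit` direct entries (files plus subdirectories)."""
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > limit:
            yield dirpath, count


# Example use (the path is a placeholder):
# for path, count in crowded_dirs("/srv/data"):
#     print(f"{count:>10}  {path}")
```

Note that this walk itself has to read every directory, so on a tree
of this size it is worth running once and keeping the output.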

But that is just filesystem talk.  We are talking about rsync here...

As far as rsync is concerned, the main thing is to not require rsync to
keep the entire tree in memory.  That means making sure you have rsync
v3 on both ends.  It also means not using any of the options that are
listed in man rsync in the --recursive section as disabling
incremental recursion.
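
As a rough illustration, a wrapper script could refuse to run rsync
with any of those options.  The sketch below assumes the list given
in the --recursive section of the rsync 3 man page (--delete-before,
--delete-after, --prune-empty-dirs, --delay-updates); check your own
man page, since the list may differ between versions.  The function
name and the paths in the example command are made up.

```python
# Options the rsync 3 man page (--recursive section) lists as
# disabling incremental recursion.  Verify against your own version.
NON_INCREMENTAL = {
    "--delete-before",
    "--delete-after",
    "--prune-empty-dirs",
    "--delay-updates",
}


def non_incremental_options(args):
    """Return the arguments in `args` that force rsync to build the
    full file list up front.  Only long option names are recognized;
    short forms such as -m (for --prune-empty-dirs) are not checked."""
    return [a for a in args if a.split("=", 1)[0] in NON_INCREMENTAL]


# Hypothetical command line; source and destination are placeholders.
cmd = ["rsync", "-a", "--delete-during", "/srv/data/", "backup:/srv/data/"]
offending = non_incremental_options(cmd)
if offending:
    print("these options disable incremental recursion:", offending)
```

If deletion is needed, --delete-during (which is what plain --delete
selects when both ends are rsync 3) keeps the incremental scan.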

Aside from that, rsyncing 100 million files means 200 million calls to
stat(), 100 million on each end (hopefully running in parallel on the
2 systems).  This will take time.  Even if there is very little change
it will take rsync time to determine that.  Plan accordingly.
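
To put "plan accordingly" into numbers, here is a back-of-the-envelope
sketch in Python.  The stat() rates are placeholders chosen only to
show the arithmetic, not measurements of any real system; substitute
a rate measured on your own storage.

```python
def scan_hours(files, stats_per_second):
    """Hours one side needs to stat() every file once at a given rate."""
    return files / stats_per_second / 3600


files = 100_000_000
# Hypothetical rates in stat() calls per second, for illustration only.
for rate in (1_000, 10_000, 100_000):
    print(f"{rate:>7} stat/s -> {scan_hours(files, rate):6.1f} hours")
```

With these assumed rates the scan alone comes to about 28 hours,
under 3 hours, and roughly 17 minutes per side, before any data is
transferred.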

Unfortunately, this need for lots of stat() calls is not limited to
rsync.  Any file-based utility will do at least that much work.  At
least rsync is using stat() to figure out what it doesn't have to
actually copy.
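
One way to get a feel for that cost on a given system is to time the
stat calls over a sample of the tree.  This is a minimal sketch, with
the function name and the sample size of 100,000 files chosen
arbitrarily; it uses os.lstat(), which does not follow symlinks.

```python
import os
import time


def measure_stat_rate(root, max_files=100_000):
    """Walk `root`, lstat() up to `max_files` files, and return
    (files_examined, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
            if count >= max_files:
                return count, time.perf_counter() - start
    return count, time.perf_counter() - start


# Example use (the path is a placeholder):
# n, secs = measure_stat_rate("/srv/data")
# print(f"{n} files in {secs:.1f}s")
```

A second run over the same files is usually much faster because the
kernel caches the metadata, so the first (cold) run is the more
realistic number for planning.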

On 03/28/13 01:30, Cristian Bichis wrote:
> Hi,
> 
> I need to organize about 100 million small files (and the number
> keeps growing) on a server, which should be copied to another server.
> 
> I am wondering how many files are recommended to be kept in a
> folder for optimal performance? Also, if I have a folder with
> only subfolders (no files), what number of subfolders is
> recommended?
> 
> The same question applies to the "find" command, not just to
> rsync, as I am doing some cleanups using find (or for - find).
> 
> 
> I made a mistake before and greatly increased the number of
> subfolders (having just a few files within them), and rsync
> performance decreased considerably. That was a mistake which I will
> try to correct.
> 
> So now, as the number of files is increasing constantly, I need to
> find a long-term solution to correct the current issues.
> 
> Cristian
> 
> 

-- 
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~
	Kevin Korb			Phone:    (407) 252-6853
	Systems Administrator		Internet:
	FutureQuest, Inc.		Kevin at FutureQuest.net  (work)
	Orlando, Florida		kmk at sanitarium.net (personal)
	Web page:			http://www.sanitarium.net/
	PGP public key available on web site.
~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~

