filelist caching optimization proposal

Edwin Eefting edwin at datux.nl
Mon May 23 13:24:07 GMT 2005


Hi,

As a Gentoo user I frequently run the emerge sync command, which in turn runs 
rsync against the main server. The 'problem' is that the portage directory 
tree contains about 19,000 directories and 96,000 files, so building the file 
list takes a pretty long time because of the many disk accesses that are 
necessary. On the server side the disk I/O problem is probably less severe, 
since after the first run the whole tree is held in the OS disk cache 
(though I think all those syscalls still cost a lot of CPU time).

My idea is to create a patch for something like a --cache option that will use 
a cached version of the file list. Instead of rebuilding the file list every 
time (hundreds of thousands of system calls and disk accesses), we could load 
the file list in one go. This is even more useful for rsync servers, which 
are usually read-only (like the Gentoo mirrors, or kernel.org, which always 
seems to have a load of 100+ ;)
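
To make the idea concrete, here is a rough sketch in Python rather than a patch against rsync itself. It only shows the principle: walk the tree once with one lstat per entry, write the result to a single file, and read it back in one go on later runs. The function names, the JSON format and the cache location are all assumptions made for illustration; rsync's real file list holds more fields and uses its own format.

```python
import json
import os


def build_filelist(root):
    """Walk root and lstat every entry (the expensive part)."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # one syscall per file or directory
            entries.append({
                "path": os.path.relpath(path, root),
                "mode": st.st_mode,
                "size": st.st_size,
                "mtime": int(st.st_mtime),
            })
    return entries


def save_cache(entries, cache_path):
    """Write the list to a temp file, then rename it into place."""
    tmp = cache_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(entries, f)
    os.replace(tmp, cache_path)  # readers never see a half-written cache


def load_cache(cache_path):
    """Load the whole file list with a single file read."""
    with open(cache_path) as f:
        return json.load(f)
```

In this sketch the cache file is assumed to live outside the tree being scanned; if it were stored inside the tree, build_filelist() would have to skip it.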

I see the following problem with this:
The cache will get out of sync if something manually changes the local 
files. So using the cache option wouldn't be recommended for users who 
don't know what's going on. However, it could be enabled manually under the 
right circumstances. Maybe it's even possible to do some extra checks, for 
example on directory ctimes in the main directory.
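
A directory ctime check could look roughly like the sketch below (again Python, with hypothetical function names, and assuming a Unix-like system where st_ctime is the inode change time). When the cache is written, the ctime of every directory is recorded; later only the directories are re-stat'ed, not the files. Note the limits: a directory's ctime changes when entries are added, removed or renamed in it, but not when an existing file is modified in place, and a change made within the filesystem's timestamp granularity can be missed.

```python
import os


def snapshot_dir_ctimes(root):
    """Record the ctime of every directory under root (including root)."""
    stamps = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        stamps[rel] = os.stat(dirpath).st_ctime_ns
    return stamps


def stale_dirs(root, stamps):
    """Return directories that vanished or whose ctime has changed."""
    stale = []
    for rel, old in stamps.items():
        try:
            now = os.stat(os.path.join(root, rel)).st_ctime_ns
        except FileNotFoundError:
            stale.append(rel)  # directory was removed since the snapshot
            continue
        if now != old:
            stale.append(rel)  # entries were added, removed or renamed
    return sorted(stale)
```

If stale_dirs() returns an empty list the cached file list could be used as-is; otherwise only the listed directories would need to be rescanned.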

- What are the opinions of other people on this list? 
- Would it be easy to implement, or would it cause too much trouble? 
- What are the most likely problems I would run into when I try to implement 
  this?
- Any ideas on WHERE to store such a cache? (A magic hidden file in the 
  directory whose file list is being built, perhaps?)

Thanks,
Edwin


-- 
  //||\\  Edwin Eefting
 || || || DatuX, Linux solutions and innovations
  \\||//  http://www.datux.nl  

        Nieuw Amsterdamsestraat 40
        7814 VA Emmen
        Tel. 0591-857037
        Fax. 0591-633001
         

