[jcifs] SMB Crawler Guidelines

Michael B.Allen mba2000 at ioplex.com
Wed Apr 9 06:34:11 EST 2003


On Tue, 08 Apr 2003 15:52:13 -0400
Dan Dumont <Dan at canofsleep.com> wrote:

> Is there an algorithm you can suggest, or some documentation that you can
> point me to that would explain the benefits and downfalls of various
> implementations?

You mean crawler implementations? Not really. I tried on several occasions
to create a truly optimal crawler. It's not trivial. The T2Crawler
example is without a doubt the fastest. But it does not respect the
depth limit and last I checked there was no easy way to communicate that
limit to each thread.

Here are some general guidelines though:

o Do not use a large working set. If the working set is large the
jcifs.smb.client.attrExpirationPeriod will run out and the client
will go back to the network to query attributes. This will cause the
client to stall. Use a small list of directories to traverse and make
attrExpirationPeriod large to ensure SmbFile attributes are never stale.

o Do not create and destroy threads. It is expensive to create (and
destroy) threads. Create a pool of N threads and let them loop over a
shared working set. Take care to synchronize access to the shared list.

o Do not use many threads. More threads != more work. Threads can only
divide up a fixed potential amount of work. A separate thread will only
do productive work if all other threads are sleeping (e.g. waiting for a
response from the network in this case). If many threads are used there
will always be some that are not sleeping and therefore adding more will
not result in more work being performed. Threads use a lot of memory
(~1MB) and context switching between threads is expensive (destroys CPU
cache) so having too many threads will quickly degrade your crawler's
performance.

o Do not traverse many hosts at the same time because each will require
an SmbTransport which requires a Thread and two 64K buffers.

o Do not traverse many directories on the same host at the same time
because more IO multiplexing on the same host makes the host less
responsive and therefore your threads ultimately end up doing less
work. They could be off working on another host that would respond
quickly.

o Considering the above two constraints there is a ratio of threads
per host that is ideal. This is probably 2 or 3 but may be as high
as 5. Now add threads until your throughput no longer increases.
Depending on the machine this is probably around 10 to 50 threads.

o Increase the memory allocated to the VM with the -mx256m VM parameter
(or use -mx512m if you have a lot of memory). This will relieve garbage
collection pressure. After doing so you may find the number of threads
and/or threads per host may be increased.
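
The threading guidelines above could be sketched roughly as follows. This
is only an illustration, not the T2Crawler code: the names CrawlerSketch,
Dir and crawl are hypothetical, and the directory listing is passed in as
a plain function so the sketch runs without jcifs (in a real crawler it
would presumably wrap something like SmbFile.listFiles()). It assumes a
fixed pool of worker threads looping over one shared queue, a semaphore
per host to cap concurrent traversals of that host, and a depth carried
in every work item so each thread can see the depth limit.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CrawlerSketch {

    /** One directory to visit (hypothetical type, not part of jcifs). */
    public static final class Dir {
        public final String host, path;
        public final int depth;
        public Dir(String host, String path, int depth) {
            this.host = host; this.path = path; this.depth = depth;
        }
    }

    /**
     * Visits every directory reachable from roots, down to maxDepth,
     * using a fixed number of threads and at most perHost concurrent
     * listings per host. The lister returns a directory's sub-directories.
     */
    public static Set<String> crawl(List<Dir> roots, Function<Dir, List<Dir>> lister,
                                    int threads, int perHost, int maxDepth)
            throws InterruptedException {
        LinkedBlockingQueue<Dir> work = new LinkedBlockingQueue<>(roots); // shared working set
        AtomicInteger pending = new AtomicInteger(roots.size());          // queued + in progress
        Map<String, Semaphore> hostLimits = new ConcurrentHashMap<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(threads);

        Runnable worker = () -> {
            try {
                while (pending.get() > 0) {
                    Dir d = work.poll(50, TimeUnit.MILLISECONDS);
                    if (d == null) continue;              // nothing queued right now
                    Semaphore s = hostLimits.computeIfAbsent(d.host, h -> new Semaphore(perHost));
                    if (!s.tryAcquire(10, TimeUnit.MILLISECONDS)) {
                        work.put(d);                      // host is busy: requeue, try other work
                        continue;
                    }
                    try {
                        visited.add(d.host + d.path);
                        if (d.depth < maxDepth) {         // depth limit travels with the item
                            for (Dir child : lister.apply(d)) {
                                pending.incrementAndGet();
                                work.put(child);
                            }
                        }
                    } finally {
                        s.release();
                        pending.decrementAndGet();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        };
        // Threads are created once and reused until the working set is empty.
        for (int i = 0; i < threads; i++) new Thread(worker, "crawler-" + i).start();
        done.await();
        return visited;
    }

    public static void main(String[] args) throws InterruptedException {
        // In-memory stand-in for a share: path -> sub-directories.
        Map<String, List<Dir>> tree = new HashMap<>();
        tree.put("/", Arrays.asList(new Dir("hostA", "/docs", 1), new Dir("hostA", "/src", 1)));
        Set<String> seen = crawl(Arrays.asList(new Dir("hostA", "/", 0)),
                d -> tree.getOrDefault(d.path, Collections.<Dir>emptyList()), 4, 2, 3);
        System.out.println(seen.size() + " directories visited");
    }
}
```

The thread count and per-host cap are plain parameters here, so they can
be tuned as described above: start with 2 or 3 per host and raise the
thread count until throughput stops improving.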

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 

