[jcifs] SMB Crawler Guidelines

Dan Dumont Dan at canofsleep.com
Wed Apr 9 07:05:27 EST 2003

> This is probably 2 or 3 but may be as high
> as 5 possibly. Now add threads until your throughput no longer
> increases.

Is there a way to test this in the program, or is this an observation you
intended us to make while deciding how many threads to use?

In this case, I was thinking that a large number of threads would be
preferable, as the database of IPs to scan can contain a great many inactive
computers, and the network I/O is very slow.

However, you mentioned that a thread is spawned for each SmbTransport, but
you also said that thread creation is expensive.  Did you somehow work
around this, or is this a cost that we must live with?

Also, I think I understand what you meant by about 5 threads per host, but
how many hosts do you think we should crawl at a time?  There is a higher
probability of waiting on a host, since the host may be inactive.  So, does
20-30 parent threads, each with 3-5 per-host indexing threads, sound
reasonable?

-----Original Message-----
From: jcifs-bounces+dan=canofsleep.com at lists.samba.org
[mailto:jcifs-bounces+dan=canofsleep.com at lists.samba.org] On Behalf Of
Michael B.Allen
Sent: Tuesday, April 08, 2003 4:34 PM
To: Dan Dumont
Cc: jcifs at lists.samba.org
Subject: [jcifs] SMB Crawler Guidelines

On Tue, 08 Apr 2003 15:52:13 -0400
Dan Dumont <Dan at canofsleep.com> wrote:

> Is there an algorithm you can suggest, or some documentation that you can
> point me to that would explain the benefits and downfalls of various
> implementations?

You mean crawler implementations? Not really. I tried on several occasions
to create a truly optimal crawler. It's not trivial. The T2Crawler
example is without a doubt the fastest. But it does not respect the
depth limit and last I checked there was no easy way to communicate that
limit to each thread.
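
One possible way to get a depth limit to every thread is to let the
limit travel with the work itself: each queued item records how deep it
is, so whichever thread picks it up can decide whether to descend. This
is only a sketch of that idea, not how T2Crawler works. The class name,
the Work type and the in-memory tree map are all hypothetical; the map
stands in for a real directory listing call, and the loop is
single-threaded to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class DepthLimitedCrawl {
    /** Hypothetical work item: a directory plus its depth below the start. */
    public static final class Work {
        public final String path;
        public final int depth;
        public Work(String path, int depth) {
            this.path = path;
            this.depth = depth;
        }
    }

    /**
     * Visits root and everything reachable within maxDepth levels below it.
     * The tree map is an assumed stand-in for a network directory listing.
     */
    public static List<String> crawl(Map<String, List<String>> tree,
                                     String root, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Deque<Work> queue = new ArrayDeque<>();
        queue.add(new Work(root, 0));
        while (!queue.isEmpty()) {
            Work w = queue.poll();
            visited.add(w.path);
            if (w.depth >= maxDepth) {
                continue; // at the limit: record the entry but do not descend
            }
            List<String> children = tree.get(w.path);
            if (children == null) {
                children = Collections.emptyList();
            }
            for (String child : children) {
                // The depth is stored in the item, so no thread needs
                // any extra state to enforce the limit.
                queue.add(new Work(child, w.depth + 1));
            }
        }
        return visited;
    }
}
```

With several threads, the same Work items would sit in a shared,
synchronized queue instead of a local one.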

Here's some general guidelines though:

o Do not use a large working set. If the working set is large the
jcifs.smb.client.attrExpirationPeriod will run out and the client
will go back to the network to query attributes. This will cause the
client to stall. Use a small list of directories to traverse and make
attrExpirationPeriod large to ensure SmbFile attributes are never stale.

o Do not create and destroy threads. It is expensive to create (and
destroy) threads. Create a pool of N threads and let them loop over a
shared working set. Take care to synchronize access to the shared list.

o Do not use many threads. More threads != more work. Threads can only
divide up a fixed potential amount of work. A separate thread will only
do productive work if all other threads are sleeping (e.g. waiting for a
response from the network in this case). If many threads are used there
will always be some that are not sleeping and therefore adding more will
not result in more work being performed. Threads use a lot of memory
(~1MB) and context switching between threads is expensive (destroys CPU
cache) so having too many threads will quickly degrade your crawler's
performance.

o Do not traverse many hosts at the same time because each will require
an SmbTransport which requires a Thread and two 64K buffers.

o Do not traverse many directories on the same host at the same time
because more IO multiplexing on the same host makes the host less
responsive and therefore your threads ultimately end up doing less
work. They could be off working on another host that would respond
more quickly.
o Considering the above two constraints there is a ratio of threads
per host that is ideal. This is probably 2 or 3 but may be as high
as 5 possibly. Now add threads until your throughput no longer
increases. Depending on the machine this is probably around 10 to
50 threads.

o Increase the memory allocated to the VM with the -mx256m VM parameter
(or use -mx512m if you have a lot of memory). This will relieve garbage
collection pressure. After doing so you may find the number of threads
and/or threads per host may be increased.
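
To make the guidelines above concrete, here is a minimal sketch of a
fixed pool of threads looping over a shared queue with a cap on
simultaneous listings per host. It is an illustration under stated
assumptions, not jCIFS code: PooledCrawler, Work and Lister are
hypothetical names, and Lister stands in for whatever call actually
lists a remote directory. The thread count and per-host cap are
parameters so they can be tuned by measuring throughput.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledCrawler {
    /** Hypothetical stand-in for a remote directory listing call. */
    public interface Lister {
        List<String> children(String host, String dir) throws Exception;
    }

    /** Hypothetical work item: one directory on one host. */
    public static final class Work {
        public final String host;
        public final String dir;
        public Work(String host, String dir) {
            this.host = host;
            this.dir = dir;
        }
    }

    private final BlockingQueue<Work> queue = new LinkedBlockingQueue<>();
    private final Map<String, Semaphore> hostSlots = new ConcurrentHashMap<>();
    private final AtomicInteger pending = new AtomicInteger(); // queued or in progress
    private final AtomicInteger listed = new AtomicInteger();

    /** Crawls from the seeds and returns the number of directories listed. */
    public int crawl(List<Work> seeds, int threads, int maxPerHost, Lister lister)
            throws InterruptedException {
        for (Work w : seeds) {
            submit(w);
        }
        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < threads; i++) { // threads are created once, up front
            Thread t = new Thread(() -> workLoop(maxPerHost, lister));
            t.start();
            pool.add(t);
        }
        for (Thread t : pool) {
            t.join();
        }
        return listed.get();
    }

    private void submit(Work w) {
        pending.incrementAndGet(); // count before queuing so pending never reads low
        queue.add(w);
    }

    private void workLoop(int maxPerHost, Lister lister) {
        try {
            while (pending.get() > 0) {
                Work w = queue.poll(50, TimeUnit.MILLISECONDS);
                if (w == null) {
                    continue; // nothing queued right now; re-check pending
                }
                Semaphore slots = hostSlots.computeIfAbsent(
                        w.host, h -> new Semaphore(maxPerHost));
                if (!slots.tryAcquire(10, TimeUnit.MILLISECONDS)) {
                    queue.add(w); // host is saturated: requeue and try other work
                    continue;
                }
                try {
                    List<String> kids = lister.children(w.host, w.dir);
                    listed.incrementAndGet();
                    for (String kid : kids) {
                        submit(new Work(w.host, kid));
                    }
                } catch (Exception e) {
                    // e.g. host inactive or access denied: skip this directory
                } finally {
                    slots.release();
                    pending.decrementAndGet();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A call such as new PooledCrawler().crawl(seeds, 20, 3, lister) would
correspond to 20 threads with at most 3 per host. Timing the same crawl
at a few different thread counts is one way to find the point where
throughput stops increasing.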


A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and, more important, to tasks that have not
yet been conceived.
