SV: [jcifs] BFS vs DFS

Sat Jul 28 09:44:47 EST 2001

On Fri, Jul 27, 2001 at 03:24:36PM +0200, James Nord wrote:
> Allen, Michael B (RSCH) wrote:
> 
> >Please note the public jCIFS API is far from ideal for an SmbCrawler. There
> >are a few advanced things that could be done to improve performance.
> >
> >3) Use threads but only one per host.
> >
>                                            ^^^^^^^^^^^^^^^^^^
> Oops,  Back to the drawing board I go ;-)
> 
> Any particular reason?

Some very primitive tests suggest that using only one thread per host
is a little better. This is because transports are shared, so issuing
concurrent requests to the same host will result in threads waiting for
a lock to be released before a message can be sent or received. Since
about 10 threads will push the CPU up to ~100%, you might as well just
dedicate a thread to each of 10 hosts. That will open 10 transports, and
each thread will have uninhibited access to its own. This will probably
increase performance only slightly, and only linearly. That's why
it's listed 3rd.
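
A minimal sketch of the one-thread-per-host arrangement, in plain Java
with no jCIFS calls. PerHostCrawl, hostOf(), groupByHost() and the
crawlOne callback are hypothetical names made up for illustration; the
callback stands in for whatever actually walks a share. It also assumes
the URLs carry no embedded credentials.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of jCIFS: one worker thread per host.
final class PerHostCrawl {

    // Extracts the host from a URL such as "smb://host/share/".
    // Assumes no "user:pass@" prefix is present in the URL.
    static String hostOf(String smbUrl) {
        String prefix = "smb://";
        if (!smbUrl.startsWith(prefix)) {
            throw new IllegalArgumentException("not an smb:// URL: " + smbUrl);
        }
        String rest = smbUrl.substring(prefix.length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // Groups the start URLs by host, keeping the order they were given in.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            byHost.computeIfAbsent(hostOf(url), h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    // Calls crawlOne for every URL, using exactly one thread per host so
    // that no two threads compete for the same host. Blocks until all
    // workers have finished.
    static void crawl(List<String> urls, Consumer<String> crawlOne) {
        List<Thread> workers = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groupByHost(urls).entrySet()) {
            List<String> hostUrls = e.getValue();
            Thread t = new Thread(() -> hostUrls.forEach(crawlOne),
                                  "crawl-" + e.getKey());
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt
                return;
            }
        }
    }
}
```

Shares on the same host are handled one after another by that host's
thread, so the number of threads equals the number of distinct hosts.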

If you *really* want raw speed, look at the SmbTree.java code to
see how to use its methods directly or, better still, use send() and
sendTransaction(). This is actually not that hard. It's a pretty good
API by itself because you still don't need to know anything about
encoding or decoding the actual SMBs. This is close to the metal
and allows you to send the bare minimum number of messages on the wire.
For example, when you call exists() a Trans2QueryPathInformation
message is sent and received. But you probably already received this
information in a previous message, such as from the Trans2FindFirst2 list
operation. If you look at packet traces during a crawling session, there
is a Trans2QueryPathInformation sent for *every* file or directory. This
is because the public API does not have the clairvoyance to know that
you already retrieved this information and that it's not stale (was the
file deleted?).
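
One way a crawler could supply that clairvoyance itself is to remember
what each directory listing reported and only ask the server again once
the entry is considered too old. This is a rough sketch in plain Java,
not jCIFS code: ListingCache, recordListing(), knownToExist() and the
maximum-age policy are all hypothetical and just illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical crawler-side cache, not part of jCIFS. It remembers which
// paths a directory listing already reported, so an existence check can
// be answered locally while the entry is still fresh.
final class ListingCache {
    private final long maxAgeMillis;
    private final Map<String, Long> seenAt = new HashMap<>();

    ListingCache(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    // Records every name returned by one listing of dirUrl.
    // dirUrl is assumed to end in '/'.
    synchronized void recordListing(String dirUrl, String[] names, long nowMillis) {
        for (String name : names) {
            seenAt.put(dirUrl + name, nowMillis);
        }
    }

    // True only if a listing reported this path recently enough.
    // False means "unknown, ask the server", not "does not exist".
    synchronized boolean knownToExist(String url, long nowMillis) {
        Long t = seenAt.get(url);
        return t != null && nowMillis - t <= maxAgeMillis;
    }
}
```

A crawler would call recordListing() with the names it got back from a
directory listing and consult knownToExist() before falling back to
exists(). The maximum age is the trade-off: a longer value saves more
round trips but risks trusting an entry for a file that has since been
deleted.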

> Is it just the nature of SMB and the way the
> queries are sent?

SMB is a very verbose protocol. jCIFS does a very good job in that it
multiplexes IO and uses batching. I would say it's more of an incongruency
between the public API and the nature of what you are trying to do, which
is traverse as many files on different shares, on different hosts, as
quickly as possible. If you go down one layer this could be very efficient,
but the *public* API is designed to be "as simple as possible and no
simpler". The target audience here is utilitarian programs for bridging
the gap between MS and the Java environment, not mp3 spiders. If you
want a rippin' fast mp3 spider, you have to go down one level. I would
help you further but I think I should give people a chance to password
protect their shares first :~)

> I was thinking about this when I started my crawler, but I figured I'd go
> multithreaded in case the files/shares were on different disks and I
> would get a speed increase ;-)

No. If you run the ThreadedSmbCrawler example against a server and monitor
the CPU usage of both machines, you'll see the crawler is using ~100%
whereas the server doesn't even break a sweat. This is mostly due to the
above-mentioned incongruency. But it's also a little bit of a Java vs. C
thing.

Mike