[jcifs] OutOfMemoryError unable to create new native thread

Sun Mar 7 13:03:26 MST 2010

Michael B Allen <ioplex <at> gmail.com> writes:

> 
> Hi Peter,
> 
> The whole point of the QueryThread is to allow simultaneous lookups
> such that the first valid response is used immediately without
> waiting. So by adding join you basically completely defeat the purpose
> of using QueryThreads at all. You might as well just dump the
> QueryThreads entirely and perform two synchronous lookups.
> 
> First, try playing with the name service properties. Try setting
> jcifs.resolveOrder=DNS. I think that might stop the NetBIOS lookups
> entirely. Of course you'll then need to use only fully qualified DNS
> hostnames.
> 
> Otherwise, you'll need to look at why these lookups are being
> generated at such a high frequency in the first place. You must be
> using an awful lot of threads if you eventually end up spawning 1900
> lookups in which case you might want to analyze and determine if using
> that many threads is really increasing overall throughput. Or prevent
> all of the threads from starting at the same time so that the results
> of the lookups have a chance to be cached.
> 
> Regarding the cache, if it turns out that increasing
> jcifs.netbios.cachePolicy reduces the frequency of the lookups, then
> that is actually a good solution. These names do not change
> frequently. The default value is low only because in almost all cases
> it doesn't need to be higher. It's only when someone is trying to do
> 100 lookups at the same exact instant that it matters. This is why we
> provide for changing these properties.
> 
> Mike

Hi Michael,

Thanks for responding, and for pointing out the possible tuning.

However, I probably wasn't clear enough in my description of the problem.
Let me try to clarify it.

The problem is not that our app creates too many threads; it doesn't. The
problem is that the JCIFS library itself creates too many threads. In our case,
when we hit the OutOfMemoryError, there are more than 1800 JCIFS QueryThreads
alive, and those are what cause the problem.
The tuning might postpone the problem (or hide it in some cases), but it will
not fix it.
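
For reference, the tuning suggested in the quoted message could be applied
roughly as below. This is only a sketch: it assumes JCIFS picks these values up
from the JVM system properties when its configuration is first loaded, so the
call has to happen before any JCIFS class is used. The class name JcifsTuning
and the value 600 are made up for illustration.

```java
public class JcifsTuning {

    /** Sets the two name-service properties mentioned above as system properties. */
    public static void apply(String resolveOrder, int cachePolicySeconds) {
        // Assumption: JCIFS reads these from system properties at start-up,
        // so this must run before the first JCIFS class is touched.
        System.setProperty("jcifs.resolveOrder", resolveOrder);
        System.setProperty("jcifs.netbios.cachePolicy",
                Integer.toString(cachePolicySeconds));
    }

    public static void main(String[] args) {
        // 600 seconds is an illustrative value, not a recommendation.
        apply("DNS", 600);
        System.out.println("jcifs.resolveOrder = "
                + System.getProperty("jcifs.resolveOrder"));
        System.out.println("jcifs.netbios.cachePolicy = "
                + System.getProperty("jcifs.netbios.cachePolicy"));
    }
}
```

The same values can also be passed on the command line with -D options instead
of being set in code.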

Please check the test case I sent. It is very simple and should be easy to
understand, and if you hit the issue it dumps the live threads, so you can see
that they are JCIFS QueryThreads.
Please also remove the property that sets the cache to 0; it seems I was wrong
in assuming the behavior could be tuned with that property.

As you can see, the test itself doesn't create any new threads; it only sends
many lookup requests to JCIFS. The threads that get created are JCIFS
QueryThreads.
I hope you are able to reproduce the problem. It should only be a matter of
increasing the number of requests (and providing some DNS name).
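
The thread dump mentioned above can be produced with plain JDK calls. The
following is a minimal sketch, not the attached test case: the class name
LiveThreadReport is hypothetical, and the "JCIFS-QueryThread" prefix used in
main is an assumption about how the library names its query threads, so adjust
it to whatever the dump actually shows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LiveThreadReport {

    /** Counts live threads whose name starts with the given prefix. */
    public static int countByPrefix(String prefix) {
        int count = 0;
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    /** Prints "name state" for every live thread, sorted by name. */
    public static void dump() {
        List<String> lines = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            lines.add(t.getName() + " " + t.getState());
        }
        Collections.sort(lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }

    public static void main(String[] args) {
        dump();
        // The prefix is an assumption; take the real one from the dump output.
        System.out.println("query threads alive: "
                + countByPrefix("JCIFS-QueryThread"));
    }
}
```

Calling dump() from the handler that catches the OutOfMemoryError is enough to
see which threads are piling up.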

I debugged the test; here is a rough description of what happens and what goes
wrong:

1) The app sends many requests to resolve the same DNS name (from the same
thread), which should be a perfectly valid thing to do.
2) For the first request R1, JCIFS creates two QueryThreads (R1-Q1, R1-Q2).
Each resolves a specific flavor of the DNS name, creating two Name instances:
Name1 and Name2.
3) One of the QueryThreads (R1-Q1) resolves Name1 successfully, puts it into
the cache, and provides the result to the caller.
4) The caller gets the result and continues.
5) Meanwhile, the second QueryThread (R1-Q2) is still resolving Name2 and gets
stuck in the resolving call (CLIENT.getName or a similar call). At that point
the lookupTable records that Name2 is being resolved, and the thread is alive.
6) Then the next lookup request R2 comes in to resolve the same DNS name.
7) Again two new QueryThreads (R2-Q1, R2-Q2) are created to resolve the same
two Name instances, Name1 and Name2.
8) The QueryThread resolving Name1 (R2-Q1) succeeds, either from the cache or
by repeating step 3).
9) The caller gets the result and continues.
10) However, the second QueryThread (R2-Q2), resolving Name2, goes into a wait
state: QueryThread R1-Q2 from the first request is still trying to resolve the
same Name2, and it keeps R2-Q2 blocked because the lookupTable entry for Name2
is set.
11) If you repeat this process (steps 6) to 10)) many times (up to Rn, where n
is large enough) while R1-Q2 is still trying to resolve Name2, which sometimes
happens in our app, you eventually get so many QueryThreads (Rx-Q2) blocked and
lingering that you hit the OutOfMemoryError (no new native thread can be
created).
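
The pile-up described in the steps above can be imitated without JCIFS at all.
The sketch below is a simplified model, not the library code: LookupPileUp, the
inProgress set (standing in for the lookupTable) and the latch (standing in for
the stuck network call) are all hypothetical names. Each request starts a Q2
thread and does not wait for it, so the threads accumulate for as long as the
first one stays stuck.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

public class LookupPileUp {
    // Stand-in for the lookupTable: names currently being resolved.
    private final Set<String> inProgress = new HashSet<>();
    // Stand-in for the stuck network call; opened by release().
    private final CountDownLatch stuckCall = new CountDownLatch(1);

    /** One query: wait while another thread resolves the same name, then resolve. */
    void resolve(String name) throws InterruptedException {
        synchronized (inProgress) {
            while (inProgress.contains(name)) {
                inProgress.wait();          // step 10: blocked behind R1-Q2
            }
            inProgress.add(name);
        }
        try {
            stuckCall.await();              // step 5: the call that does not return
        } finally {
            synchronized (inProgress) {
                inProgress.remove(name);
                inProgress.notifyAll();
            }
        }
    }

    /** Fires the given number of requests; each leaves one Q2 thread behind. */
    public List<Thread> fire(int requests) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 1; i <= requests; i++) {
            Thread q2 = new Thread(() -> {
                try {
                    resolve("Name2");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "R" + i + "-Q2");
            q2.setDaemon(true);
            q2.start();                     // the caller does not wait for Q2
            threads.add(q2);
        }
        return threads;
    }

    /** Lets the stuck call return so the queued threads can drain. */
    public void release() {
        stuckCall.countDown();
    }

    public static long alive(List<Thread> threads) {
        return threads.stream().filter(Thread::isAlive).count();
    }

    public static void main(String[] args) throws InterruptedException {
        LookupPileUp sim = new LookupPileUp();
        List<Thread> threads = sim.fire(200);
        System.out.println("lingering threads: " + alive(threads));
        sim.release();
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("after release: " + alive(threads));
    }
}
```

In this model every request costs one lingering thread until the first query
returns, which is the growth pattern that ends in the OutOfMemoryError.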

About the fix:
You would be right to dismiss the fix if it only added the join calls on the
threads.
The request would then be blocked for longer, because even after the first
QueryThread Q1 had already provided a result, you would have to wait for the
second one, Q2, which is still trying to resolve Name2. That attempt will not
succeed, and it usually seems to take much longer (probably because of some
timeout).

However, before join is called on such a thread, there is also an interrupt
call on it, which in our case interrupts the unsuccessful attempt to resolve
Name2.
That works really well on my machine.
I saw no measurable performance penalty, no unnecessary threads lingering
around doing wasteful work, and no OutOfMemoryError.
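
To make the interrupt-then-join idea concrete, here is a minimal standalone
sketch. It is not the actual patch: InterruptThenJoin and its helper are
hypothetical, and Thread.sleep stands in for the blocking resolve call. It
assumes the blocking call reacts to interruption the way sleep does; a call
that ignores interrupts would still make join wait for its timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptThenJoin {

    /** Runs two queries in parallel and returns the first answer. */
    public static String lookup(long q1Millis, long q2Millis)
            throws InterruptedException {
        AtomicReference<String> answer = new AtomicReference<>();
        CountDownLatch firstAnswer = new CountDownLatch(1);

        Thread q1 = query("Q1", "Name1", q1Millis, answer, firstAnswer);
        Thread q2 = query("Q2", "Name2", q2Millis, answer, firstAnswer);
        q1.start();
        q2.start();

        firstAnswer.await();   // return as soon as either query succeeds
        q1.interrupt();        // no effect on a thread that already finished
        q2.interrupt();        // cancels the query that is still blocked
        q1.join();             // returns quickly because the loser was interrupted
        q2.join();
        return answer.get();
    }

    private static Thread query(String threadName, String result, long delayMillis,
                                AtomicReference<String> answer,
                                CountDownLatch firstAnswer) {
        return new Thread(() -> {
            try {
                Thread.sleep(delayMillis);          // stand-in for the blocking resolve
                answer.compareAndSet(null, result); // only the first answer is kept
                firstAnswer.countDown();
            } catch (InterruptedException e) {
                // Cancelled by the caller: exit without reporting a result.
            }
        }, threadName);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        String result = lookup(10, 60_000);   // Q2 would otherwise block for a minute
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after " + elapsedMillis + " ms, no thread left behind");
    }
}
```

The caller still gets the first result without waiting for the slow query, and
by the time lookup returns both query threads have exited.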

Please reconsider it.

Thanks,
-Peter