[jcifs] preventing soTimeout NT_STATUS_ACCESS_VIOLATION w/ NTLM

Fri Jun 17 17:23:38 GMT 2005

On Fri, 17 Jun 2005 08:13:03 -0500
"Tapperson Kevin" <Kevin.Tapperson at hcahealthcare.com> wrote:

> >> I have had success in preventing jcifs from throwing an SmbAuthException with the NT_STATUS_ACCESS_VIOLATION ("Invalid access to memory location") error
> >> code in association with an soTimeout event by implementing a reference counter for NTLM HTTP authentication requests in the SmbTransport class.  The
> >> following changes described below were done on the jcifs_1.1.10 code base.  I checked the jcifs_1.2.0 code base to see how different the changes would be
> >> for it.  The only major difference is the change in the SmbTransport.run() method.  (In 1.2.0, the changes below to SmbTransport would need to go into the
> >> Transport class and the Transport.disconnect() method.)
> >
> >Why is this necessary? My understanding is that with the default soTimeout
> >value of 10 minutes the chances of getting an access violation situation
> >are very slight. What soTimeout value are you using?
> 
> I have tried using soTimeout values of 300000 (5 minutes) and 0.

5 min is *less* than the default value. Why not just leave it at 10 minutes?

Are you using the stock NtlmHttpFilter or are you using a modified filter (aside from your refcounting change)?

> (I tried using an soTimeout value of 0 to avoid this problem, but found that it is still possible to generate the NT_STATUS_ACCESS_VIOLATION exception if the domain controller happens to decide it's time to close the transport socket during the time period after a type-2 message has been sent to a client but before the type-3 message has been received and processed by the filter.)  By adding a reference counter (as previously described) to the jcifs code and using an soTimeout value smaller than what the domain controller uses (appears to be about 15 minutes), jcifs can be in complete control over when the transport socket gets closed and can avoid this error (except as would occur in cases of dropped network connections).
> 
> We have robotic monitoring of our application in place and keep getting dinged with unexplained SLA violations (due to inability to authenticate).  After investigating, I found that our robotic script (and actual users of our system) are receiving this error from time to time.  We have users scattered across the US, so round trip response times (even for small data packets like the NTLM authentication process) can sometimes be measured in seconds depending on network conditions.  The frequency of occurrence of this problem is tied to the round trip response time between when a type-2 message is sent and when a type-3 message is received and processed.  The longer it takes for a client browser to receive a type-2 message and send a type-3 response, the better chance there is of encountering this issue.  With an soTimeout value of 5 minutes, this means that (depending on load and load distribution) up to (24*60/5) = 288 SmbTransport sockets from any one application server process will be closed per day per domain controller.  If any of those socket close events happen to coincide with a delayed response from a client, this error would be generated.  (We have 4 app servers running 3 JVMs each and load balancing across 3 different domain controllers for authentication.  So at worst case with an soTimeout of 5 minutes, we have 4*3*3*288 = 10368 SmbTransport sockets closed in a day.)
> 

I think there has to be something else going on here. First, your "number
of sockets closed in a day" is wrong because the socket only closes
after it's *been idle for soTimeout ms*. So the webserver would have
to get NO new requests that trigger auth for 5 minutes and then, within
a window of a few seconds depending on the network, get a request that
needs authentication. How many times is *that* going to happen in a day?
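
To make the idle-timeout point concrete, here is a minimal sketch in
Java. It is not jcifs code: the class IdleTimeoutSketch and the method
countIdleCloses are hypothetical names, and the model assumes a single
transport that is closed whenever the gap between consecutive auth
requests reaches soTimeout ms. Under that assumption the 288-per-day
figure is only a ceiling, and a steadily busy server closes nothing.

```java
import java.util.Arrays;

// Hypothetical model of one transport's idle closes; not part of jcifs.
final class IdleTimeoutSketch {

    /**
     * Counts how often the transport would be closed for idleness.
     *
     * @param requestTimesMs times (ms) at which auth requests used the transport
     * @param soTimeoutMs    idle period after which the socket is closed
     * @param endMs          end of the observation window (ms)
     */
    static int countIdleCloses(long[] requestTimesMs, long soTimeoutMs, long endMs) {
        long[] times = requestTimesMs.clone();
        Arrays.sort(times);
        int closes = 0;
        for (int i = 0; i < times.length; i++) {
            // The transport stays open until the next request or the end of the window.
            long next = (i + 1 < times.length) ? times[i + 1] : endMs;
            if (next - times[i] >= soTimeoutMs) {
                closes++; // idle for at least soTimeout ms, so the socket is closed
            }
        }
        return closes;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long day = 24 * 60 * minute;

        long[] busy = new long[24 * 60];          // one request per minute
        for (int i = 0; i < busy.length; i++) busy[i] = i * minute;

        long[] sparse = new long[24 * 60 / 6];    // one request every 6 minutes
        for (int i = 0; i < sparse.length; i++) sparse[i] = i * 6 * minute;

        System.out.println("busy:   " + countIdleCloses(busy, 5 * minute, day));
        System.out.println("sparse: " + countIdleCloses(sparse, 5 * minute, day));
    }
}
```

With a 5 minute soTimeout, one request per minute gives 0 idle closes in
a day, while one request every 6 minutes gives 240. Only a close that
lands between a type-2 and its type-3 can produce the error, so the
number of risky events is smaller still.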

If your network is really that slow then make the soTimeout something
really huge like 100 minutes. 0 is extreme and really shouldn't be
used (and as you discovered it's broken). 5 minutes is *less* than the
default which is the wrong direction I think. Personally I don't think
you should change it at all. If you're using 10 minutes and getting more
than 1 exception per month something else is going on. Crank up logging
and check to see what the exceptions actually are. If it's "Connection
timeout" or "No route to host" or something like that then I don't think
refcounting is going to help.

But if you're certain refcounting helps ameliorate the issue I'll look
at it. Have you received any exceptions since instituting the refcounting?
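
For reference, the refcounting idea under discussion could look roughly
like the sketch below. This is an illustration only, not the actual patch
and not jcifs code: the class RefCountedTransportSketch and its methods
are hypothetical, and it assumes the counter is raised when a type-2
message is sent and lowered once the matching type-3 has been processed.
It can only stop closes initiated on the jcifs side; a close initiated by
the domain controller is out of its reach.

```java
// Hypothetical sketch of a reference-counted transport; not part of jcifs.
final class RefCountedTransportSketch {

    private int pendingNegotiations; // type-2 sent, type-3 not yet processed
    private boolean closed;

    /** Call when a type-2 message is sent over this transport. */
    synchronized void acquire() {
        if (closed) {
            throw new IllegalStateException("transport already closed");
        }
        pendingNegotiations++;
    }

    /** Call once the matching type-3 message has been processed (or abandoned). */
    synchronized void release() {
        if (pendingNegotiations == 0) {
            throw new IllegalStateException("release without matching acquire");
        }
        pendingNegotiations--;
    }

    /**
     * Call when the idle timeout fires. The close is skipped while any
     * negotiation is in flight, leaving it to a later timeout to retry.
     *
     * @return true if the transport was closed by this call
     */
    synchronized boolean onIdleTimeout() {
        if (pendingNegotiations > 0) {
            return false; // a client still owes a type-3, so keep the socket
        }
        closed = true;
        return true;
    }

    synchronized boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        RefCountedTransportSketch t = new RefCountedTransportSketch();
        t.acquire();                                          // type-2 sent
        System.out.println("closed while pending: " + t.onIdleTimeout());
        t.release();                                          // type-3 processed
        System.out.println("closed when idle:     " + t.onIdleTimeout());
    }
}
```

A client that never sends its type-3 would pin the transport open, so a
real change would also need some way to expire stale references.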

Mike