[jcifs] NTLM usrname/password failure after each 5 mins

Tue Jun 24 14:04:50 GMT 2008

After investigating the issue further, I've found that the load test
problem is not caused by the synchronization issue I raised, but by the
hiccup you described.

*The Hiccup as I've seen it*
The transport is not used for 5 minutes, because the load test keeps using
the same users and jCIFS therefore serves them from its cache. Once the
transport disconnects and then reconnects, transport.server.encryptionKey
changes. Since the encryptionKey is the challenge jCIFS returns to the
browser in the type2 message, it is the heart of the problem.
The problem occurs when we send the challenge in the type2 message *before
the disconnect*. While the browser is processing it and preparing the type3
message using that challenge (let's call it A), the disconnect/connect
occurs, and we now have a new encryption key / challenge that we use when we
communicate with the DC (let's call this challenge B). After the connect
completes, we receive the type3 message, which was prepared using challenge
A, and send it to the DC, which expects us to use challenge B. That's why
we get the "bad username or password" exception.

*Trying to solve it*
I can detect this situation in the NtlmFilter if I save the challenge used
in the type2 message on the session and compare it, before processing the
type3 message, with the challenge currently used by the transport.
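
A minimal sketch of that comparison, for illustration only. ChallengeGuard,
snapshot() and isStillCurrent() are hypothetical names and not part of
jCIFS; how the two byte arrays are obtained from the filter and the
transport is left out and assumed here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of jCIFS: decides whether a type3 message can
// still be validated against the challenge the transport currently holds.
final class ChallengeGuard {

    // Copy the challenge at type2 time so later changes to the transport's
    // array (or a replacement of it) cannot alter the saved value.
    static byte[] snapshot(byte[] challenge) {
        return Arrays.copyOf(challenge, challenge.length);
    }

    // True only if the challenge sent in type2 equals the current one.
    static boolean isStillCurrent(byte[] sentInType2, byte[] currentOnTransport) {
        return sentInType2 != null
                && currentOnTransport != null
                && Arrays.equals(sentInType2, currentOnTransport);
    }

    public static void main(String[] args) {
        byte[] challengeA = {1, 2, 3, 4, 5, 6, 7, 8}; // sent to the browser in type2
        byte[] saved = snapshot(challengeA);          // would be stored on the session

        byte[] challengeB = {8, 7, 6, 5, 4, 3, 2, 1}; // transport reconnected meanwhile

        System.out.println(isStillCurrent(saved, challengeA)); // true
        System.out.println(isStillCurrent(saved, challengeB)); // false
    }
}
```

When the check returns false, the type3 message was built for challenge A
and should not be forwarded to the DC as is.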

I've tried sending a 401 with WWW-Authenticate: NTLM to *restart* the
process, this time with the right challenge, but it didn't work.

I've also tried sending a redirect (301) to the same resource to restart
the NTLM handshake, but it fails with a circular redirect.

If the load test uses a small number of threads, the problem doesn't
happen, since the type3 message is processed quickly enough, before the
disconnect occurs.
If we load it with many threads, it always happens.

*Guidance is needed*
1. Can we somehow know how much time the socket has left before reaching
the timeout? If we could, maybe we could use it when we prepare the type2
message. If we're too close to the disconnect, we'd either force it or wait
until it happens, and only then send the type2 message.

2. What are the implications of keeping it open indefinitely? Or just
"keeping it alive" somehow?

3. Do you have any idea on how to handle this issue?
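
To make question 1 concrete, here is a rough sketch of what I have in mind.
IdleDeadline and its methods are hypothetical and not part of jCIFS. It
assumes the transport could call touch() whenever it sees traffic on the
socket, and that the timeout is the 5 minutes mentioned above. Time is
passed in explicitly, so the sketch only approximates the real timer inside
the blocking read.

```java
// Hypothetical idle tracker, not part of jCIFS.
final class IdleDeadline {
    private final long timeoutMillis;
    private volatile long lastActivityMillis;

    IdleDeadline(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    // Assumed to be called whenever traffic is seen on the socket.
    void touch(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    // Estimated time left before the idle timeout fires, never negative.
    long remainingMillis(long nowMillis) {
        long elapsed = nowMillis - lastActivityMillis;
        return Math.max(0L, timeoutMillis - elapsed);
    }

    // True when a type2 message sent now risks being outlived by a reconnect.
    boolean tooCloseToDisconnect(long nowMillis, long safetyMarginMillis) {
        return remainingMillis(nowMillis) <= safetyMarginMillis;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        IdleDeadline deadline = new IdleDeadline(fiveMinutes, 0L);

        long now = 290_000L; // 4 min 50 s of silence on the socket
        System.out.println(deadline.remainingMillis(now));              // 10000
        System.out.println(deadline.tooCloseToDisconnect(now, 15_000L)); // true

        deadline.touch(now); // traffic seen: the idle window starts over
        System.out.println(deadline.tooCloseToDisconnect(now, 15_000L)); // false
    }
}
```

The safety margin would have to cover the time the browser needs to answer
with the type3 message, which I can only guess at under load.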

Thank you,

Asaf

On Mon, Jun 16, 2008 at 1:00 AM, Michael B Allen <ioplex at gmail.com> wrote:

> On 6/15/08, AsafM <asaf.mesika at gmail.com> wrote:
> >
> >  Hi all,
> >
> >  I'm reviving a 2-year-old topic, regarding load testing.
> >  You can take a look at the entire thread of the discussion here:
> >
> >  http://www.nabble.com/NTLM-usrname-password-failure-after-each-5-mins-td5381546.html#a5391633
> >
> >  I'll start with a quick summary, and then add lots of details to make it
> >  clearer:
> >  After the transport disconnects (due to socket timeout) and connects, the
> >  first 10 or so attempts to authenticate against the DC fail on bad
> >  username/password. After those failures, all attempts succeed.
> >  I've gained some knowledge I'll now share, but I'm still missing some key
> >  elements to figuring this out.
> >
> >  Load Testing Setup
> >  110 threads, consistently accessing a protected resource on Tomcat, which
> >  requires NTLM authentication.
> >  Each thread uses one user. For example: Thread-34 logs in as user
> >  TEST34.
> >
> >  The turn of events
> >  1. The first thread accessing the resource sets up the session
> >  (SmbSession.sessionSetup()), which blocks all other threads, since each
> >  thread (user) needs to set up a session of its own.
> >     The session setup runs Transport.connect(), creates a tree for the
> >  default user (to enable SMB signing), and sends the SmbComSessionSetupAndX
> >  to the DC for authentication.
> >
> >  2. Once the 1st session setup is done, all other threads follow, each
> >  creating its own session, attached to one transport object (Transport-1
> >  thread).
> >
> >  3. On the second iteration of the test threads, there's no need for session
> >  setup. The session object is retrieved from the transport (it's cached
> >  there).
> >  Because of this cache, the transport socket goes unused.
> >
> >  4. After soTimeout (a jCIFS constant of 5 min), the loop() method of
> >  Transport receives a SocketTimeoutException and calls
> >  Transport.disconnect(), which in turn calls SmbTransport.doDisconnect().
> >
> >  5. doDisconnect() logs off all sessions attached to the transport
> >  object, closes the socket, and finally resets the digest property, which
> >  is used to sign each request sent to the DC (it is set in the first
> >  sessionSetup in SmbSession).
> >
> >     ** First Problem**
> >  While disconnect() logs off sessions, other threads were using them and
> >  acting as-if the transport is connected.
>
> It is ok for other threads to reference sessions. If there is no
> activity on the socket then it should be possible to close the
> sessions even if there are 100 threads constantly calling
> SmbSession.logon().
>
> But the "acting as-if the transport is connected" sounds suspicious.
> When a transport is shutdown it should call logoff() on each session
> which should call treeDisconnect() on each transport which should set
> treeConnected = false. Then, if threads regain access to calling
> SmbSession.logon() they should see treeConnected = false and the first
> thread should reconnect the tree, re-logon the session and reconnect
> the transport. Then subsequent threads see treeConnected and you're
> back in the steady-state.
>
> >  I've bypassed this issue, by:
> >  a) Setting the Transport.state to 0 in the Transport.disconnect() function.
> >  This causes Transport.connect() to actually connect.
> >  b) Adding a synchronized (this) block to both the disconnect() and connect()
> >  methods, which prevents connect() from running while disconnect() is in
> >  progress.
>
> I don't understand this. The Transport.connect()/disconnect() methods
> are already synchronized and the transport state is changed to 0 in
> disconnect().
>
> >  6. While disconnect() was running, all other threads were waiting in a
> >  queue to run transport.connect() in the SmbTree.treeConnect() method.
> >     Once the disconnect finished, each thread in its turn ran the connect
> >  and continued on to create a session by running SmbSession.sessionSetup().
> >  Since that function is synchronized on transport(), sessions were created
> >  one at a time, for each thread.
> >
> >  7. The first session to run the setup identified that transport.digest
> >  was empty (due to SmbTransport.doDisconnect()), and thus ran treeConnect for
> >  the default username, used for SMB signing.
> >  Once that finished successfully, it sent the SmbComSessionSetupAndX for
> >  the user it was trying to authenticate.
> >  It failed in the DC. SmbComSessionSetupAndXResponse returned with the error
> >  code: Logon failure: unknown user name or bad password
> >
> >  8. A lot of threads in line after the first thread also failed at the exact
> >  same spot in sessionSetup().
>
> There is a known "hiccup" that occurs whenever the connection is
> recycled due to the soTimeout. I don't know what the problem is. I
> assume the challenge is momentarily wrong.
>

> >  9. For some magical reason, which I have yet to figure out, after 10 or so
> >  failures, the DC started returning success in the
> >  SmbComSessionSetupAndXResponse.
>
> Is the NTLM challenge old? Log the hexdump of the NTLM challenge and
> see if it changes with the result of the
> SmbComSessionSetupAndXResponse. If it does that confirms that the
> challenge isn't being handled properly. If it does not change and the
> new challenge is being used correctly, but the DC is returning
> different results given the same input then that would be very
> interesting.
>
> This is the best analysis of the "hiccup" bug that I've seen. Aside
> from my comments, everything you say is true and is expected behavior.
> The interesting parts are the "acting as-if the transport is
> connected" bit and what the challenge is across the authentication
> failure / success.
>
> Mike
>
> --
> Michael B Allen
> PHP Active Directory SPNEGO SSO
> http://www.ioplex.com/
>