Samba 2.2.2 oplock problem

Fri Jan 4 11:13:05 GMT 2002

Hi Jeremy,

We have been using Samba 1.9.18p10 in production here at UB for over 2 
1/2 years with no problems. We use it to give Windows users access to 
DFS space via DCE username/password. We have four Suns running Samba 
that we recently upgraded to Solaris 8 (from 2.6) and DCE/DFS 3.1.

We took the opportunity before the semester starts to also upgrade to 
Samba 2.2.0. We ran into the 100% CPU utilization problem related to the 
oplock race condition, so I upgraded to 2.2.2 and applied your patch.

However, we are still seeing this problem. Is there perhaps another 
patch or fix for this that I missed on the list?

Any help at all on this would be greatly appreciated.

Thanks
--
Lee Liolios
Programmer/Analyst
SUNY Buffalo

 >Hi all,
 >
 >	I think I've finally worked out what is causing the panic
 >error messages. These are the
 >
 >open_mode_check: Existant process XXXX left active oplock.
 >
 >messages that people have mainly been seeing on Solaris (but
 >occasionally on Linux also).
 >
 >I think it's a race condition caused by the cleanup code in
 >locking/locking.c that ensures the share mode database contains
 >no entries from a terminating smbd, and the code in open.c that
 >ensures an open file has no exclusive oplock entries left.
 >
 >It would normally occur with a heavily contended file; the
 >scenario looks something like this...
 >
 >smbd (a) sends client an oplock break message due to open
 >requests from smbd's (b) and (c).... (z) - all of which can
 >happen concurrently.
 >
 >The client of smbd (a) fails to respond to the break request
 >(happens sometimes, bad cabling, client dead, whatever..).
 >
 >smbd (a) then decides it's time to exit. In doing so it
 >goes through the share mode/open file database, deleting
 >records for open files it has. It then does a second traverse
 >of the share mode db looking for any records it may have
 >missed (that would be a logic error). This second traversal
 >is very expensive and unnecessary (it's been removed in
 >the current 2.2 and HEAD CVS code). The whole point is that
 >this termination could take a relatively long time, depending
 >on the contention on the share mode db (this is the variable
 >part which is why it's been impossible to get a reproducible
 >test case).
 >
 >In the meantime, smbd's (b)....(z) are scanning the share
 >mode db, waiting for the record that caused them to send the
 >oplock break to be removed. Eventually they give up and decide
 >to remove the record themselves. Before doing that, currently
 >in the 2.2.2 and CVS (2.2 and HEAD) code, they check if the
 >process owning that record still exists. If it does, they
 >consider it a logic error and terminate themselves. THIS ASSUMPTION
 >IS THE FLAW. As noted above, the cleanup process may take a
 >relatively long time, and as such it's not an error if the
 >process still exists; it's (hopefully) doing its best to
 >clean up and die.
 >
 >The following simple patch (already applied to 2.2 and HEAD CVS)
 >should apply cleanly to a 2.2.2 source tree, and if this
 >assumption is correct, should fix the problem. If the
 >case described above occurs, all that should happen now
 >is log messages stating
 >
 >"open_mode_check: Existant process XXXX left active oplock"
 >
 >which can be treated as a warning rather than a fatal error.
 >
 >If people who have been suffering from this problem could
 >either try the 2.2 CVS or apply this patch to their 2.2.2
 >code and test to see if the problems reported are fixed,
 >I'd greatly appreciate it.
 >
 >Thanks,
 >
 >	Jeremy Allison,
 >	Samba Team.
[...actual patch removed...]
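
Since the patch itself is not reproduced here, the following is only a
rough Python model of the decision described in the quoted message. It is
not Samba code. The names process_exists, resolve_stale_oplock and the
share_modes dict are hypothetical stand-ins, and the signal-0 probe is
just the usual POSIX way to ask whether a pid is alive. It also assumes,
from the description above, that after logging the warning the waiting
smbd goes on to remove the stale record.

```python
import errno
import os


def process_exists(pid):
    """Return True if a process with this pid appears to exist (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the pid
    except OSError as e:
        # ESRCH: no such process. EPERM: it exists but we may not signal it.
        return e.errno != errno.ESRCH
    return True


def resolve_stale_oplock(owner_pid, share_modes, fatal_if_owner_alive, log,
                         exists=process_exists):
    """Model a waiting smbd that has given up on an oplock break.

    share_modes: hypothetical {pid: record} dict standing in for the
                 share mode db.
    fatal_if_owner_alive: True models the pre-patch behaviour, False the
                 patched behaviour (as described in the message above).
    log: list that collects warning messages.
    """
    if owner_pid not in share_modes:
        return "already-gone"  # the owner cleaned up in time
    if exists(owner_pid):
        if fatal_if_owner_alive:
            # Pre-patch: a live owner is treated as a logic error.
            return "panic"
        # Patched: the owner is probably still busy cleaning up, so warn only.
        log.append("open_mode_check: Existant process %d left active oplock"
                   % owner_pid)
        del share_modes[owner_pid]  # assumption: the record is then removed
        return "warned-and-removed"
    del share_modes[owner_pid]  # owner is dead, safe to remove its record
    return "removed"
```

With fatal_if_owner_alive=True, a slow-exiting smbd (a) makes every waiting
smbd return "panic". With False, they log the warning and carry on, which
is the behaviour the message above says to expect after the patch.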