Samba 2.2.2 oplock problem

Tue Nov 27 15:10:05 GMT 2001

Hi all,

	I think I've finally worked out what is causing the panic
error messages. These are the

open_mode_check: Existant process XXXX left active oplock.

messages that people have mainly been seeing on Solaris (but
occasionally on Linux also).

I think it's a race condition caused by the cleanup code in
locking/locking.c that ensures the share mode database contains
no entries from a terminating smbd, and the code in open.c that
ensures an open file has no exclusive oplock entries left.

It would normally occur with a heavily contended file. The
scenario looks something like this:

smbd (a) sends its client an oplock break message due to open
requests from smbd's (b), (c) ... (z), all of which can
happen concurrently.

The client of smbd (a) fails to respond to the break request
(happens sometimes, bad cabling, client dead, whatever..).

smbd (a) then decides it's time to exit. In doing so it
goes through the share mode/open file database, deleting
records for open files it has. It then does a second traverse
of the share mode db looking for any records it may have
missed (that would be a logic error). This second traversal
is very expensive, and unnecessary (it's been removed in
the current 2.2 and HEAD CVS code). The whole point is that
this termination could take a relatively long time, depending
on the contention on the share mode db (this is the variable
part which is why it's been impossible to get a reproducible
test case).

In the meantime, smbd's (b) ... (z) are scanning the share
mode db, waiting for the record that caused them to send the
oplock break to be removed. Eventually they give up and decide
to remove the record themselves. Before doing that, currently
in the 2.2.2 and CVS (2.2 and HEAD) code, they check if the
process owning that record still exists. If it does, they
consider it a logic error and terminate themselves. THIS ASSUMPTION
IS THE FLAW. As noted above, the cleanup process may take a
relatively long time, and as such it's not an error if the
process still exists; it's (hopefully) doing its best to
clean up and die.
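
To make the changed decision concrete, here is a minimal standalone
sketch of the logic a waiting smbd would follow under the new
behaviour. It is not the Samba code: owner_still_exists(),
handle_stale_oplock() and the enum are hypothetical names made up for
illustration, and the kill(pid, 0) probe is an assumption about how
such an existence check is commonly written on POSIX systems.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes, for illustration only. */
enum stale_action {
	REMOVE_SILENTLY,      /* owner is gone: just delete the stale record */
	REMOVE_WITH_WARNING   /* owner still exists: log it, then delete anyway */
};

/*
 * Hypothetical helper: returns 1 if a process with this pid exists.
 * Signal 0 sends nothing; kill() only performs its existence and
 * permission checks. EPERM means the process exists but we are not
 * allowed to signal it.
 */
int owner_still_exists(pid_t pid)
{
	assert(pid > 0);
	if (kill(pid, 0) == 0)
		return 1;
	return errno == EPERM;
}

/*
 * Decide what to do with a share mode record whose owner never
 * answered the oplock break. A live owner is no longer treated as a
 * fatal logic error, since it may simply be slow in cleaning up.
 */
enum stale_action handle_stale_oplock(pid_t owner)
{
	if (owner_still_exists(owner)) {
		fprintf(stderr,
			"warning: process %d still exists but left an active oplock\n",
			(int)owner);
		return REMOVE_WITH_WARNING;
	}
	return REMOVE_SILENTLY;
}
```

In both branches the caller goes on to delete the record. The only
difference from the old behaviour is that a still-running owner
produces a log line instead of a panic.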

The following simple patch (already applied to 2.2 and HEAD CVS)
should apply cleanly to a 2.2.2 source tree, and if this
assumption is correct, should fix the problem. If the
case described above occurs, all that should happen now
is log messages stating 

"open_mode_check: Existant process XXXX left active oplock"

which can be treated as a warning rather than a fatal error.

If people who have been suffering from this problem could
either try the 2.2 CVS or apply this patch to their 2.2.2
code and test to see if the problems reported are fixed,
I'd greatly appreciate it.

Thanks,

	Jeremy Allison,
	Samba Team.

Warning, this patch goes over the 80 character limit on many
email clients. If it doesn't apply cleanly, check that your email
client isn't wrapping lines at 80 columns.

--------------------------------------------------------------------

--- locking/locking.c.orig	Tue Nov 27 14:18:28 2001
+++ locking/locking.c	Tue Nov 27 14:18:46 2001
@@ -330,8 +330,10 @@
 
 		/* delete any dead locks */
 
+#if 0
 		if (!open_read_only)
 			tdb_traverse(tdb, delete_fn, &check_self);
+#endif
 
 		if (tdb_close(tdb) != 0)
 			return False;
--- smbd/open.c.orig	Tue Nov 27 14:17:22 2001
+++ smbd/open.c	Tue Nov 27 14:18:07 2001
@@ -555,11 +555,8 @@
 dev = %x, inode = %.0f. Deleting it to continue...\n", (int)broken_entry.pid, fname, (unsigned int)dev, (double)inode));
 
 					if (process_exists(broken_entry.pid)) {
-						pstring errmsg;
-						slprintf(errmsg, sizeof(errmsg)-1, 
-									"open_mode_check: Existant process %d left active oplock.\n",
-								broken_entry.pid );
-						smb_panic(errmsg);
+						DEBUG(0,("open_mode_check: Existant process %d left active oplock.\n",
+								broken_entry.pid ));
 					}
 
 					if (del_share_entry(dev, inode, &broken_entry, NULL) == -1) {