Oplock logic bugs in Samba 2.2.2

Mon Oct 15 12:55:03 GMT 2001

On Mon, Oct 15, 2001 at 02:26:44PM -0400, acherry at pobox.com wrote:
> 
> Hello-
> 
> I hope it's OK for me to send this to samba-technical.  It seems
> appropriate since it involves a problem that has received a lot of
> attention recently.
> 
> We just recently tried an upgrade to Samba 2.2.2 on our Solaris 2.6
> server.  It seems that oplock-related problems are still present in
> this release, but there also seems to be a tdb locking issue that one
> of the oplock debug attempts might be triggering.
> 
> We have seen a number of messages in log.smbd that follow this
> sequence:
> 
> [2001/10/14 07:08:06, 0] smbd/oplock.c:request_oplock_break(929)
>   request_oplock_break: PANIC : breaking our own oplock requested for dev = 274c740, inode = 251773, tv_sec = 3bc98026, tv_usec = 52ff1 and no fsp found !
> [2001/10/14 07:08:06, 0] lib/util.c:smb_panic(1055)
>   PANIC: request_oplock_break: no fsp found for our own oplock
> [2001/10/14 07:08:06, 0] locking/locking.c:delete_fn(253)
>   locking : delete_fn. LOGIC ERROR ! Entry for pid 15764 and it no longer exists !
> [2001/10/14 07:08:06, 0] locking/locking.c:delete_fn(253)
>   locking : delete_fn. LOGIC ERROR ! Entry for pid 15764 and it no longer exists !
> 
> As more of these occur, the smbd process load on the system gets
> higher and higher.  All of the PANICs seem to be happening in the
> "JRA PARANOIA TEST" section added to oplock.c in 2.2.2.
> 
> The error messages and symptoms would seem to indicate that these
> processes somehow still have the share modes tdb locked when they die.
> After several of these occur, all of the smbd processes on the system
> start to hang, and as the clients request new connections, the memory
> load eventually takes the system down.  (My fault for not setting max
> smb connections...)
> 
> I have not been able to get any core files from the aborted smbd
> processes yet, so I don't have any gdb traces.  We've backed up to
> 2.0.6, but I'm setting up a test system to try to reproduce the
> problem again.  I'm hoping to get a core dump out of one of the
> panicked smbd processes.
> 
> When I attempted to shut down Samba after the failure, many of the
> smbd processes hung while trying to exit, requiring a "kill -9".  Here
> is a gdb backtrace of one of them:
> 
> (gdb) bt
> #0  0xef5b634c in __fcntl () from /usr/lib/libc.so.1
> #1  0xef5e9a60 in s_fcntl () from /usr/lib/libc.so.1
> #2  0x112860 in tdb_brlock (tdb=0x20d188, offset=252, rw_type=2, lck_type=35, 
>     probe=0) at tdb/tdb.c:169
> #3  0x1129a4 in tdb_lock (tdb=0x20d188, list=21, ltype=2) at tdb/tdb.c:199
> #4  0x114088 in tdb_next_lock (tdb=0x20d188, tlock=0xefffccc8, rec=0xefffccd8)
>     at tdb/tdb.c:1101
> #5  0x114300 in tdb_traverse (tdb=0x20d188, fn=0xe4068 <delete_fn>, 
>     state=0xefffcd74) at tdb/tdb.c:1194
> #6  0xe438c in brl_shutdown (read_only=0) at locking/brlock.c:229
> #7  0xe30f0 in locking_end () at locking/locking.c:328
> #8  0x2e2c8 in exit_server (reason=0x1249d8 "caught signal")
>     at smbd/server.c:491
> #9  0x2d684 in dflt_sig () at smbd/server.c:71
> #10 <signal handler called>
> #11 0xef5b743c in poll () from /usr/lib/libc.so.1
> #12 0xef5ccc7c in select () from /usr/lib/libc.so.1
> #13 0x110b24 in sys_select (maxfd=36, fds=0xeffff3b0, tval=0xeffff3a8)
>     at lib/select.c:82
> #14 0x692a0 in receive_message_or_smb (buffer=0x276c19 "", buffer_len=65600, 
>     timeout=60000) at smbd/process.c:202
> #15 0x6a7dc in smbd_process () at smbd/process.c:1254
> #16 0x2ec20 in main (argc=0, argv=0xeffff58c) at smbd/server.c:811

Hmmm. This looks like the problem that was confirmed fixed by
several large Solaris sites with the Samba 2.2.2 CVS code.

Getting more backtraces would be very useful here.

Jeremy.
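
For anyone reading the quoted backtrace: frame #2 shows the exiting smbd
waiting inside fcntl(), called from tdb_brlock() with offset=252. The
sketch below is not Samba code. It is a minimal illustration, assuming a
one-byte POSIX write lock, of how one process holding such a lock makes
another process's request fail (F_SETLK) or block (F_SETLKW). The names
byte_lock and demo_contention, and the scratch file path passed in, are
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: request a one-byte write lock at `offset`.
 * wait != 0 uses F_SETLKW (sleeps until granted), otherwise F_SETLK
 * (returns -1 at once if another process holds a conflicting lock). */
int byte_lock(int fd, off_t offset, int wait)
{
    struct flock fl;

    assert(offset >= 0);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = 1;
    fl.l_pid = 0;
    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/* Hypothetical demo: the parent locks byte 252 of `path`, then a child
 * process asks for the same byte without waiting.
 * Returns 1 if the child was refused, 0 if it got the lock,
 * -1 on a setup error. */
int demo_contention(const char *path)
{
    int status;
    pid_t pid;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    if (byte_lock(fd, 252, 0) != 0) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: record locks belong to a process and are not
         * inherited across fork(), so this request conflicts with
         * the lock the parent still holds. */
        int cfd = open(path, O_RDWR);
        if (cfd < 0)
            _exit(2);
        _exit(byte_lock(cfd, 252, 0) != 0 ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }
    close(fd);      /* closing the descriptor drops the parent's lock */
    unlink(path);
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

Calling demo_contention() on a scratch file on a local filesystem should
return 1, because the child is refused while the parent holds the byte.
With wait set to 1 the child would sleep in fcntl() instead, which is the
state frame #2 shows. POSIX record locks are released when the owning
process closes the file or exits, so a waiter like this is normally
queued behind a process that is still alive.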