segfault related to locking in 2.0.0beta1

Andy Bakun abakun at reac.com
Thu Nov 19 19:57:14 GMT 1998


I've been using Samba 2.0.0 as a PDC since alpha 6.  We've had this
problem on our network ever since, and only now have I been able to
track it down.

Clients are WinNT4 SP3 and SP4 running Office97.  Samba 2.0.0beta1 is
running on Linux 2.0.35 RedHat 5.1.  We are extremely heavy Word 97
users.  I mean heavy in that the average file size is between 2.5 and 4
megs and between 2 and 5 documents are usually open at once.

The problem manifests itself on the clients as dialog boxes that say
something to the effect of "Word was unable to save the document because
the network connection was lost or the floppy disk is missing", or "A
buffered write failed. Some data may be lost".  No data is ever actually
missing, but Word sometimes quits after displaying this message.  The
message varies: sometimes it has a Retry button, other times just an OK
button.  Sometimes the message comes from the operating system rather
than from Word.

I tracked this down to the fact that smbd was segfaulting when trying to
break an oplock.  When smbd dies, the client reconnects, so nothing is
lost but the users get worried when they see these messages.
Interestingly enough, some machines don't show any messages at all -- my
workstation (NT4SP4) doesn't display the dialog boxes even when smbd
segfaults.  Also, I cannot get it to segfault predictably.  I tracked
the crash down to a dereference of a pointer that was NULL.  This may
indicate a race condition.  Here is a patch to file_find_dit in files.c:

*** files.c.orig  Thu Oct 22 22:34:50 1998
--- files.c     Thu Nov 19 13:27:42 1998
***************
*** 262,268 ****
        files_struct *fsp;

        for (fsp=Files;fsp;fsp=fsp->next,count++) {
!               if (fsp->open &&
                    fsp->fd_ptr->dev == dev &&
                    fsp->fd_ptr->inode == inode &&
                    (tval ? (fsp->open_time.tv_sec == tval->tv_sec) : True ) &&
--- 262,268 ----
        files_struct *fsp;

        for (fsp=Files;fsp;fsp=fsp->next,count++) {
!               if (fsp->open && fsp->fd_ptr &&
                    fsp->fd_ptr->dev == dev &&
                    fsp->fd_ptr->inode == inode &&
                    (tval ? (fsp->open_time.tv_sec == tval->tv_sec) : True ) &&

This patch fixes the problem, but I don't know whether it is the right
fix.  When fsp->fd_ptr is NULL, should file_find_dit skip the entry (and
possibly return NULL), or should the fsp->fd_ptr->dev/inode comparisons
be assumed true, the way the tval comparison is when tval is NULL?

It might also be important to note that we use some DOS programs that
run in a DOS box on NT and do file locking.  Before I applied this
patch, the list of locked files (shown by smbstatus) for these programs
would grow, and multiple instances would appear even though there was
only ever one instance of the program running (apparently, the locks
were not correctly released on the samba end when the DOS program
exited).  Overall, the crashing has stopped, and the locked files list
seems more sane.
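
For anyone who wants to see the guard in isolation, here is a minimal,
self-contained sketch of the same idea.  The struct and function names
(fd_info, file_entry, find_by_dev_inode) are hypothetical, simplified
stand-ins and not the real smbd definitions; it takes the "skip the
entry" interpretation, which is what the patch above does.

```c
#include <stddef.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified stand-ins for the smbd structures. */
struct fd_info {
    long dev;
    long inode;
};

struct file_entry {
    bool open;
    struct fd_info *fd_ptr;   /* assumed to be NULL in some states */
    time_t open_sec;
    struct file_entry *next;
};

/*
 * Return the first open entry matching dev/inode, or NULL if none.
 * If tval is non-NULL, the open time must match as well.
 * Entries whose fd_ptr is NULL are skipped instead of dereferenced.
 */
static struct file_entry *find_by_dev_inode(struct file_entry *head,
                                            long dev, long inode,
                                            const time_t *tval)
{
    for (struct file_entry *e = head; e != NULL; e = e->next) {
        if (e->open && e->fd_ptr != NULL &&
            e->fd_ptr->dev == dev &&
            e->fd_ptr->inode == inode &&
            (tval == NULL || e->open_sec == *tval)) {
            return e;
        }
    }
    return NULL;
}
```

With this shape, an entry marked open but lacking an fd_ptr can never
match, so the caller sees either a fully populated entry or NULL.
Whether skipping is the correct behaviour for smbd is exactly the open
question above.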

A level 5 debug log is available at
http://www.reac.com/oplock_crash.log.gz.  Obviously, the good parts are
at the end :).  A few of my own debug messages are in there.
I can provide a higher level debug log if necessary.

Andy.


