Locking and VFS modules question

Tue Jul 30 10:38:08 MDT 2013

As part of coding a local VFS opaque module we have run into a problem 
with locking causing core dumps.  We're at a relatively early stage of our 
development and the target filesystem that we're interfacing to doesn't 
really support locking, so we're just interested in simple share mode 
locks currently (no byte range locking or anything fancy).  In our VFS's 
create_file() routine we try to get share mode locks when we open files 
using:

   struct share_mode_lock *lck = NULL;
   lck = get_share_mode_lock(handle, fsp->file_id,
                             handle->conn->connectpath,
                             smb_fname, &old_write_time);
   if (lck == NULL) {
       DEBUG(0, ("Could not get share mode lock\n"));
       return NT_STATUS_SHARING_VIOLATION;
   }
   set_share_mode(lck, fsp, get_current_uid(handle->conn),
                  req ? req->mid : 0,
                  fsp->oplock_type);

This seems to work and I can dump out a sensible looking share_mode_lock 
data structure.  We can then do some file operations (read, write, etc) 
and when we finally close we try to get rid of the share mode lock in 
our close() VFS function:

   struct share_mode_lock *lck = NULL;
   lck = get_existing_share_mode_lock(talloc_tos(), fsp->file_id);

   if (lck != NULL) {
       int res = 0;
       res = del_share_mode(lck, fsp);
       if(res) {
           DEBUG(0,("del_share_mode() OK\n"));
       } else {
           DEBUG(0,("del_share_mode() failed\n"));
       }
       TALLOC_FREE(lck);
   } else {
       DEBUG(0,("Didn't get a share mode lock!\n"));
   }

This again seems to run fine - we get a "del_share_mode() OK" message in 
the logs and the close() function then does its thing.

However, once the close() function has completed (including deleting the 
file if delete-on-close is set) we get a core dump just after some 
warnings about locking issues:

Jul 30 16:51:13 toby smbd_audit: [2013/07/30 16:51:13.977998,  0, pid=25206, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:193(dbwrap_check_lock_order)
Jul 30 16:51:13 toby smbd_audit:   Lock order violation: Trying /usr/local/samba/var/lock/smbXsrv_tcon_global.tdb at 1 while /usr/local/samba/var/lock/locking.tdb at 1 is locked
Jul 30 16:51:13 toby smbd_audit: [2013/07/30 16:51:13.978075,  0, pid=25206, effective(0, 0), real(0, 0)] ../lib/dbwrap/dbwrap.c:133(debug_lock_order)
Jul 30 16:51:13 toby smbd_audit:   lock order:  1:/usr/local/samba/var/lock/locking.tdb 2:<none> 3:<none>
Jul 30 16:51:13 toby smbd_audit: [2013/07/30 16:51:13.978133,  0, pid=25206, effective(0, 0), real(0, 0)] ../source3/lib/util.c:810(smb_panic_s3)
Jul 30 16:51:13 toby smbd_audit:   PANIC (pid 25206): invalid lock_order

So it seems there's something wrong with my simple minded idea of creating 
a share mode lock when opening a file and then freeing it when we close 
the file.
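
To make the complaint in the log concrete, here is a toy model of the 
rule that the "Lock order violation" message seems to imply: a database 
may only be locked at level N if nothing is currently locked at level N 
or any higher-numbered level.  This is NOT Samba's dbwrap code and I 
haven't confirmed that the real check works exactly this way; toy_lock(), 
toy_unlock() and TOY_LOCK_ORDER_MAX are names made up for illustration, 
and the three levels are just taken from the "1: 2: 3:" line in the log.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: three levels, matching the "1: 2: 3:" line in the log. */
#define TOY_LOCK_ORDER_MAX 3

/* One slot per level; NULL means nothing is locked at that level. */
static const char *toy_locked[TOY_LOCK_ORDER_MAX];

/*
 * Try to lock database 'name' at 1-based level 'order'.
 * Returns false (and prints a message) if a database is already
 * locked at the same level or at any higher-numbered level.
 */
static bool toy_lock(const char *name, int order)
{
    if (order < 1 || order > TOY_LOCK_ORDER_MAX) {
        return false;
    }
    for (int i = order - 1; i < TOY_LOCK_ORDER_MAX; i++) {
        if (toy_locked[i] != NULL) {
            printf("violation: trying %s at %d while %s at %d is locked\n",
                   name, order, toy_locked[i], i + 1);
            return false;
        }
    }
    toy_locked[order - 1] = name;
    return true;
}

/* Release whatever is held at 1-based level 'order'. */
static void toy_unlock(int order)
{
    if (order >= 1 && order <= TOY_LOCK_ORDER_MAX) {
        toy_locked[order - 1] = NULL;
    }
}
```

In this model, calling toy_lock("locking.tdb", 1) and then 
toy_lock("smbXsrv_tcon_global.tdb", 1) without a toy_unlock(1) in between 
fails in the same way as the log above, whereas asking for the second 
database at level 2 succeeds.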

After much scratching of heads (as both I and my co-coder are learning 
Samba internals as we go along) I "fixed" (hacked round) the issue by 
tweaking smbXsrv_open_global_init() in source3/smbd/smbXsrv_open.c to put 
smbXsrv_open_global.tdb in at DBWRAP_LOCK_ORDER_2.  I shouldn't really 
have to fiddle in there whilst writing a VFS module, but I was getting a 
bit desperate! ;-)

Now this works for us at the moment insofar as the PANICs and core 
dumps have gone away and we can carry on developing other parts of our VFS 
module.  However I'd like to find out what is really causing this issue - 
why are two bits of Samba code both trying to get their .tdb files into 
lock order position 1?  Is this some sort of race condition? What should I 
really be doing in the VFS module create_file()/close() functions to stop 
this in the first place?  Is it a result of our VFS module being opaque 
and not representing what's in the server's own file system (our module 
is effectively a "bridge" between CIFS and a remote filestore)? Am I just 
being a moron with an obvious bug (highly likely!)?

Any pointers and/or suggestions welcome.

Jon