Hung smbd sessions with robust mutexes for TDBs.(Similar to BUG 13267)

Fri Mar 23 18:05:25 UTC 2018

We came across a glibc issue with robust mutex handling.
https://sourceware.org/bugzilla/show_bug.cgi?id=19402

According to that report, a race in mutex unlock can lead to missed wakeups. However, this race should not have affected multiple smbds when each one is operating on a different list (hash chain).

Thanks,
Hemanth.

From: Hemanth Thummala <hemanth.thummala at nutanix.com>
Date: Thursday, 22 March 2018 at 12:03 AM
To: "samba-technical at lists.samba.org" <samba-technical at lists.samba.org>
Subject: Hung smbd sessions with robust mutexes for TDBs.(Similar to BUG 13267)

Hi Everyone,

We recently came across an issue at one of our customers. After a client access issue was reported, we observed a bunch of smbds in a hung state (in futex_wait). We have enabled robust mutexes for TDB access. We are currently running Samba version 4.3.11 (+ security patches).
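For readers unfamiliar with the feature, the following is a minimal sketch of what a robust, process-shared mutex provides. It is not Samba or tdb code, and the function name robust_owner_death_demo is hypothetical. A child process takes the mutex and exits without unlocking it; the parent's pthread_mutex_lock() then returns EOWNERDEAD instead of blocking forever, and the parent marks the mutex consistent again. The sketch does not reproduce the glibc race mentioned above; it only illustrates the owner-death recovery that robust mutexes are meant to give.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo, not Samba code.
 * Returns the result of the parent's pthread_mutex_lock(),
 * or -1 if the setup fails.
 */
static int robust_owner_death_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;
    pid_t pid;
    int ret;

    /* The mutex must live in memory shared by both processes. */
    m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) {
        return -1;
    }

    if (pthread_mutexattr_init(&attr) != 0) {
        return -1;
    }
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (pthread_mutex_init(m, &attr) != 0) {
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pid = fork();
    if (pid == -1) {
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock and die while still holding it. */
        pthread_mutex_lock(m);
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) != pid) {
        return -1;
    }

    /* Parent: the previous owner is gone, so expect EOWNERDEAD. */
    ret = pthread_mutex_lock(m);
    if (ret == EOWNERDEAD) {
        /* We hold the lock now; declare the protected state usable. */
        pthread_mutex_consistent(m);
    }
    if (ret == 0 || ret == EOWNERDEAD) {
        pthread_mutex_unlock(m);
    }
    pthread_mutex_destroy(m);
    munmap(m, sizeof(*m));
    return ret;
}
```

On Linux with glibc, calling robust_owner_death_demo() from main() should return EOWNERDEAD. Older glibc versions need -pthread at link time.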

Here is the bt of one of the hung processes.

(gdb) bt
#0  0x00007fabe7996594 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabe79915a2 in _L_robust_lock_261 () from /lib64/libpthread.so.0
#2  0x00007fabe79910ff in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#3  0x00007fabe1a9b094 in chain_mutex_lock (m=0x7fabd19e6a88, waitflag=true) at ../lib/tdb/common/mutex.c:182
#4  0x00007fabe1a9b1cd in tdb_mutex_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true, pret=0x7fff8b25cf18)
   at ../lib/tdb/common/mutex.c:234
#5  0x00007fabe1a8fc52 in fcntl_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true) at ../lib/tdb/common/lock.c:44
#6  0x00007fabe1a8fdec in tdb_brlock (tdb=0x556c385f6d40, rw_type=0, offset=412, len=1, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:174
#7  0x00007fabe1a90349 in tdb_nest_lock (tdb=0x556c385f6d40, offset=412, ltype=0, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:346
#8  0x00007fabe1a90593 in tdb_lock_list (tdb=0x556c385f6d40, list=61, ltype=0, waitflag=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:438
#9  0x00007fabe1a9063b in tdb_lock (tdb=0x556c385f6d40, list=61, ltype=0) at ../lib/tdb/common/lock.c:456
#10 0x00007fabe1a8d285 in tdb_find_lock_hash (tdb=0x556c385f6d40, key=..., hash=1922915160, locktype=0, rec=0x7fff8b25d120)
   at ../lib/tdb/common/tdb.c:118
#11 0x00007fabe1a8d669 in tdb_parse_record (tdb=0x556c385f6d40, key=..., parser=0x7fabe13ddce1 <db_tdb_parser>, private_data=0x7fff8b25d1a0)
   at ../lib/tdb/common/tdb.c:245
#12 0x00007fabe13dddc6 in db_tdb_parse (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap_tdb.c:231
#13 0x00007fabe13d9d03 in dbwrap_parse_record (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap.c:387
#14 0x00007fabe7346ca1 in fetch_share_mode_unlocked (mem_ctx=0x556c38678980, id=...) at ../source3/locking/share_mode_lock.c:650
#15 0x00007fabe733ad8a in get_file_infos (id=..., name_hash=0, delete_on_close=0x0, write_time=0x7fff8b25d4e0)
   at ../source3/locking/locking.c:615
#16 0x00007fabe71dcf8e in smbd_dirptr_get_entry (ctx=0x556c386e8820, dirptr=0x556c3879ab90, mask=0x556c3875bf10 "*", dirtype=22,
   dont_descend=false, ask_sharemode=true, match_fn=0x7fabe7233c58 <smbd_dirptr_lanman2_match_fn>,
   mode_fn=0x7fabe7233faa <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fff8b25d600, _fname=0x7fff8b25d620, _smb_fname=0x7fff8b25d618,
   _mode=0x7fff8b25d668, _prev_offset=0x7fff8b25d628) at ../source3/smbd/dir.c:1194
#17 0x00007fabe7237a53 in smbd_dirptr_lanman2_entry (ctx=0x556c386e8820, conn=0x556c3864b8e0, dirptr=0x556c3879ab90, flags2=53313,

We missed checking the current mutex owner and its user-level stack, which would have identified the operation it was blocked in while holding the mutex. Many smbds were in the hung state, all of them in futex wait.

Looking at the db_tdb_fetch_locked() code, there do not seem to be any system calls or other blocking operations after the mutex is obtained. We are wondering what caused the smbds to go into that state. We also observed that different smbd processes were hung on separate list (hash chain) mutexes.
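To make the "one mutex per hash chain" point concrete, here is a small sketch of how a key hash maps to a chain and how that chain maps to the lock offset seen in the backtrace. The constants are assumptions taken from our reading of the tdb sources (a 168-byte header used as FREELIST_TOP, 4-byte chain heads), and the hash size of 10007 is only inferred because it reproduces list=61 from hash=1922915160; none of this is read from the actual database. The helper names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: size of the on-disk TDB header (FREELIST_TOP in tdb). */
#define ASSUMED_FREELIST_TOP 168u
/* ASSUMPTION: each hash chain head is a 4-byte offset. */
#define ASSUMED_OFF_SIZE 4u

/* Hypothetical helper: chain (list) index for a given key hash. */
static uint32_t chain_for_hash(uint32_t hash, uint32_t hash_size)
{
    return hash % hash_size;
}

/*
 * Hypothetical helper: byte offset used to lock one chain. With mutex
 * locking enabled, this offset selects the per-chain mutex instead of
 * an fcntl byte-range lock.
 */
static uint32_t chain_lock_offset(uint32_t list)
{
    return ASSUMED_FREELIST_TOP + ASSUMED_OFF_SIZE * list;
}
```

With these assumptions, chain_lock_offset(61) gives 412, which matches offset=412 in frames #4 to #7, and chain_for_hash(1922915160, 10007) gives 61, which matches list=61 in frame #8. Two smbds working on keys that hash to different chains would therefore wait on different mutexes.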

We have collected a couple of offline cores, but without the shared memory pages that hold the actual mutex state (including the owner PID). We are not sure whether those cores will be useful for debugging this issue. Please let us know how smbd could get into this state.
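If the shared memory pages can be captured next time (or inspected on a live system), the owner can be read from the 32-bit futex word of the mutex. The sketch below decodes such a word using the bit layout of the Linux robust futex ABI (the same values as FUTEX_WAITERS, FUTEX_OWNER_DIED and FUTEX_TID_MASK in <linux/futex.h>). Where that word sits inside pthread_mutex_t is a glibc implementation detail, so the sketch takes the raw value as input; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as the FUTEX_* constants in <linux/futex.h>. */
#define LOCK_WAITERS    0x80000000u /* at least one waiter is queued  */
#define LOCK_OWNER_DIED 0x40000000u /* previous owner exited uncleanly */
#define LOCK_TID_MASK   0x3fffffffu /* kernel thread ID of the owner   */

/* Hypothetical result type for the decoded lock word. */
struct lock_word_info {
    uint32_t owner_tid; /* 0 means no current owner */
    bool waiters;
    bool owner_died;
};

/* Split a raw robust-futex lock word into its three fields. */
static struct lock_word_info decode_lock_word(uint32_t word)
{
    struct lock_word_info info;

    info.owner_tid = word & LOCK_TID_MASK;
    info.waiters = (word & LOCK_WAITERS) != 0;
    info.owner_died = (word & LOCK_OWNER_DIED) != 0;
    return info;
}
```

For example, a word of 0x800004d2 decodes to owner TID 1234 with waiters queued. A word with the waiters bit set but an owner TID of 0, while processes are still asleep in futex_wait, would be worth comparing against the missed-wakeup scenario in the glibc report.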

Thanks,
Hemanth.