TDB lock contention during "startup" event caused winbind crash

Martin Schwenke martin at meltin.net
Sat May 28 10:37:50 UTC 2016


Hi Kenny,

On Fri, 27 May 2016 15:41:49 -0700, Kenny Dinh <kdinh at peaxy.net> wrote:

> This is a one time occurrence.  I was not able to reproduce it even with
> ctdb 2.5.x and samba 4.1.x.  Please forgive my ignorance, but I haven't
> been able to see the new change that fixed the Samba/CTDB deadlock in Samba
> 4.4
> 
> From the log, what I saw was the contention for g_lock.tdb between winbind
> and smb processes rather than CTDB.

Yeah, but they usually play well together.  They should grab a
transaction lock using g_lock, do the transaction on the persistent
database and then release the lock.
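
In raw tdb terms that pattern is just lock, transact, unlock.  Here's a
minimal sketch using plain libtdb (this is not the actual dbwrap_ctdb
code path, and the file and lock names are only examples):

  /* Serialise via a lock record in a "lock" tdb (standing in for
   * g_lock.tdb), run the transaction on the persistent tdb, then
   * drop the lock.  Build with: gcc example.c -ltdb */
  #include <tdb.h>
  #include <fcntl.h>

  static int locked_update(void)
  {
      TDB_DATA lockkey = { (unsigned char *)"transaction_lock", 16 };
      TDB_DATA key = { (unsigned char *)"k", 1 };
      TDB_DATA val = { (unsigned char *)"v", 1 };
      int ret = -1;

      struct tdb_context *lockdb = tdb_open("g_lock_example.tdb", 0,
                                            TDB_DEFAULT,
                                            O_RDWR | O_CREAT, 0600);
      struct tdb_context *db = tdb_open("persistent_example.tdb", 0,
                                        TDB_DEFAULT,
                                        O_RDWR | O_CREAT, 0600);
      if (lockdb == NULL || db == NULL) {
          return -1;
      }

      /* 1. grab the transaction lock */
      if (tdb_chainlock(lockdb, lockkey) != 0) {
          goto done;
      }

      /* 2. do the transaction on the persistent database */
      if (tdb_transaction_start(db) == 0) {
          if (tdb_store(db, key, val, TDB_REPLACE) != 0) {
              tdb_transaction_cancel(db);
          } else if (tdb_transaction_commit(db) == 0) {
              ret = 0;
          }
      }

      /* 3. release the lock */
      tdb_chainunlock(lockdb, lockkey);
  done:
      tdb_close(db);
      tdb_close(lockdb);
      return ret;
  }

In the clustered case the lock record lives in g_lock.tdb, which is
why contention there shows up when something in this sequence gets
stuck.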

The pstree output in the hung script debugging is weird.  There's never
a significant process tree and it doesn't contain anything that can get
stuck.

Something seems to be putting the cluster back into recovery:

  2015/11/18 08:13:52.078635 [13482]: server/ctdb_recover.c:612 Recovery mode set to ACTIVE
  2015/11/18 08:13:52.081472 [13482]: Freeze priority 1
  2015/11/18 08:13:52.157107 [13482]: DB Attach to database dbwrap_watchers.tdb deferred for client with pid:14734 since node is in recovery mode.

This seems to be coming from the other node.

This is usually a bad sign:

  2015/11/18 08:14:28.513418 [13482]: Unable to get ALLDB locks for 36 seconds

It indicates that something is holding a record lock, so database
recovery can't occur.
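
In other words, one client sitting on a single record is enough to
block the whole-database locks that recovery needs.  A toy
demonstration with plain libtdb (example file name, nothing
CTDB-specific):

  /* A child holds a chain (record) lock while the parent tries to take
   * the allrecord lock; the parent blocks until the child lets go. */
  #include <tdb.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      TDB_DATA key = { (unsigned char *)"rec", 3 };

      if (fork() == 0) {
          struct tdb_context *t = tdb_open("demo.tdb", 0, TDB_DEFAULT,
                                           O_RDWR | O_CREAT, 0600);
          tdb_chainlock(t, key);     /* hold a record lock ...      */
          sleep(10);                 /* ... while doing "work"      */
          tdb_chainunlock(t, key);
          return 0;
      }

      sleep(1);                      /* let the child grab its lock */
      struct tdb_context *t = tdb_open("demo.tdb", 0, TDB_DEFAULT,
                                       O_RDWR | O_CREAT, 0600);
      tdb_lockall(t);                /* stalls here until released  */
      printf("got the allrecord lock\n");
      tdb_unlockall(t);
      return 0;
  }

So the interesting question is which process was sitting on a record
in one of those databases and why it didn't let go.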

You also seem to be having issues with recovery lock file handling:

  2015/11/18 08:16:26.306347 [13482]: server/ctdb_recover.c:487 set_recmode child process hung/timedout CFS slow to grant locks? (allowing recmode set anyway)
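
If you want a rough feel for how slowly the cluster filesystem grants
locks, a crude probe is to time a whole-file fcntl lock on a scratch
file on that filesystem (the path below is only an example, and this
is not what ctdbd itself runs):

  #include <fcntl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "/clusterfs/lock-probe";   /* example path */
      struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
      struct timespec t0, t1;

      int fd = open(path, O_RDWR | O_CREAT, 0600);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      clock_gettime(CLOCK_MONOTONIC, &t0);
      if (fcntl(fd, F_SETLKW, &fl) != 0) {         /* wait for the lock */
          perror("fcntl");
          return 1;
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);

      printf("lock granted after %.3f seconds\n",
             (t1.tv_sec - t0.tv_sec) +
             (t1.tv_nsec - t0.tv_nsec) / 1e9);
      return 0;
  }

Run a copy on each node against the same file if you want to see
contended behaviour.  The ping_pong utility that ships with the ctdb
sources exercises this sort of thing much more thoroughly.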

And you're seeing lots of these:

  2015/11/18 08:12:35.774968 [13482]: Handling event took 12 seconds!

On real clusters we only see them under high load.  On virtual clusters they can happen fairly frequently during recovery but not usually at other times.

It can be very difficult to debug these sorts of things without having
the logs from all nodes.  Our usual practice is to log to syslog with a
sortable date/time format and then merge/sort the logs from all nodes
to get an overall view of what is happening on the cluster.

> I reviewed the code path in v4-4-stable and the code path to start a
> transaction on tdb is still the same.

> db_ctdb_transaction_start -> ctdbd_control ==> CTDB: ctdb_control_db_attach

Hmmm... usually databases are attached before something tries to do a
transaction.

> From the git history, the change that looks like it could affect this code
> path might be commit "81ac247c10c362b643c2b3ce57df4fe6ecae78dd".
> 
> Is this the time when the rework of that code path was being done?  Any
> pointer to where to look for the rework code would be greatly appreciated.

Yes, that's part of the parallel recovery code.  The actual parallel
recovery helper is introduced in commit
951431910dc6d00815c19e087bc2546b1daa591b.  However, you're unlikely to
be able to backport parallel recovery to 2.5.x.  Amitay wrote
completely new protocol and client code so that the parallel recovery
helper could be cleanly written to be fully async.  So there are about
20,000 lines of code added before the introduction of the recovery
helper!

We do lots of testing and I can't remember seeing these sorts of
issues.  It would be nice to know if you see the same issues with
Samba/CTDB 4.4 or even with the master branch.  If you can't do that
then can you please at least try CTDB 2.5.6, since it has many
additional fixes on top of 2.5.4?

peace & happiness,
martin


