TDB lock contention during "startup" event caused winbind crash

Kenny Dinh kdinh at peaxy.net
Tue May 31 18:38:19 UTC 2016


Hi Martin,

This issue occurred one time when the system was under load.  I have not
seen it again.  There were many other issues with the system before it got
to this point.  The file system was slow to respond to the recovery lock
requests, as you have noticed.

I will definitely let you know if I encounter this issue again on the 4.4.x
branch.  It is highly unlikely though.

Thank you for the pointer to the parallel recovery helper code.

As for your question on what caused the cluster to go into recovery at
"2015/11/18 08:13:52.078635": my setup has 3 CTDB nodes.  CTDB services on
all 3 nodes were restarted at around 08:01.  The CTDB processes on all 3
nodes were stuck at the "startup" stage.  At 08:13:52, node_1 stopped
receiving "keep_alive" messages from node_0, declared node_0 dead, and
that put the cluster into recovery.  Again, there were all kinds of other
issues with the cluster at that time.
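
To spell out my mental model of that detection: each node counts keepalive
silence per interval and declares a peer dead once a threshold is passed.
The sketch below is only an illustration of that idea; the names and the
threshold handling are placeholders, not the actual ctdbd code or tunables.

  /* Illustration only: a per-peer "missed keepalive" counter.  Names and
   * threshold handling are made up; they do not match the real ctdbd
   * implementation. */
  #include <stdbool.h>
  #include <stdint.h>

  struct peer_state {
      uint32_t keepalives_rx;     /* keepalives seen this interval */
      uint32_t silent_intervals;  /* consecutive intervals with none */
  };

  /* Called once per keepalive interval for each remote node. */
  static bool peer_is_dead(struct peer_state *p, uint32_t limit)
  {
      if (p->keepalives_rx == 0) {
          p->silent_intervals++;
      } else {
          p->silent_intervals = 0;
      }
      p->keepalives_rx = 0;

      /* Too many silent intervals: declare the peer dead, which in turn
       * forces the cluster into recovery. */
      return p->silent_intervals > limit;
  }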

I don't think we should spend more time on this issue.  Attached are the
logs from 3 ctdb nodes, if you are curious.

Thanks,
Kenny



On Sat, May 28, 2016 at 3:37 AM, Martin Schwenke <martin at meltin.net> wrote:

> Hi Kenny,
>
> On Fri, 27 May 2016 15:41:49 -0700, Kenny Dinh <kdinh at peaxy.net> wrote:
>
> > This is a one-time occurrence.  I was not able to reproduce it even with
> > ctdb 2.5.x and samba 4.1.x.  Please forgive my ignorance, but I haven't
> > been able to see the new change that fixed the Samba/CTDB deadlock in
> > Samba 4.4.
> >
> > From the log, what I saw was the contention for g_lock.tdb between
> > winbind and smb processes rather than CTDB.
>
> Yeah, but they usually play well together.  They should grab a
> transaction lock using g_lock, do the transaction on the persistent
> database and then release the lock.
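
A minimal sketch of that take-g_lock/transact/release pattern, using
placeholder names rather than the real Samba g_lock/dbwrap calls:

  /* Sketch only: every name here is a placeholder, not the Samba API. */
  struct pdb {
      const char *lock_name;      /* key used for the global lock */
  };

  int take_global_lock(const char *name);       /* placeholder */
  void release_global_lock(const char *name);   /* placeholder */
  int transaction_start(struct pdb *db);        /* placeholder */
  int store_record(struct pdb *db, const char *key, const char *value);
  int transaction_commit(struct pdb *db);       /* placeholder */

  static int update_persistent_record(struct pdb *db,
                                      const char *key, const char *value)
  {
      int ret;

      /* 1. Take the per-database transaction lock via the global lock. */
      if (take_global_lock(db->lock_name) != 0) {
          return -1;
      }

      /* 2. Run the transaction against the persistent database. */
      ret = transaction_start(db);
      if (ret == 0) {
          store_record(db, key, value);
          ret = transaction_commit(db);
      }

      /* 3. Always release the lock so other smbd/winbindd processes are
       *    not blocked behind this one. */
      release_global_lock(db->lock_name);
      return ret;
  }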
>
> The pstree output in the hung script debugging is weird.  There's never
> a significant process tree and it doesn't contain anything that can get
> stuck.
>
> Something seems to be putting the cluster back into recovery:
>
>   2015/11/18 08:13:52.078635 [13482]: server/ctdb_recover.c:612 Recovery mode set to ACTIVE
>   2015/11/18 08:13:52.081472 [13482]: Freeze priority 1
>   2015/11/18 08:13:52.157107 [13482]: DB Attach to database dbwrap_watchers.tdb deferred for client with pid:14734 since node is in recovery mode.
>
> This seems to be coming from the other node.
>
> This is usually a bad sign:
>
>   2015/11/18 08:14:28.513418 [13482]: Unable to get ALLDB locks for 36 seconds
>
> It indicates that something is holding a record lock, so database
> recovery can't occur.
>
> You also look to be having issues with recovery lock file handling:
>
>   2015/11/18 08:16:26.306347 [13482]: server/ctdb_recover.c:487 set_recmode child process hung/timedout CFS slow to grant locks? (allowing recmode set anyway)
>
> And you're seeing lots of these:
>
>   2015/11/18 08:12:35.774968 [13482]: Handling event took 12 seconds!
>
> On real clusters we only see them under high load.  On virtual clusters
> they can happen fairly frequently during recovery but not usually at other
> times.
>
> It can be very difficult to debug these sorts of things without having
> the logs from all nodes.  Our usual practice is to log to syslog with a
> sortable date/time format and then merge/sort the logs from all nodes
> to get an overall view of what is happening on the cluster.
>
> > I reviewed the code path in v4-4-stable and the code path to start a
> > transaction on tdb is still the same.
>
> > db_ctdb_transaction_start -> ctdbd_control ==> CTDB: ctdb_control_db_attach
>
> Hmmm... usually databases are attached before something tries to do a
> transaction.
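
To illustrate the ordering Martin means here, attach once up front and only
transact later; the names below are placeholders, not the real dbwrap_ctdb
API:

  /* Sketch only: attach while the cluster is healthy, so a later
   * transaction never waits on an ATTACH control, which ctdbd defers while
   * the node is in recovery (as in the dbwrap_watchers.tdb line in the log
   * excerpt above). */
  #include <stddef.h>

  struct cdb;                                     /* placeholder handle */
  struct cdb *attach_db(const char *name);        /* sends ATTACH control */
  int cdb_transaction(struct cdb *db, const char *key, const char *value);

  static int expected_pattern(void)
  {
      /* Attach once, at connect/startup time. */
      struct cdb *db = attach_db("mydb.tdb");
      if (db == NULL) {
          return -1;
      }

      /* Later transactions reuse the already-attached handle. */
      return cdb_transaction(db, "key", "value");
  }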
>
> > From the git history, the change that looks like it could affect this
> > code path might be commit "81ac247c10c362b643c2b3ce57df4fe6ecae78dd".
> >
> > Is this the time when the rework of that code path was being done?  Any
> > pointer to where to look for the rework code would be greatly
> > appreciated.
>
> Yes, that's part of the parallel recovery code.  The actual parallel
> recovery helper is introduced in commit
> 951431910dc6d00815c19e087bc2546b1daa591b.  However, you're unlikely to
> be able to backport parallel recovery to 2.5.x.  Amitay wrote
> completely new protocol and client code so that the parallel recovery
> helper could be cleanly written to be fully async.  So there are about
> 20,000 lines of code added before the introduction of the recovery
> helper!
>
> We do lots of testing and I can't remember seeing these sorts of
> issues.  It would be nice to know if you see the same issues with
> Samba/CTDB 4.4 or even with the master branch.  If you can't do that
> then can you please at least try CTDB 2.5.6, since it has many
> additional fixes on top of 2.5.4?
>
> peace & happiness,
> martin
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all_logs.zip
Type: application/zip
Size: 163017 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20160531/31f2a317/all_logs-0001.zip>

