TDB lock contention during "startup" event caused winbind crash

Mon Jun 6 02:03:21 UTC 2016

Hi Kenny,

On Tue, 31 May 2016 11:38:19 -0700, Kenny Dinh <kdinh at peaxy.net> wrote:

> This issue occurred one time when the system was under load.  I have not
> seen it again.  There were many other issues with the system before it got
> to this point.  The file system was slow to respond to the recovery lock
> requests, as you have noticed.
> 
> I will definitely let you know if I encounter this issue again on the 4.4.x
> branch.  It is highly unlikely though.
> 
> Thank you for the pointer to the parallel recovery helper code.
> 
> As for your question about what caused the cluster to go into recovery at
> "2015/11/18 08:13:52.078635": my setup has 3 CTDB nodes.  CTDB services on all 3 nodes
> were restarted at around 8:01.  CTDB processes on all 3 CTDB nodes were
> stuck at the "startup" stage.  At 08:13:52, node_1 was not able to receive
> "keep_alive" messages from node_0 and declared that it was dead, which put
> the cluster into recovery.  Again, there were all kinds of other issues
> with the cluster at that time.
> 
> I don't think we should spend more time on this issue.  Attached are the
> logs from 3 ctdb nodes, if you are curious.

I took a very quick look, but there are no obvious, identifiable
problems apart from the ones I have already mentioned.

I hope things work better with 4.4.x.  :-)

peace & happiness,
martin