ctdb-2.3-3.1 on SLES11 SP2: Timed out waiting for initialisation

Tue Jul 16 20:20:52 MDT 2013

On Tue, 16 Jul 2013 21:12:31 +1000, Martin Schwenke <martin at meltin.net>
wrote:

> On Tue, 16 Jul 2013 08:49:30 +0200, Rainer Krienke
> <krienke at uni-koblenz.de> wrote:
> 
> > I have a running ctdb cluster with 6 nodes based on an OCFS2 Cluster FS.
> > Until now I am still running ctdb 1.2.52 which works without problems.
> > 
> > Yesterday I tried to upgrade ctdb to version 2.3 on one node. However,
> > when I start ctdb on this node using the same settings as for version
> > 1.2, ctdb starts and everything looks fine, but then after exactly 10
> > sec it says
> > 
> > Timed out waiting for initialisation - check logs - killing CTDB

Right now we're unable to reproduce the deadlock problem.  However, as
mentioned below, I don't think that's the problem.  I think something
much simpler is happening...  :-)

[...]

> The timeout comes from the new ctdbd_wrapper (although the timeout isn't
> new - in CTDB 2.2 it was in the initscript).
> 
> However, right now I'm a little bit confused, because I think
> ctdbd_wrapper should succeed, rather than time out, once ctdbd
> moves to the FIRST_RECOVERY runstate:
> 
> > log.ctdb
> > ---------------
> > 2013/07/16 08:29:00.206512 [19648]: CTDB starting on node
> > 2013/07/16 08:29:00.211208 [19649]: Starting CTDBD (Version 2.3) as PID:
> [...]
> > 2013/07/16 08:29:02.035237 [19649]: Set runstate to FIRST_RECOVERY (3)
> [...]

If you look at the timeout loop in /usr/sbin/ctdbd_wrapper::start()
then you will see the various conditions that will cause the loop to
continue through the entire timeout period or to exit early.  I can't
see anything that could be backward incompatible across distributions.
The only suspicious item is the "ctdb runstate ..." command.  Do you
have an old version of the "ctdb" tool somewhere in $PATH?
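
One way to check for a stale copy is to list every executable called
"ctdb" in $PATH, in the order the shell searches. This is only a
sketch: find_in_path is a made-up helper name, not something shipped
with CTDB.

```shell
# Hypothetical helper (not part of CTDB): print every executable
# called $1 that is found on $PATH, in search order.
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty $PATH element means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

find_in_path ctdb
```

If more than one line is printed, the first one is the copy that the
wrapper will normally pick up, and it may not be the one belonging to
the 2.3 package.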

If not, can you please try running the wrapper like this?

  sh -x /usr/sbin/ctdbd_wrapper /var/run/ctdb/ctdbd.pid start

If the loop keeps continuing it must be because "ctdb runstate ..." is
failing.  If that's the case, can you please remove the redirections on
this line?

  if ctdb runstate first_recovery startup running >/dev/null 2>&1 ; then

That is, you want it to look like this:

  if ctdb runstate first_recovery startup running ; then

Then you should see any errors from this command.  If you do, can you
please include them in your reply?
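
If editing the installed wrapper is inconvenient, a similar polling
loop can be run by hand. The following is a rough sketch and not the
code from ctdbd_wrapper: wait_for is a hypothetical helper, and the
one-second interval is an assumption. It leaves the command's output
alone, so any error from "ctdb runstate ..." is visible on each try.

```shell
# Hypothetical helper, not taken from ctdbd_wrapper.
# Usage: wait_for SECONDS COMMAND [ARGS...]
# Runs COMMAND about once a second until it succeeds (returns 0) or
# SECONDS attempts have been made (returns 1).  The command's stdout
# and stderr are not redirected.
wait_for() {
    timeout=$1
    shift
    count=0
    while [ "$count" -lt "$timeout" ]; do
        if "$@"; then
            return 0
        fi
        count=$((count + 1))
        sleep 1
    done
    return 1
}

# Example, to be run while ctdbd is starting (the 10 matches the
# 10 seconds reported above):
#   wait_for 10 ctdb runstate first_recovery startup running
```

If this returns non-zero even though the log shows FIRST_RECOVERY,
then the "ctdb runstate ..." command itself is what is failing.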

Thanks...

peace & happiness,
martin