[Samba] [ctdb] Unable to run startrecovery event (if mail content is encrypted, please see the attached file)
martin at meltin.net
Wed Sep 5 10:28:24 UTC 2018
Thanks for reporting this. It looks very interesting and we will fix
it all as soon as we understand it! :-)
On Wed, 5 Sep 2018 16:29:31 +0800 (CST), "zhu.shangzhong--- via samba"
<samba at lists.samba.org> wrote:
> There is a 3-node ctdb cluster running. When one of the 3 nodes is
> powered down, lots of logs are written to log.ctdb.
Can you please let us know what version of Samba/CTDB you're using?
Note that you're referring to nodes 1, 2, 3 while CTDB numbers the
nodes 0, 1, 2. In fact, the situation is a little more confused than
that:
> Power down node3
> The node1 log is as follow:
> 2018/09/04 04:29:33.402108 ctdbd: 10.231.8.65:4379: node 10.231.8.67:4379 is dead: 1 connected
> 2018/09/04 04:29:33.414817 ctdbd: Tearing down connection to dead node :0
It appears that the node you're calling node 3 is the one CTDB calls
node 0! Can you please post the output of "ctdb status" when all nodes
are up and running?
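For reference, "ctdb status" output looks roughly like this when
everything is healthy (the addresses here are illustrative, taken
from node 1's point of view):

  Number of nodes:3
  pnn:0 10.231.8.67       OK
  pnn:1 10.231.8.65       OK (THIS NODE)
  pnn:2 10.231.8.66       OK
  Generation:1234567890
  Size:3
  hash:0 lmaster:0
  hash:1 lmaster:1
  hash:2 lmaster:2
  Recovery mode:NORMAL (0)
  Recovery master:1

The "pnn" column shows the node numbers that CTDB uses in its logs.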
I'm guessing that your nodes file looks like:
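  10.231.8.67
  10.231.8.65
  10.231.8.66

(the first two addresses are from your logs; the third is an
assumption). CTDB assigns node numbers from the order of the lines in
this file, so the node on the first line becomes node 0.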
> node1: repeated logs:
> 2018/09/04 04:35:06.414369 ctdbd: Recovery has started
> 2018/09/04 04:35:06.414944 ctdbd: connect() failed, errno=111
> 2018/09/04 04:35:06.415076 ctdbd: Unable to run startrecovery event
This repeated failure (errno 111 is ECONNREFUSED, meaning ctdbd can no
longer connect to its event daemon) is due to this:
> 2018/09/04 04:29:55.570212 ctdb-eventd: Bad talloc magic value - wrong talloc version used/mixed
> 2018/09/04 04:29:57.240533 ctdbd: Eventd went away
We have fixed a similar issue in some versions. Once we know what
version you are running, we can say whether it is a known issue or a
new one.
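If the version isn't obvious from your packaging, then something like
the following should show it (the exact output varies between
versions and distributions):

  # ctdb version
  CTDB version: 4.8.4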
I have been working on the following issue for most of this week:
> 2018/09/04 04:29:52.465663 ctdbd: This node (1) is now the recovery master
> 2018/09/04 04:29:55.468771 ctdb-recoverd: Election period ended
> 2018/09/04 04:29:55.469404 ctdb-recoverd: Node 2 has changed flags - now 0x8 was 0x0
> 2018/09/04 04:29:55.469475 ctdb-recoverd: Remote node 2 had flags 0x8, local had 0x0 - updating local
> 2018/09/04 04:29:55.469514 ctdb-recoverd: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
> 2018/09/04 04:29:55.469525 ctdb-recoverd: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
> 2018/09/04 04:29:55.563522 ctdb-recoverd: Unable to take recovery lock - contention
> 2018/09/04 04:29:55.563573 ctdb-recoverd: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> 2018/09/04 04:29:55.563585 ctdb-recoverd: Banning node 1 for 300 seconds
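Some context on those last few lines: the recovery lock is a lock on a
file in the cluster filesystem, and only the recovery master should be
able to hold it, so "Unable to take recovery lock - contention" means
another node appeared to be holding it during recovery. That should
not happen in a healthy cluster. Depending on your CTDB version, the
lock is configured along these lines, using the path from your logs:

  # e.g. in /etc/ctdb/ctdbd.conf (or /etc/sysconfig/ctdb) on older releases
  CTDB_RECOVERY_LOCK=/share-fs/export/ctdb/.ctdb/reclock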
Are you able to recreate this every time? Sometimes? Rarely?
I hadn't seen this until recently and I'm now worried that it is more
widespread than we realise.
peace & happiness,
martin