[Samba] [ctdb] Unable to run startrecovery event (if mail content is encrypted, please see the attached file)
martin at meltin.net
Wed Sep 5 10:28:24 UTC 2018
Thanks for reporting this. It looks very interesting and we will fix
it all as soon as we understand it! :-)
On Wed, 5 Sep 2018 16:29:31 +0800 (CST), "zhu.shangzhong--- via samba"
<samba at lists.samba.org> wrote:
> There is a 3-node ctdb cluster running. When one of the 3 nodes is
> powered down, lots of logs are written to log.ctdb.
Can you please let us know what version of Samba/CTDB you're using?
Note that you're referring to nodes 1, 2, 3 while CTDB numbers the
nodes 0, 1, 2. In fact, the situation is a little more confused than
that:
> Power down node3
> The node1 log is as follow:
> 2018/09/04 04:29:33.402108 ctdbd: 10.231.8.65:4379: node 10.231.8.67:4379 is dead: 1 connected
> 2018/09/04 04:29:33.414817 ctdbd: Tearing down connection to dead node :0
It appears that the node you're calling node 3 is the one CTDB calls
node 0! Can you please post the output of "ctdb status" when all nodes
are up and running?
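For reference, "ctdb status" output looks roughly like this when
everything is healthy (the addresses here are illustrative, taken
from node 1's point of view):

  Number of nodes:3
  pnn:0 10.231.8.67       OK
  pnn:1 10.231.8.65       OK (THIS NODE)
  pnn:2 10.231.8.66       OK
  Generation:1234567890
  Size:3
  hash:0 lmaster:0
  hash:1 lmaster:1
  hash:2 lmaster:2
  Recovery mode:NORMAL (0)
  Recovery master:1

The "pnn" column shows the node numbers that CTDB uses in its logs.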
I'm guessing that your nodes file looks like:
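  10.231.8.67
  10.231.8.65
  10.231.8.66

(the first two addresses are from your logs; the third is an
assumption). CTDB assigns node numbers from the order of the lines in
this file, so the node on the first line becomes node 0.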
> node1: repeated logs:
> 2018/09/04 04:35:06.414369 ctdbd: Recovery has started
> 2018/09/04 04:35:06.414944 ctdbd: connect() failed, errno=111
> 2018/09/04 04:35:06.415076 ctdbd: Unable to run startrecovery event
This repeated failure (errno 111 is ECONNREFUSED, meaning ctdbd can no
longer connect to its event daemon) is due to this:
> 2018/09/04 04:29:55.570212 ctdb-eventd: Bad talloc magic value - wrong talloc version used/mixed
> 2018/09/04 04:29:57.240533 ctdbd: Eventd went away
We have fixed a similar issue in some versions. Once we know what
version you are running, we can say whether it is a known issue or a
new one.
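If the version isn't obvious from your packaging, then something like
the following should show it (the exact output varies between
versions and distributions):

  # ctdb version
  CTDB version: 4.8.4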
I have been working on the following issue for most of this week:
> 2018/09/04 04:29:52.465663 ctdbd: This node (1) is now the recovery master
> 2018/09/04 04:29:55.468771 ctdb-recoverd: Election period ended
> 2018/09/04 04:29:55.469404 ctdb-recoverd: Node 2 has changed flags - now 0x8 was 0x0
> 2018/09/04 04:29:55.469475 ctdb-recoverd: Remote node 2 had flags 0x8, local had 0x0 - updating local
> 2018/09/04 04:29:55.469514 ctdb-recoverd: ../ctdb/server/ctdb_recoverd.c:1267 Starting do_recovery
> 2018/09/04 04:29:55.469525 ctdb-recoverd: Attempting to take recovery lock (/share-fs/export/ctdb/.ctdb/reclock)
> 2018/09/04 04:29:55.563522 ctdb-recoverd: Unable to take recovery lock - contention
> 2018/09/04 04:29:55.563573 ctdb-recoverd: Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
> 2018/09/04 04:29:55.563585 ctdb-recoverd: Banning node 1 for 300 seconds
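Some context on those last few lines: the recovery lock is a lock on a
file in the cluster filesystem, and only the recovery master should be
able to hold it, so "Unable to take recovery lock - contention" means
another node appeared to be holding it during recovery. That should
not happen in a healthy cluster. Depending on your CTDB version, the
lock is configured along these lines, using the path from your logs:

  # e.g. in /etc/ctdb/ctdbd.conf (or /etc/sysconfig/ctdb) on older releases
  CTDB_RECOVERY_LOCK=/share-fs/export/ctdb/.ctdb/reclock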
Are you able to recreate this every time? Sometimes? Rarely?
I hadn't seen this until recently and I'm now worried that it is more
widespread than we realise.
peace & happiness,
martin