CTDB asymmetric (non-)recovery

Nicolas Ecarnot nicolas at ecarnot.net
Thu Jun 7 04:22:49 MDT 2012

On 07/06/2012 08:32, Martin Schwenke wrote:
> On Wed, 06 Jun 2012 23:02:57 +0200, Nicolas Ecarnot
> <nicolas at ecarnot.net>  wrote:
>> I have to add : when being in the infinite cycle of death (node0 unable
>> to recover), stopping ctdb on node1 leads to node0 recovering well.
> Yikes, so it really is stuck in recovery.  I'm not sure how to debug
> that...  :-(
>> Martin : Does your question suggest the issue lies in the scripting part?
> No, I was just being lazy and making sure that the issue isn't in the
> scripting part...  that's where many similar looking issues are...
> peace & happiness,
> martin


I increased the log level to 9 (damn, this IS verbose) and tried to
extract the relevant part of the loop on the failing node (though
nothing proves yet that the unhealthy node _is_ the faulty one).

The log file is here: http://pastebin.com/YEwrkmPx

Here are some points I have to add, because I must make this cluster
work: on each node, I'm using bonding on two interfaces, but I'm using
this same bond0 interface for both public and private (intra-cluster)
communication.
I know this is sub-optimal, but (obviously) I have no other choice.

Continuing my tests, I saw today that this non-recovery problem is not
asymmetric: I managed to get the same issue on node0.

I mention the network because I suspect a network-related issue, as
I'm seeing strange things:
When a node gets stuck in the loop, it displays
"The interfaces status has changed on local node X..."

From time to time (rarely, but it happens), while a node is looping,
'ping' keeps working but ssh no longer does (though I have a working
passwordless ssh setup).

Well, I'd be glad if a developer could explain to me what ctdb does to
interfaces: what actions does it take, and what does it monitor?

Tests I could do:
- change the bonding mode
- depending on your answer: change timing/wait values (here I'm a bit
lost, because there are numerous values I could play with)

Tests that would be nearly impossible for me:
- use dedicated interfaces for the private network
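
As a starting point for the bonding tests, here is a minimal sketch (not
something ctdb provides, and not a diagnosis) for checking whether the
bond's slaves are flapping. It assumes the Linux bonding driver's usual
status file, /proc/net/bonding/bond0, and its "Key: value" layout. The
function name parse_bonding_status and the SAMPLE text are made up for
illustration.

```python
# Hypothetical helper: summarise the Linux bonding driver's status text.
# SAMPLE is invented illustration data, not output from the cluster.
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:22:33:44:55

Slave Interface: eth1
MII Status: down
Link Failure Count: 3
Permanent HW addr: 00:11:22:33:44:56
"""


def parse_bonding_status(text):
    """Return the bonding mode, active slave and per-slave link state."""
    summary = {"mode": None, "active_slave": None, "slaves": []}
    current = None  # slave section being read; None while in the bond header
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so MAC addresses stay intact.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Currently Active Slave":
            summary["active_slave"] = value
        elif key == "Slave Interface":
            current = {"name": value, "mii_status": None, "link_failures": None}
            summary["slaves"].append(current)
        elif current is not None and key == "MII Status":
            current["mii_status"] = value
        elif current is not None and key == "Link Failure Count":
            current["link_failures"] = int(value)
    return summary


def read_bond(path="/proc/net/bonding/bond0"):
    """Parse the live status file (path is an assumption about the setup)."""
    with open(path) as handle:
        return parse_bonding_status(handle.read())
```

Calling read_bond() on each node just before and just after a node
enters the loop, and comparing the link_failures counters, would show
whether a slave link actually dropped around the time of the
"interfaces status has changed" message.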

Nicolas Ecarnot
