CTDB asymmetric (non-)recovery

Abhijith Das adas at redhat.com
Wed Jun 6 09:03:20 MDT 2012



----- Original Message -----
> From: "Nicolas Ecarnot" <nicolas.ecarnot at gmail.com>
> To: "Steven Whitehouse" <swhiteho at redhat.com>
> Cc: adas at redhat.com, samba-technical at samba.org
> Sent: Wednesday, June 6, 2012 9:47:56 AM
> Subject: CTDB asymmetric (non-)recovery
> 
> On 06/06/2012 11:00, Steven Whitehouse wrote:
> >> Bonus question: do you know of a better channel where I could ask
> >> a precise ctdb question?
> >>
> >
> > I probably missed that... I'm just catching up after the Jubilee
> > holiday :-) Abhi should be able to point you in the right direction
> > wrt ctdb,
> 
> Thank you Steve.
> 
> Well, as Abhi is in Cc, here is the situation:
> > I had a 2-node cluster running just fine under Ubuntu server 11.10,
> > with cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
> >
> > So I decided to upgrade to Precise (12.04)
> 
> CTDB seems to run fine, just as it did under 11.10.
> 
> But an asymmetric behaviour is affecting my setup.
> My tests show this:
> 
> Test 01
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
> 
> Test 02
> - both nodes down (ctdb stop)
> - node 1 : ctdb start : OK
> - node 0 : ctdb start : both OK
> 
> Test 03
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
> - node 1 : ctdb stop : OK
> - node 1 : ctdb start : both OK
> 
> Test 04
> - both nodes down (ctdb stop)
> - node 0 : ctdb start : OK
> - node 1 : ctdb start : both OK
> - node 0 : ctdb stop : OK
> - node 0 : ctdb start : node 0 down, only node 1 OK !!!
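> 
> To make the failing sequence easy to replay, it boils down to this
> (a rough sketch using the Ubuntu service wrapper; onnode ships with
> ctdb and runs a command on the given node over ssh, and the sleep
> values are arbitrary settle times I picked):
> 
>   onnode 0 service ctdb start ; sleep 30   # node 0 alone: OK
>   onnode 1 service ctdb start ; sleep 30   # both nodes: OK
>   onnode 0 service ctdb stop  ; sleep 10   # take node 0 down again
>   onnode 0 service ctdb start ; sleep 30   # node 0 never recovers!
>   ctdb status                              # node 0 stays down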
> 
> I tried running these tests with ctdb managing samba+winbind, and
> then ran the same tests without it managing them.
> 
> Without managing them, Test 04 improves greatly, but not on every
> try, so I guess this is not related to the 50.samba script.
> (Could this be a timing issue???)
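> 
> For reference, this is how I toggle whether ctdb manages those
> services (the file path is my assumption of the Debian/Ubuntu
> packaging default; RHEL-like systems use /etc/sysconfig/ctdb):
> 
>   # /etc/default/ctdb
>   CTDB_RECOVERY_LOCK=/gfs2/.ctdb/lockfile   # placeholder path on the shared fs
>   CTDB_MANAGES_SAMBA=yes      # comment out so 50.samba leaves smbd alone
>   CTDB_MANAGES_WINBIND=yes    # comment out likewise for winbind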
> 
> I'm reading the docs and the source files to understand what's wrong.
> The difference in my log files between the successful runs (Tests 01,
> 02, 03) and the failing one (Test 04) is the following error message
> looping:
> 
> Good situation:
> [...]
> [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> [recoverd: 5894]: Trigger takeoverrun
> [18295]: CTDB_WAIT_UNTIL_RECOVERED
> [ 1077]: startup event OK - enabling monitoring
> [...]
> 
> Bad situation:
> [...]
> [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
> [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
> [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
> [...repeating and looping infinitely...]
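> 
> While that pair of messages loops, ctdbd never completes its startup
> event, so monitoring is never enabled on node 0. A crude way to watch
> for the stuck state (the log path is a guess at the packaging
> default; adjust to wherever your ctdb logs):
> 
>   tail -f /var/log/ctdb/log.ctdb | \
>       grep -E 'CTDB_WAIT_UNTIL_RECOVERED|startup event OK'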
> 
> 
> On the healthy node, I see this every 10 seconds:
> [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8068 opcode:18 dstnode:0
> [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
> [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8070 opcode:18 dstnode:0
> [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
> [18295]: Recovery daemon ping timeout. Count : 0
> [recoverd:18343]: Could not find idr:8068
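> 
> Those "control timed out" / "ctdb_control_recv failed" pairs look
> like the recovery daemon on the healthy node cannot get answers from
> node 0 over the ctdb transport (TCP port 4379 by default). Some quick
> checks (standard ctdb tooling; the IP below is a placeholder for node
> 0's private address):
> 
>   ctdb status            # per-node flags: OK, DISCONNECTED, UNHEALTHY...
>   ctdb -n all ping       # does each ctdbd answer control requests?
>   ctdb getdbmap          # list the TDBs being replicated
>   nc -zv 10.0.0.10 4379  # is node 0's ctdb port reachable from node 1?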
> 
> I also see signs that node 1 cannot pull databases from node 0. Is
> that just a consequence, or something else?
> What is my cluster trying to whisper to my deaf ears?
> 
> --
> Nicolas Ecarnot
> 

