CTDB asymmetric (non-)recovery

Nicolas Ecarnot nicolas.ecarnot at gmail.com
Wed Jun 6 08:47:56 MDT 2012


On 06/06/2012 11:00, Steven Whitehouse wrote:
>> Bonus question: do you know of a better channel where I could ask a
>> precise ctdb question?
>>
>
> I probably missed that... I'm just catching up after the Jubilee
> holiday :-) Abhi should be able to point you in the right direction wrt
> ctdb,

Thank you Steve.

Well, as Abhi is in Cc, here is the situation:
> I had a 2-node cluster running just fine under Ubuntu Server 11.10, with
> cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
>
> So I decided to upgrade to Precise (12.04)

Ctdb seems to run fine, just as it did under 11.10.

But an asymmetric behaviour is affecting my setup.
My tests show the following (Test 04 is spelled out command-by-command 
after the list):

Test 01
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK

Test 02
- both nodes down (ctdb stop)
- node 1 : ctdb start : OK
- node 0 : ctdb start : both OK

Test 03
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK
- node 1 : ctdb stop : OK
- node 1 : ctdb start : both OK

Test 04
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK
- node 0 : ctdb stop : OK
- node 0 : ctdb start : node 0 down, only node 1 OK !!!
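
To be precise about what "start", "stop" and "OK" mean above, here is 
Test 04 spelled out. This is only a sketch, assuming the daemon is 
driven through the Ubuntu init script and that the node state is read 
from "ctdb status":

   # both nodes down
   root@node0:~# service ctdb stop
   root@node1:~# service ctdb stop

   # start node 0 and wait until it reports OK
   root@node0:~# service ctdb start
   root@node0:~# ctdb status     -> node 0 OK, node 1 DISCONNECTED

   # start node 1
   root@node1:~# service ctdb start
   root@node1:~# ctdb status     -> both nodes OK

   # stop and restart node 0
   root@node0:~# service ctdb stop
   root@node0:~# service ctdb start
   root@node0:~# ctdb status     -> node 0 never becomes OK, only node 1 OK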

I ran these tests both with ctdb managing samba+winbind and with ctdb 
not managing them.

Without ctdb managing them, Test 04 succeeds far more often, but not on 
every try, so I guess this is not related to the 50.samba script.
(Could it be related to timing?)
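
For clarity, when I say ctdb "manages" samba+winbind, I mean the two 
usual switches in /etc/default/ctdb (path and variable names as I 
understand the Debian/Ubuntu packaging), which the 50.samba event 
script checks:

   CTDB_MANAGES_SAMBA=yes
   CTDB_MANAGES_WINBIND=yes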

I'm reading the docs and the source files to understand what's wrong.
In my log files, what differs between the successful situations 
(Tests 01, 02, 03) and the failure (Test 04) is the following error 
message looping:

Good situation:
|| [...]
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [recoverd: 5894]: Trigger takeoverrun
|| [18295]: CTDB_WAIT_UNTIL_RECOVERED
|| [ 1077]: startup event OK - enabling monitoring
|| [...]

Bad situation:
|| [...]
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
|| [... repeating and looping indefinitely ...]


On the healthy node, every 10 seconds I see:
|| [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8068 opcode:18 dstnode:0
|| [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
|| [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8070 opcode:18 dstnode:0
|| [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
|| [18295]: Recovery daemon ping timeout. Count : 0
|| [recoverd:18343]: Could not find idr:8068

I also see signs that node 1 cannot pull the db from node 0. Is that 
just a consequence, or something else?
What is my cluster trying to whisper to my deaf ears?
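
If more output would help, I can of course collect it on both nodes 
with the plain ctdb tool while the loop is running, for instance:

   root@node0:~# ctdb status
   root@node0:~# ctdb getrecmaster
   root@node1:~# ctdb status
   root@node1:~# ctdb getrecmaster

Just tell me what would be useful.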

-- 
Nicolas Ecarnot

