CTDB asymmetric (non-)recovery
Nicolas Ecarnot
nicolas.ecarnot at gmail.com
Wed Jun 6 08:47:56 MDT 2012
On 06/06/2012 11:00, Steven Whitehouse wrote:
>> Bonus question : Do you know which better channel I could ask a precise
>> ctdb question?
>>
>
> I probably missed that... I'm just catching up after the Jubilee
> holiday :-) Abhi should be able to point you in the right direction wrt
> ctdb,
Thank you Steve.
Well, as Abhi is in Cc, here is the situation:
> I had a 2-node cluster running just fine under Ubuntu server 11.10, with
> cman, corosync, GFS2, OCFS2, clvm, ctdb, samba, winbind.
>
> So I decided to upgrade to Precise (12.04)
Ctdb seems to run as fine as it did under 11.10, but an asymmetric
behaviour is affecting my setup.
My tests show the following:
Test 01
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK
Test 02
- both nodes down (ctdb stop)
- node 1 : ctdb start : OK
- node 0 : ctdb start : both OK
Test 03
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK
- node 1 : ctdb stop : OK
- node 1 : ctdb start : both OK
Test 04
- both nodes down (ctdb stop)
- node 0 : ctdb start : OK
- node 1 : ctdb start : both OK
- node 0 : ctdb stop : OK
- node 0 : ctdb start : node 0 down, only node 1 OK !!!
I ran these tests both with ctdb managing samba+winbind and without
managing them.
Without managing them, Test 04 succeeds far more often, but not on every
try, so I guess this is not related to the 50.samba event script.
(Could it be timing-related?)
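For reference, the stop/start sequences above could be scripted along these lines. This is only a sketch: `node0`/`node1` are placeholder hostnames, and whether "ctdb start" means the distribution's init script or something else is an assumption. `RUN` defaults to `echo` so the script only prints the commands (dry run) instead of touching a live cluster:

```shell
#!/bin/sh
# Sketch of the Test 04 sequence from the report.
# node0/node1 are placeholder hostnames (assumptions).
# RUN defaults to 'echo' (dry run); set RUN=ssh to run for real.
RUN="${RUN:-echo}"

step() {
    # $1 = node, $2 = start|stop
    $RUN "$1" "service ctdb $2"
}

# Test 04: stop both, start node 0 then node 1, then bounce node 0.
step node0 stop
step node1 stop
step node0 start
step node1 start
step node0 stop
step node0 start   # in the report, node 0 stays down after this step
```

Repeating the same sequence with the roles of node 0 and node 1 swapped gives Test 03, which is what makes the failure look asymmetric.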
I'm reading the docs and the source files to understand what's wrong.
In my log files, the difference between the success cases (Tests 01, 02,
03) and the failure case (Test 04) is the following error message
looping:
Good situation :
|| [...]
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [recoverd: 5894]: Trigger takeoverrun
|| [18295]: CTDB_WAIT_UNTIL_RECOVERED
|| [ 1077]: startup event OK - enabling monitoring
|| [...]
Bad situation :
|| [...]
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
|| [recoverd: 5894]: The interfaces status has changed on local node 1 - force takeover run
|| [ 5846]: CTDB_WAIT_UNTIL_RECOVERED
|| [Infinite repeating and looping...]
On the healthy node, I see every 10 seconds :
|| [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8068 opcode:18 dstnode:0
|| [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
|| [recoverd:18343]: client/ctdb_client.c:990 control timed out. reqid:8070 opcode:18 dstnode:0
|| [recoverd:18343]: client/ctdb_client.c:1101 ctdb_control_recv failed
|| [18295]: Recovery daemon ping timeout. Count : 0
|| [recoverd:18343]: Could not find idr:8068
I also see messages saying that node 1 cannot pull the database from
node 0. Is that just a consequence, or something else?
What is my cluster trying to whisper to my deaf ears?
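As a quick way to tell the two situations apart without scrolling through the whole log, one could count the CTDB_WAIT_UNTIL_RECOVERED repetitions and check whether the "startup event OK" line was ever reached. The patterns below are taken from the excerpts above; the log path in the usage comment is only a guess:

```shell
#!/bin/sh
# Classify a ctdb log excerpt: a healthy start eventually logs
# "startup event OK - enabling monitoring"; the bad case keeps
# looping on CTDB_WAIT_UNTIL_RECOVERED without ever reaching it.
classify_log() {
    waits=$(grep -c 'CTDB_WAIT_UNTIL_RECOVERED' "$1")
    if grep -q 'startup event OK' "$1"; then
        echo "recovered after $waits wait(s)"
    else
        echo "stuck: $waits wait(s), no startup event"
    fi
}

# Example usage (log path is an assumption):
# classify_log /var/log/log.ctdb
```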
--
Nicolas Ecarnot
More information about the samba-technical mailing list