Setting up CTDB on OCFS2 and VMs ...

Mon Dec 15 06:25:34 MST 2014

On 13/12/14 13:55, Michael Adam wrote:
> On 2014-12-13 at 12:31 +0000, Rowland Penny wrote:
>> OK, I now have a single node up and running as per the
>> instructions provided by Ronnie.
> Yay!
>
>> I just have a few questions:
>>
>> there is this in the ctdb log
>>
>> 2014/12/13 11:52:43.522708 [ 5740]: Set runstate to INIT (1)
>> 2014/12/13 11:52:43.540992 [ 5740]: 00.ctdb: awk: line 2: function gensub
>> never defined
>> 2014/12/13 11:52:43.543178 [ 5740]: 00.ctdb: awk: line 2: function gensub
>> never defined
>> 2014/12/13 11:52:43.545354 [ 5740]: 00.ctdb: awk: line 2: function gensub
>> never defined
> In Debian, there are several possible awk versions
> that provide awk: at least: mawk, gawk, original-awk.
> Which is chosen if multiple are installed, depends on
> the alternatives-mechanism:
>    update-alternatives --display awk
>    update-alternatives --edit awk
>
> A quick web search has reveiled that only the gawk
> (gnu awk) variant might feature the needed gensub
> function.
>
> Maybe we should change "awk" to "gawk" in our scripts
> and packages would need to adapt their dependencies.
>
>
>> 2014/12/13 11:52:56.931393 [recoverd: 5887]: We are still serving a public
>> IP '127.0.0.3' that we should not be serving. Removing it
>> 2014/12/13 11:52:56.931536 [ 5740]: Could not find which interface the ip
>> address is hosted on. can not release it
>> 2014/12/13 11:52:56.931648 [recoverd: 5887]: We are still serving a public
>> IP '127.0.0.2' that we should not be serving. Removing it
>>
>> The above three lines are there 4 times
> I guess this will not be the case any more, when you
> move to a more realistic setup where you don't use loopback
> for hosting nodes internal and public addresses, but
> for a start that is ok.
>
>> the final 4 lines are:
>>
>> 2014/12/13 11:53:02.982441 [ 5740]: monitor event OK - node re-enabled
>> 2014/12/13 11:53:02.982480 [ 5740]: Node became HEALTHY. Ask recovery master
>> 0 to perform ip reallocation
>> 2014/12/13 11:53:02.982733 [recoverd: 5887]: Node 0 has changed flags - now
>> 0x0  was 0x2
>> 2014/12/13 11:53:02.983266 [recoverd: 5887]: Takeover run starting
>> 2014/12/13 11:53:03.046859 [recoverd: 5887]: Takeover run completed
>> successfully
>>
>> ctdb status shows:
>>
>> Number of nodes:1
>> pnn:0 127.0.0.1        OK (THIS NODE)
>> Generation:740799152
>> Size:1
>> hash:0 lmaster:0
>> Recovery mode:NORMAL (0)
>> Recovery master:0
> Great!
>
>> Now I know it works, I just have to pull it all together.
> Right. Next step: take a "real" ethernet interface
> and first use that for nodes address. You can even
> start here with a single node.
>
> You can also go towards more realistic clusters in two
> steps: First no public addresse, only the nodes file.
> That is the core of a ctdb cluster. Then you can go
> towards cluster-resource management and add public
> addresse and also CTDB_MANAGES_SAMBA and friends.
>
> One further note:
> Virtual machines or even containers (lxc or docker) are
> awesome for setting up such clusters for learing and
> testing. I use that for development myselves.
>
> And here is one (imho) very neat trick:
> If you use lxc containers (or docker can probably also
> do that), you can completely take the complexity
> of having to set up a cluster file system out of
> the equation: You can just bind mount a directory
> of the host file system into the node containers'
> root file systems by the lxc fstab file.
> Thereby you have a posix-file system that is shared
> between the nodes and you can use that as cluster FS.
>
> This way, you can concentrate on ctdb and samba immediately
> until you are comfortable with that.
>
> I wanted at some point to provide a mechanism to set
> such a thing up automatically, by just providing some
> config files. Maybe I'll investigate the vagrant+puppet
> approach that Ralph Böhme has recently posted in this
> or a related thread...
>
> Cheers - Michael
>

Getting closer :-)

I now have two ctdb nodes up and running:

root at cluster1:~# ctdb status
Number of nodes:3 (including 1 deleted nodes)
pnn:1 192.168.1.10     OK (THIS NODE)
pnn:2 192.168.1.11     OK
Generation:1073761636
Size:2
hash:0 lmaster:1
hash:1 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:1

This is with CTDB_RECOVERY_LOCK turned off, if I turn it on, the nodes 
go unhealthy. I am putting the lockfile on the shared cluster, should I 
be putting it somewhere else, it says at the top of /etc/default/ctdb :

# Shared recovery lock file to avoid split brain.  No default.
#
# Do NOT run CTDB without a recovery lock file unless you know exactly
# what you are doing.
#CTDB_RECOVERY_LOCK=/some/place/on/shared/storage

As I don't know what I am doing, I need to run the recovery lockfile :-D

Rowland