Some questions about ctdb availability when a node has crashed

Sun Nov 11 00:23:02 MST 2012

Hello list,

I installed ctdb (version 1.13) on Ubuntu 12.04
and ran the following test to simulate a crashed cluster member.

My cluster consists of four nodes {node-0, node-1, node-2, node-3}.
First, I checked that the cluster was healthy,
i.e. that every node's status was OK.
I then stored a record (key /tmp/test) in a persistent tdb (test.tdb)
with this command:

$sudo ctdb pstore test.tdb /tmp/test /tmp/test

I then fetched this record repeatedly on node-1.
Next, I took node-1's network interface down with this command
to simulate a crash:

$sudo ifconfig eth0 down
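
For anyone reproducing this, the repeated fetch mentioned above can be
scripted roughly as follows. This is only a sketch: the post does not
say which command was used for the fetch, so "ctdb pfetch" is an
assumption, and FETCH_CMD, FETCH_INTERVAL and repeat_fetch are
hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: run a fetch command repeatedly and stop at the first failure.
# ASSUMPTION: the fetch command; the post does not name the one used.
FETCH_CMD=${FETCH_CMD:-"ctdb pfetch test.tdb /tmp/test"}

# repeat_fetch COUNT
# Runs $FETCH_CMD up to COUNT times, pausing FETCH_INTERVAL seconds
# (default 1) after each successful attempt. Prints the number of the
# first failed attempt and returns 1, or returns 0 if all succeeded.
repeat_fetch() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        if ! $FETCH_CMD >/dev/null 2>&1; then
            echo "fetch failed on attempt $i"
            return 1
        fi
        i=$((i + 1))
        sleep "${FETCH_INTERVAL:-1}"
    done
    return 0
}
```

Calling "repeat_fetch 1000" as root on the node under test would keep
reading the record. Note that a fetch that blocks instead of failing
will also block this loop.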

Now node-1 is banned, and the cluster status (as reported on node-2) is:

Number of nodes:4
pnn:0 192.168.1.118    OK
pnn:1 192.168.1.120    DISCONNECTED|UNHEALTHY|INACTIVE
pnn:2 192.168.1.128    OK (THIS NODE)
pnn:3 192.168.1.149    OK
Generation:157463799
Size:3
hash:0 lmaster:0
hash:1 lmaster:2
hash:2 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:2

It looks fine.
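
The node flags in the status output above can also be checked
mechanically. A minimal sketch, assuming the "pnn:N ADDRESS FLAGS" line
layout shown in this post (not_ok_nodes is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: list nodes whose flags are not plain "OK".
# Reads "ctdb status" output on stdin and prints "PNN ADDRESS FLAGS"
# for every node line whose third field is something other than OK.
# ASSUMPTION: node lines look like "pnn:1 192.168.1.120 FLAGS".
not_ok_nodes() {
    awk '/^pnn:/ && $3 != "OK" {
        sub(/^pnn:/, "", $1)   # keep only the node number
        print $1, $2, $3
    }'
}
```

Used as "ctdb status | not_ok_nodes", this would print one line for
node 1 with the status shown above and nothing for the other nodes.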

However, the records in test.tdb can no longer be fetched,
even though "ctdb getdbstatus" reports test.tdb as healthy
and "ctdb catdb" can still dump its contents:

key(20) = "__transaction_lock__"
dmaster: 1
rsn: 529
data(4) = "\F0<\00\00"

key(9) = "/tmp/test"
dmaster: 0
rsn: 2
data(4) = "iii\0A"

Dumped 2 records
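
To compare the dmaster values in a dump like this against the node
list, the key and dmaster of each record can be paired up with a small
awk filter. This is a sketch that assumes the key(...)/dmaster: layout
shown above; key_dmasters is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print "DMASTER KEY" for every record in "ctdb catdb" output.
# Reads the dump on stdin.
# ASSUMPTION: each record has a line of the form key(N) = "..."
# followed by a line of the form dmaster: N, as in this post.
key_dmasters() {
    awk '
        /^key\(/ {
            # keep the text between the first and the last double quote
            key = $0
            sub(/^[^"]*"/, "", key)
            sub(/"$/, "", key)
        }
        /^dmaster:/ { print $2, key }
    '
}
```

Used as "ctdb catdb test.tdb | key_dmasters", this would show at a
glance which records list node 1 as their dmaster.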

My conjecture is that the records in test.tdb cannot be fetched
because the dmaster of "__transaction_lock__" is node-1,
and node-1 has crashed.

My questions are:
1. Is my conjecture correct?
2. Is there a timeout for __transaction_lock__?
   In my test, the record could not be fetched for more than an hour.
3. Is there a workaround for this situation, or a way to fix it?

az