Re: CTDB and Glusterfs setup

Morten Bøhmer Morten.Bohmer at pilaro.no
Mon Oct 8 04:00:06 MDT 2012


OK, I have been working on this some more.

ping_pong now works just fine. On the two-node setup I run:

ping_pong /mnt/lock/test.dat 3

This gives me about 1500 locks/sec on each node.

When running with -rw, the documentation says that the "data increment" value should increase, but it stays at "1" even when the second node is running ping_pong with -rw.
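
For reference, this is how I invoke the -rw test (same file and the
same count on both nodes, as the ping_pong documentation asks):

ping_pong -rw /mnt/lock/test.dat 3    # on node 1
ping_pong -rw /mnt/lock/test.dat 3    # on node 2, while node 1 runs

If I read the documentation correctly, "data increment" should show 2
once both instances are running.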


Ctdb still gives me the same error message about missing files.


Morten

-----Original Message-----
From: Michael Adam [mailto:obnox at samba.org] 
Sent: 8 October 2012 10:19
To: Morten Bøhmer
Cc: samba-technical at lists.samba.org
Subject: Re: CTDB and Glusterfs setup

Hi Morten,

On 2012-10-08 at 07:28 +0000, Morten Bøhmer wrote:
> Hi all
> 
> I am working on a ctdb setup and would like to use Glusterfs as the shared volume for ctdb.
> 
> I have set up a simple replicated volume with glusterfs:
> 
> gluster volume create lock replica 2 172.16.0.1:/mnt/gluster/lock 172.16.0.2:/mnt/gluster/lock
> 
> I have started this volume and can mount it successfully on /mnt/lock
> 
> I have configured the files for ctdb:
> 
> [root at gluster1 lock]# ls -la
> total 48
> drwxr-xr-x  2 root root 4096 Oct  7 23:09 .
> drwxr-xr-x. 4 root root 4096 Oct  7 00:21 ..
> -rw-r--r--  1 root root  165 Oct  7 23:09 ctdb
> -rw-r--r--  1 root root   22 Oct  7 23:09 nodes
> -rw-r--r--  1 root root   42 Oct  7 23:09 public_addresses
> -rw-r--r--  1 root root  417 Oct  7 23:09 smb.conf
> -rw-------  1 root root    3 Oct  7 23:09 test.dat
> 
> And I tried to run the ping_pong test software:
> 
> [root at gluster1 lock]# /tmp/ping_pong test.dat 1
>   2934 locks/sec
> 
> When I run ping_pong on node 2 while node 1 is running it, the system halts and it is time for a reboot.

Firstly, ping_pong is invoked as "ping_pong <filename> <N>", where N must be at least one greater than the number of ping_pong processes you intend to run. Secondly, all ping_pong instances that operate on a given file simultaneously should be called with the same N.

If you run it on several nodes at once and specify N too small, the system should still not halt: the ping_pong processes should simply block, and you should no longer see a positive locks/sec rate printed.
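
Concretely, for your two-node setup with one ping_pong process per
node, N should be 2 + 1 = 3, the same on both nodes:

ping_pong /mnt/lock/test.dat 3    # run simultaneously on both nodes

Both instances should then keep printing a sensible locks/sec rate.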

If the system crashes or halts, there is something seriously wrong with your cluster file system (Gluster), be it a bug or a configuration issue.

You have to configure Gluster for POSIX fcntl lock support.
Would you share your Gluster config?
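
As a first check (a sketch, not a verified recipe for your setup):
make sure the volume is mounted with the native glusterfs FUSE client
on each node, since that is the access path that passes fcntl locks
through to the bricks, and look at the active volume options:

gluster volume info lock                       # show configured options
mount -t glusterfs 172.16.0.1:/lock /mnt/lock  # native client mount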

> Secondly, I cannot get ctdb to start; it complains about being unable
> to lock the files in the volume, even though the files are fully
> accessible from both nodes using common tools like vi, touch, less,
> and so on:

Until you have the ping_pong test above running reliably, with multiple processes per node and with instances on different nodes simultaneously, there is no point in trying to get ctdb running.
Ctdb relies on correct POSIX fcntl locking semantics for the recovery lock file (unless you disable the recovery lock, which you should not do in a production environment unless you want to risk corrupting your data). Verifying exactly these semantics is what ping_pong was written for.
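
Once ping_pong passes, a minimal sketch of the relevant settings
(standard variable names from /etc/sysconfig/ctdb; the paths are taken
from your log and may need adjusting):

CTDB_RECOVERY_LOCK=/mnt/lock/lockfile
CTDB_NODES=/mnt/lock/nodes
CTDB_PUBLIC_ADDRESSES=/mnt/lock/public_addresses

CTDB_RECOVERY_LOCK must name a file on the shared cluster file system;
ctdb creates the file itself, so the "No such file or directory" in
your log usually means the path is wrong or the volume is not mounted
on that node. (Incidentally, your log shows the --nlist and
--public-addresses options glued onto the lock file path, which looks
like a quoting problem in how ctdbd is started - that is a guess from
the log text, not a diagnosis.)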

Cheers - Michael

> 2012/10/08 08:59:45.072196 [ 8399]: Starting CTDBD as pid : 8399
> 2012/10/08 08:59:45.072640 [ 8399]: Unable to set scheduler to SCHED_FIFO (Operation not permitted)
> 2012/10/08 08:59:45.207976 [ 8399]: Freeze priority 1
> 2012/10/08 08:59:45.208166 [ 8399]: Freeze priority 2
> 2012/10/08 08:59:45.208286 [ 8399]: Freeze priority 3
> 2012/10/08 08:59:48.212279 [recoverd: 8451]: Taking out recovery lock from recovery daemon
> 2012/10/08 08:59:48.212553 [recoverd: 8451]: Take the recovery lock
> 2012/10/08 08:59:48.212731 [recoverd: 8451]: ctdb_recovery_lock: Unable to open '/mnt/lock/lockfile' --nlist='/mnt/lock/nodes' --public-addresses='/mnt/lock/public_addresses' - (No such file or directory)
> 2012/10/08 08:59:48.212850 [recoverd: 8451]: Unable to get recovery lock - aborting recovery
> 2012/10/08 08:59:49.213604 [recoverd: 8451]: Taking out recovery lock from recovery daemon
> 2012/10/08 08:59:49.213823 [recoverd: 8451]: Take the recovery lock
> 2012/10/08 08:59:49.213975 [recoverd: 8451]: ctdb_recovery_lock: Unable to open '/mnt/lock/lockfile' --nlist='/mnt/lock/nodes' --public-addresses='/mnt/lock/public_addresses' - (No such file or directory)
> 2012/10/08 08:59:49.214095 [recoverd: 8451]: Unable to get recovery lock - aborting recovery
> 2012/10/08 08:59:50.214874 [recoverd: 8451]: Taking out recovery lock from recovery daemon
> 2012/10/08 08:59:50.215338 [recoverd: 8451]: Take the recovery lock
> 2012/10/08 08:59:50.215616 [recoverd: 8451]: ctdb_recovery_lock: Unable to open '/mnt/lock/lockfile' --nlist='/mnt/lock/nodes' --public-addresses='/mnt/lock/public_addresses' - (No such file or directory)
> 2012/10/08 08:59:50.215727 [recoverd: 8451]: Unable to get recovery lock - aborting recovery
> 2012/10/08 08:59:51.216524 [recoverd: 8451]: Taking out recovery lock from recovery daemon
> 2012/10/08 08:59:51.216749 [recoverd: 8451]: Take the recovery lock
> 2012/10/08 08:59:51.216906 [recoverd: 8451]: ctdb_recovery_lock: Unable to open '/mnt/lock/lockfile' --nlist='/mnt/lock/nodes' --public-addresses='/mnt/lock/public_addresses' - (No such file or directory)
> 2012/10/08 08:59:51.217030 [recoverd: 8451]: Unable to get recovery lock - aborting recovery
> 2012/10/08 08:59:52.217836 [ 8399]: Banning this node for 300 seconds
> 2012/10/08 08:59:52.218095 [recoverd: 8451]: Taking out recovery lock from recovery daemon
> 2012/10/08 08:59:52.218228 [recoverd: 8451]: Take the recovery lock
> 2012/10/08 08:59:52.218359 [recoverd: 8451]: ctdb_recovery_lock: Unable to open '/mnt/lock/lockfile' --nlist='/mnt/lock/nodes' --public-addresses='/mnt/lock/public_addresses' - (No such file or directory)
> 2012/10/08 08:59:52.218519 [recoverd: 8451]: Unable to get recovery lock - aborting recovery
> 
> 
> I have done tests with both FC17 + stock Glusterfs and with the 3.3 version from RPM. Does anyone have a clue how to get this up and running?
> 
> 
> Morten


