ctdb reclock + fcntl posix locks with ocfs2

Nicolas Ecarnot nicolas at ecarnot.net
Thu Nov 28 04:19:49 MST 2013


[Short version]
What should be tweaked for the ctdb reclock directory to get correct 
fcntl locks on an ocfs2 filesystem?


[Less short version]

Hi,

We are running a two-node cluster in production that is working fine: 
Ubuntu Server 12.10, cman, samba, ctdb.
The ctdb reclock is kept on a dedicated GFS2 partition, and the data 
(samba user shares) is stored on a dedicated OCFS2 partition.

I am now setting up a similar cluster, but with updated versions and 
distributions, and with the intention of simplifying some parts:

- 2 nodes, CentOS 6.4 64-bit, running as two oVirt VMs
- cman 3.0.12
- corosync 1.4.1
- samba 3.6.9
- ctdb 1.0.114.5-3
- ocfs2 1.8.0-10.el6
- ctdb lock LUN and ocfs2 user data both stored on a remote EqualLogic 
iSCSI SAN
- UEK kernel: 2.6.39-400.211.1.el6uek.x86_64

I would like to get rid of running two different clustered filesystems 
(GFS2 and OCFS2) and keep only OCFS2, as my 3 TB of data are already 
stored this way.
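
On the new cluster the reclock therefore points at the ocfs2 mount; in 
/etc/sysconfig/ctdb (where the EL6 ctdb package keeps its settings) the 
relevant line is simply:

CTDB_RECOVERY_LOCK="/ctdb/.ctdb.lock"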

Once everything is set up, I start ctdb on the first node, then on the 
second, and I am facing a problem I remember having had to cope with 
two years ago on the production cluster:

ctdb_recovery_lock: Got recovery lock on '/ctdb/.ctdb.lock'
ERROR: recovery lock file /ctdb/.ctdb.lock not locked when recovering!

Two years ago, the only fix I found was to switch back to storing this 
lock on a GFS2 partition.
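
For reference, here is a minimal sketch of the kind of check involved, 
as far as I understand it: ctdb takes an exclusive fcntl write lock on 
the reclock file, so while one node holds that lock, a second node must 
not be able to take it. The file and program names below are just 
placeholders, and the offset and length are only for this test; I do 
not know exactly which ones ctdb uses.

/* reclock_test.c - minimal cross-node fcntl (POSIX) lock check.
 * Node 1:  ./reclock_test hold /ctdb/locktest.dat
 * Node 2:  ./reclock_test try  /ctdb/locktest.dat
 * While node 1 holds the lock, node 2 must be refused.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct flock fl;
    int fd;

    if (argc != 3) {
        fprintf(stderr, "usage: %s hold|try <file>\n", argv[0]);
        return 2;
    }
    fd = open(argv[2], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;      /* exclusive byte-range (POSIX) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    if (strcmp(argv[1], "hold") == 0) {
        if (fcntl(fd, F_SETLKW, &fl) != 0) {
            perror("F_SETLKW");
            return 2;
        }
        printf("lock held on %s, press Ctrl-C to release\n", argv[2]);
        pause();
    } else {
        if (fcntl(fd, F_SETLK, &fl) == 0) {
            printf("BAD: lock granted although the other node holds it\n");
            return 1;
        }
        printf("GOOD: lock refused (%s)\n", strerror(errno));
    }
    return 0;
}

If node 2 can take the lock while node 1 holds it, the fcntl locks are 
not cluster-coherent on that filesystem, which would explain the "not 
locked when recovering" message above.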

I took the time to search and test many things.

The ping_pong tests showed the following:

* 1 *
ping_pong /ctdb/test.dat 3
showing ~ 2.1 M locks/s, and /NOT/ dropping to a different value when 
it is also run on the second node. The behaviour is symmetrical, though.

* 2 *
ping_pong -rw /ctdb/test.dat 3
showing ~ 330 k locks/s, dropping to 190 locks/s when it is also run on 
node 2. Symmetrical again.

* 3 *
ping_pong -rw -m /ctdb/test.dat 3
showing ~ 2.1 M locks/s, dropping to 88 k locks/s when it is also run 
on node 2, with the data increment oscillating between 1 and 
30, 150, 180...


In test 1, according to the advice on the ctdb web page, the rate _NOT_ 
dropping seems to indicate an issue, so I tried to validate that POSIX 
locking itself works correctly, following the advice found here:
http://serverfault.com/questions/531813/how-to-determine-posix-advisory-file-locks-are-working-in-simfs-in-the-vm-im-us
and those tests were successful.

I found a thread on this samba-technical mailing list where someone 
solved a similar problem when storing the ctdb lock on a GlusterFS volume 
(https://lists.samba.org/archive/samba-technical/2012-October/087515.html), 
but on ocfs2 I have no equivalent of the --direct-io-mode=enable mount 
option. In the OCFS2 man page, I read this:
"datavolume
This mount option has been deprecated in OCFS2 1.6. It has been used in 
the past (OCFS2  1.2  and  OCFS2  1.4),  to force the Oracle RDBMS to 
issue direct IOs to the hosted data files, control files, redo logs, 
archive logs, voting disk, cluster registry, etc. It has been deprecated 
because it is no longer required.  Oracle  RDBMS  users should instead 
use the init.ora parameter, filesystemio_options, to enable direct IOs."
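
Since direct I/O keeps coming up in these threads, a quick experiment I 
can run on the ocfs2 mount is to check whether O_DIRECT works there at 
all, with a throw-away test program like this (the path is just a 
scratch file; I do not know whether ctdb actually needs direct I/O on 
the reclock):

/* odirect_test.c - does an O_DIRECT open work on the ocfs2 mount? */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/ctdb/odirect.test";
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);

    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }
    printf("O_DIRECT open succeeded on %s\n", path);
    close(fd);
    return 0;
}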

I also found this relevant bug:
https://bugzilla.samba.org/show_bug.cgi?id=6777
but the filesystem and kernel versions I'm using should be more than 
capable of handling all this. This is confirmed in Oracle's OCFS2 1.6 
release notes.

My cluster.conf contains the following line:
<dlm plock_ownership="1" plock_rate_limit="0"/>
because this seemed correct according to what I have understood so far.

I'd be very glad to hear your advice on what I could change next, or 
what I should check.
Meanwhile, I will test the setup I wanted to avoid: storing the ctdb 
lock on a GFS2 partition, to see whether it works as it does on our old 
production cluster.

Thank you.

-- 
Nicolas Ecarnot

