[ceph-users] CTDB Cluster Samba on Cephfs

ronnie sahlberg ronniesahlberg at gmail.com
Fri Mar 29 12:05:56 MDT 2013


On Fri, Mar 29, 2013 at 9:31 AM, Marco Aroldi <marco.aroldi at gmail.com> wrote:
> Still trying with no success:
>
> Sage and Ronnie:
> I've tried the ping_pong tool, even with "locking=no" in my smb.conf
> (no differences)
>
> # ping_pong /mnt/ceph/samba-cluster/test 3
> I have about 180 locks/second

That is very slow.

> If I start the same command from the other node, the tools stops
> completely. 0 locks/second

Looks like fcntl() locking doesn't work that well.



The slow rate of fcntl() locking will impact samba.
By default, for almost all file i/o samba needs to do at least one
fcntl(F_GETLK) in order to check whether some other, non-samba,
process holds a lock on the file.
If you can only do 180 fcntl(F_*LK) calls per second across the cluster
for a file (I assume this is a per-file limitation),
you will effectively only be able to do 180 i/o operations per
second to that file, which will make CIFS impossibly slow for any real
use.
And this was all from a single node, so there was no inter-node contention!


So here you probably want to use "locking = no" in samba.  But beware,
locking = no can have catastrophic effects on your data.
Without "locking = no", though, samba would just become impossibly slow,
probably uselessly slow.
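
For reference, a rough sketch of how that looks in smb.conf (the share
name here is just an example; the path is the one from your earlier
mails):

    [ceph-share]
        path = /mnt/ceph/samba-cluster
        # disables samba's byte-range (fcntl) locking for this share;
        # only safe if nothing but CIFS ever touches these files
        locking = no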


Using "locking = no" in samba does mean though that you no longer have
any locking coherency across protocols.
I.e.  NFS clients and samba clients are now disjoint since they can no
longer see eachothers locks.

If you only ever access the data via CIFS, locking = no should be safe.
But if you access the data via NFS or other NAS protocols, breaking lock
coherency across protocols like this could lead to data loss,
depending on the i/o patterns.


I would recommend using locking = no only if you can guarantee
that you will never export the data via anything other than CIFS.
If you cannot guarantee that, you will have to research the usage
patterns very carefully to determine whether locking = no is safe or
not.



For fcntl() locking, it depends on the use case: is this a home
server where you can accept very poor performance, or is this a
server for a small workgroup?
If the latter, and you use locking = yes, you probably want your
filesystem to allow >10,000 lock operations per second from a node with no
contention,
and >1,000 operations per node per second when there is contention across nodes.

If it is a big server, you probably want >> instead of > for these
numbers. At least.


But first you would need to get ping_pong working reliably, both
running in a steady state and, later, running and recovering from
continuous single-node reboots.
It seems ping_pong is not working well for you at all at this
stage, so that is likely a problem.
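
To be explicit about the test I mean (a sketch only; the path is the
one from your earlier mails, and the usual rule of thumb is to pass
"number of nodes + 1" as the process count):

    # on node A (2 nodes, so 3 processes):
    ping_pong /mnt/ceph/samba-cluster/test 3

    # on node B, at the same time:
    ping_pong /mnt/ceph/samba-cluster/test 3

    # then power cycle one node at a time and check that the
    # locks/second shown on the surviving node recovers within ~20s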



As I said,  very few cluster filesystems have fcntl() locking that is
not completely broken.




For now, you could try "locking = no" in samba with the caveats above,
and you can disable the fcntl()-based split-brain prevention in CTDB by setting

CTDB_RECOVERY_LOCK=

in /etc/sysconfig/ctdb.
This will disable the split-brain detection in ctdb, but will allow you to
recover more quickly if your cluster fs does not handle fcntl() locking
well.
(With a 5 minute recovery you will have so much data loss, due to the
way CIFS clients work and time out, that there is probably little point
in running CIFS at all.)
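
To be concrete, a sketch of what that would look like (assuming a
Red Hat style setup where ctdb reads /etc/sysconfig/ctdb; restart ctdb
on each node afterwards):

    # /etc/sysconfig/ctdb
    # leave the value empty to disable the recovery lock entirely
    CTDB_RECOVERY_LOCK=

    # then, on every node:
    service ctdb restart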





>
> Sage, when I start the CTDB service, the mds log says every second:
> 2013-03-29 16:49:34.442437 7f33fe6f3700  0 mds.0.server
> handle_client_file_setlock: start: 0, length: 0, client: 5475, pid:
> 14795, type: 4
>
> 2013-03-29 16:49:35.440856 7f33fe6f3700  0 mds.0.server
> handle_client_file_setlock: start: 0, length: 0, client: 5475, pid:
> 14799, type: 4
>
> Exactly as you see it: with a blank line in between
> When I start the ping_pong command, I see these lines at the same rate
> reported by the script (180 lines/second):
>
> 2013-03-29 17:07:50.277003 7f33fe6f3700  0 mds.0.server
> handle_client_file_setlock: start: 2, length: 1, client: 5481, pid:
> 11011, type: 2
>
> 2013-03-29 17:07:50.281279 7f33fe6f3700  0 mds.0.server
> handle_client_file_setlock: start: 1, length: 1, client: 5481, pid:
> 11011, type: 4
>
> 2013-03-29 17:07:50.286643 7f33fe6f3700  0 mds.0.server
> handle_client_file_setlock: start: 0, length: 1, client: 5481, pid:
> 11011, type: 2
>
> Finally, I've tried to lower ctdb's RecoverBanPeriod, but the
> clients were unable to recover for 5 minutes (again!)
> So, I've found the mds logging this:
> 2013-03-29 16:55:23.354854 7f33fc4ed700  0 log [INF] : closing stale
> session client.5475 192.168.130.11:0/580042840 after 300.159862
>
> I hope to find a solution.
> I am at your disposal to further investigation
>
> --
> Marco Aroldi
>
> 2013/3/29 ronnie sahlberg <ronniesahlberg at gmail.com>:
>> The ctdb package comes with a tool, "ping_pong", that is used to test
>> and exercise fcntl() locking.
>>
>> I think a good test is using this tool and then randomly powercycling
>> nodes in your fs cluster
>> making sure that
>> 1, fcntl() locking is still coherent and correct
>> 2, always recover within 20 seconds for a single node power cycle
>>
>>
>> That is probably a good test for CIFS serving.
>>
>>
>> On Thu, Mar 28, 2013 at 6:22 PM, ronnie sahlberg
>> <ronniesahlberg at gmail.com> wrote:
>>> On Thu, Mar 28, 2013 at 6:09 PM, Sage Weil <sage at inktank.com> wrote:
>>>> On Thu, 28 Mar 2013, ronnie sahlberg wrote:
>>>>> Disable the recovery lock file from ctdb completely.
>>>>> And disable fcntl locking from samba.
>>>>>
>>>>> To be blunt, unless your cluster filesystem is called GPFS,
>>>>> locking is probably completely broken and should be avoided.
>>>>
>>>> Ha!
>>>>
>>>>> On Thu, Mar 28, 2013 at 8:46 AM, Marco Aroldi <marco.aroldi at gmail.com> wrote:
>>>>> > Thanks for the answer,
>>>>> >
>>>>> > I haven't yet looked at the samba.git clone, sorry. I will.
>>>>> >
>>>>> > Just a quick report on my test environment:
>>>>> > * cephfs mounted with kernel driver re-exported from 2 samba nodes
>>>>> > * If "node B" goes down, everything works like a charm: "node A" does
>>>>> > ip takeover and bring up the "node B"'s ip
>>>>> > * Instead, if "node A" goes down, "node B" can't take the rlock file
>>>>> > and gives this error:
>>>>> >
>>>>> > ctdb_recovery_lock: Failed to get recovery lock on
>>>>> > '/mnt/ceph/samba-cluster/rlock'
>>>>> > Unable to get recovery lock - aborting recovery and ban ourself for 300 seconds
>>>>> >
>>>>> > * So, for 5 minutes, neither "node A" nor "node B" are active. After
>>>>> > that, the cluster recover correctly.
>>>>> > It seems that one of the 2 nodes "owns" and doesn't want to "release"
>>>>> > the rlock file
>>>>
>>>> Cephfs aims to give you coherent access between nodes.  The cost of that
>>>> is that if another client goes down and it holds some lease/lock, you have
>>>> to wait for it to time out.  That is supposed to happen after 60 seconds,
>>>> it sounds like you've hit a bug here.  The flock/fnctl locks aren't
>>>> super-well tested in the failure scenarios.
>>>>
>>>> Even assuming it were working, though, I'm not sure that you want to wait
>>>> the 60 seconds either for the CTDB's to take over for each other.
>>>
>>> You do not want to wait 60 seconds. That is approaching territory where
>>> CIFS clients will start causing file corruption and dataloss due to
>>> them dropping writeback caches.
>>>
>>> You probably want to aim to try to guarantee that fcntl() locking
>>> start working again after
>>> ~20 seconds or so to have some headroom.
>>>
>>>
>>> Microsoft themselves state 25 seconds as the absolute deadline they
>>> require you to guarantee before they will qualify storage.
>>> That is, among other things, to accommodate and have some headroom for
>>> some really nasty data loss issues that will
>>> happen if storage cannot recover quickly enough.
>>>
>>>
>>> CIFS is hard realtime. And you will pay dearly for missing the deadline.
>>>
>>>
>>> regards
>>> ronnie sahlberg

