ctdb

Fri Jun 17 14:33:24 UTC 2016

On Thu, Jun 16, 2016 at 8:46 PM, Amitay Isaacs <amitay at gmail.com> wrote:
> Hi Steve,
>
> On Fri, Jun 17, 2016 at 7:32 AM, Steve French <smfrench at gmail.com> wrote:
>>
>> We are running into some problems with nfs leases/delegations and slow
>> timeouts which impact ctdb recovery, so am leaning toward your idea of
>> backporting patches for ctdb cluster mutex helper from master.
>
>
> Can you please explain the problem in a bit more detail?  I am interested to
> know what timeouts are affecting ctdb recovery.  And specifically how this
> is related to ctdb recovery lock.

(Added samba-technical because I think this info is of general interest.)

I am not in front of the ticket at the moment, but we are/were placing
the recovery lock on NFSv4 storage.

With NFSv4.1, locks are handled with leases. The lease period is 45
seconds (in our case, but perhaps universally), and two lease periods
must expire before a lock is released when the node that owns the lock
has crashed.

So, say Node A was running the CTDB leader and it crashed. Node B's
instance of CTDB is elected the leader and tries to take the lock, but
it has to sit there for up to 90 seconds (average 45 seconds) until
the lease expires and it can get the lock.
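
To make the arithmetic concrete, here is a rough sketch in Python. The
names are made up for this example and are not part of CTDB or NFS, the
45 second lease period is just the value from our setup, and starting
the clock at the moment of the crash is a simplifying assumption.

```python
# Illustrative only: hypothetical names, values taken from the text above.
LEASE_PERIOD_S = 45      # NFSv4.1 lease period in our setup (site-specific)
PERIODS_TO_EXPIRE = 2    # lease periods that must pass before the lock is released


def worst_case_wait(lease_period_s=LEASE_PERIOD_S, periods=PERIODS_TO_EXPIRE):
    """Upper bound, in seconds, on how long the new leader waits for the lock."""
    return lease_period_s * periods


def remaining_wait(elapsed_since_crash_s,
                   lease_period_s=LEASE_PERIOD_S,
                   periods=PERIODS_TO_EXPIRE):
    """Seconds still to wait, assuming the clock starts at the crash (assumption)."""
    return max(0, worst_case_wait(lease_period_s, periods) - elapsed_since_crash_s)


if __name__ == "__main__":
    print(worst_case_wait())     # upper bound with the values above
    print(remaining_wait(30))    # election finished 30 s after the crash
```

With a shorter lease period the bound shrinks proportionally, but the
new leader still cannot get the lock until the server gives up on the
dead client's lease.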

It is for this reason that I have started thinking about what is
required to work with things like etcd or ZooKeeper, where recovery
times can be much lower.

-- 
Regards,
Richard Sharpe
(何以解憂？唯有杜康。--曹操)