CTDB: Split brain and banning

David Disseldorp ddiss at suse.de
Mon Nov 5 10:29:44 UTC 2018


Hi Martin,

On Thu, 1 Nov 2018 14:11:08 +1100, Martin Schwenke via samba-technical wrote:

> [Somewhat repeating myself here for clarity]
> 
> That's actually a very good idea... but then the hardcoded 5 second
> limit for taking the reclock would be hit.  We could think about making
> that timeout configurable.

Hmm okay, so not much of an option at this stage.

> However, I think Michel's setting for KeepaliveInterval is too
> ambitious, given that it is less than the lock duration.  I think that
> is the core of the problem.
> 
> Having thought about this over lunch, much of this is really a question
> about how quickly you can get back a usable cluster.  There's no use
> setting KeepaliveInterval so low that you can't get through recovery
> because of other limitations.  GPFS (and any other distributed method
> being used for locking) will take a finite time to recover themselves.
> Setting ElectionTimeout to defer recovery after a node goes away is
> just a workaround - you know the node is gone but the cluster is still
> in the gutter.

Thanks for the details. I suppose the crux of my question was what other
implementations return to ctdb reclock callers immediately after the
reclock holder dies, if not "contended".

> So, the question becomes: do you care if a node has gone away if you
> can't do anything (reliable) about it?  ;-)

I care about getting the cluster in a state where it can serve
(consistent) data to clients as soon as possible :)

Cheers, David



More information about the samba-technical mailing list