CTDB: Split brain and banning
ddiss at suse.de
Mon Nov 5 10:29:44 UTC 2018
On Thu, 1 Nov 2018 14:11:08 +1100, Martin Schwenke via samba-technical wrote:
> [Somewhat repeating myself here for clarity]
> That's actually a very good idea... but then the hardcoded 5 second
> limit for taking the reclock would be hit. We could think about making
> that timeout configurable.
Hmm okay, so not much of an option at this stage.
> However, I think Michel's setting for KeepaliveInterval is too
> ambitious, given that it is less than the lock duration. I think that
> is the core of the problem.
> Having thought about this over lunch, much of this is really a question
> about how quickly you can get back a usable cluster. There's no use
> setting KeepaliveInterval so low that you can't get through recovery
> because of other limitations. GPFS (and any other distributed method
> being used for locking) will take a finite time to recover themselves.
> Setting ElectionTimeout to defer recovery after a node goes away is
> just a workaround - you know the node is gone but the cluster is still
> in the gutter.
Thanks for the details. I suppose the crux of my question was what other
implementations return to ctdb reclock callers immediately after the
reclock holder dies, if not "contended".
> So, the question becomes: do you care if a node has gone away if you
> can't do anything (reliable) about it? ;-)
I care about getting the cluster in a state where it can serve
(consistent) data to clients as soon as possible :)
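
For illustration only, the timing concern quoted above (KeepaliveInterval versus the lock duration) could be sketched roughly as follows. This is a back-of-the-envelope model, not CTDB code. The function names and the lock_release_seconds parameter are hypothetical. The dead-node detection formula, KeepaliveInterval * (KeepaliveLimit + 1), is my reading of the tunables documentation and should be treated as an assumption. The only figure taken from this thread is the hardcoded 5 second limit for taking the reclock. Election time is ignored.

```python
# Rough timing model (assumptions, not CTDB internals).

RECLOCK_TIMEOUT = 5.0  # hardcoded limit for taking the reclock, per this thread


def failure_detection_seconds(keepalive_interval, keepalive_limit):
    """Assumed time until a silent node is declared dead."""
    return keepalive_interval * (keepalive_limit + 1)


def recovery_can_take_lock(keepalive_interval, keepalive_limit,
                           lock_release_seconds,
                           reclock_timeout=RECLOCK_TIMEOUT):
    """Return True if the dead holder's lock is released before the
    reclock attempt made during recovery would time out.

    lock_release_seconds is hypothetical: the time the cluster
    filesystem needs to release a lock after its holder dies.
    """
    detect = failure_detection_seconds(keepalive_interval, keepalive_limit)
    # Recovery starts once the node is declared dead; the attempt to
    # take the reclock can then wait at most reclock_timeout longer.
    return lock_release_seconds <= detect + reclock_timeout


if __name__ == "__main__":
    # Aggressive keepalive settings against a slow lock release.
    print(recovery_can_take_lock(1, 2, 35.0))
    # More conservative keepalive settings, same lock release time.
    print(recovery_can_take_lock(5, 5, 35.0))
```

Under these assumptions, lowering KeepaliveInterval only helps while the filesystem releases the lock within the detection time plus the 5 second limit. Below that point recovery starts sooner but cannot complete.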