CTDB: Split brain and banning

Stefan Kooman stefan at bit.nl
Tue Nov 6 07:51:37 UTC 2018


Hi,

FYI: The issue has received bugid 13672 [1].

Quoting David Disseldorp via samba-technical (samba-technical at lists.samba.org):
> On Thu, 1 Nov 2018 12:07:36 +1100, Martin Schwenke via samba-technical wrote:
> 
> > I wonder if we need a configurable number of timeouts during which the
> > lock is retried... and then finally banned.  This relates to David's
> > question about whether the helper should block and retry internally -
> > that seems like a better solution.  However, then we hit the fixed 5
> > second time-out that is allowed for taking the recovery lock.  Perhaps
> > that needs to be configurable.  Then David's suggestion could work.
> 
> I'd prefer to fix this in the caller (ctdb), rather than reclock
> helpers. I think it's reasonable to expect that reclock helpers will
> return "contended" for an (arbitrarily) short period following failure
> of the reclock holder.

The potential problem we see with a blocking reclock helper is that the
VIPs configured on that node stay online until the helper times out. If
network connectivity were restored during that window, those VIPs would
clash with the VIPs on the other nodes, and the node would try to take
the reclock, hit contention and ban itself. If the helper failed
faster, the node would have deconfigured its VIPs, would not try to
take the reclock, and would be able to recover without disturbing the
other nodes.
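
To make the fail-fast option concrete, here is a minimal sketch of
retrying in the caller instead of blocking in the helper. Everything in
it is hypothetical: try_lock, max_attempts and retry_delay are
illustrative names, not existing CTDB code or tunables. Each attempt is
assumed to return quickly, and only after a configurable number of
"contended" results does the caller fall through to the ban.

```python
import time
from typing import Callable


def take_reclock(
    try_lock: Callable[[], str],
    max_attempts: int = 3,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try a non-blocking lock helper a bounded number of times.

    try_lock is a hypothetical stand-in for one short helper run and is
    assumed to return "ok", "contended" or anything else for an error.
    Returns "held", "ban" or "error".
    """
    for attempt in range(1, max_attempts + 1):
        status = try_lock()
        if status == "ok":
            return "held"
        if status != "contended":
            # Errors other than contention are not retried here.
            return "error"
        if attempt < max_attempts:
            # Short pause; the node is never stuck inside the helper.
            sleep(retry_delay)
    # Still contended after all attempts: fall back to banning.
    return "ban"
```

With max_attempts=1 this reduces to the current "ban on first
contention" behaviour, so the retry count is the only new knob.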

One other thing I would like to note: a node that gets "contention" on a
reclock bans itself immediately, probably because of the assumption that
a node is only allowed to try to take the reclock once it has won the
CTDB election. However, we have seen that this does not hold if you
partition the network and later restore connectivity. I think CTDB
should try to handle "contention" cases a bit more gracefully: first try
to find out "who" has the lock (the Ceph mutex helper will give you the
IP address of the client, for example) and then check whether that node
is healthy. Only PANIC! when there is a reason to ;-).
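
As a rough sketch of that idea (the function name, the is_healthy
callback and the three outcomes are made up for illustration; this is
not existing CTDB code, and the mapping from holder health to action is
only one possible policy):

```python
from typing import Callable, Optional


def on_contention(
    holder: Optional[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Decide what to do after the lock helper reports contention.

    holder is the address of the current lock holder, or None if the
    helper cannot report it. is_healthy is a hypothetical check of that
    node's state. Returns "ban", "back_off" or "retry".
    """
    if holder is None:
        # Holder unknown: keep the conservative behaviour of today.
        return "ban"
    if is_healthy(holder):
        # A healthy node legitimately holds the lock: step back
        # without banning ourselves.
        return "back_off"
    # Holder looks dead or unreachable: the lock is probably stale,
    # so try again later instead of banning.
    return "retry"
```

The point is only that the ban becomes one outcome among several,
chosen after looking at who actually holds the lock.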

My 2 cents.


Gr. Stefan

[1]: https://bugzilla.samba.org/show_bug.cgi?id=13672
-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info at bit.nl


