[Samba] CTDB RecLockLatencyMs vs RecoverInterval

Wed Jul 1 04:13:12 UTC 2020

Hi Bob,

On Tue, 30 Jun 2020 22:20:14 -0400, Robert Buck <robert.buck at som.com>
wrote:

> Yes, we happen to be using Samba and CTDB v4.10.7, on Ubuntu. *Would these
> happen to include the defect?*  *In your opinion, will 4s be an issue?* We
> happen to be running this on top of a geo-distributed etcd cluster, and in
> this particular case there was about 4200 miles between the two data
> centers. We're running a distributed NFS file system over a total of three
> data centers, spanning 7000+ miles. During failover testing we're seeing
> failover times less than 7 seconds, which seems pretty nice to me.  *In
> your experience, anything we should be tuning for? *

4.10.7 has the bug.  The Ubuntu package almost certainly has the bug.
If Ubuntu provided an update in the last few months then it might be
possible that they backported the fix... but I doubt it.  I've taken a
quick look at the changelog for the package and I don't see any
evidence of them having backported this fix.  I have not checked their
patches - sorry, no time for that.  I had the "pleasure" of backporting
the fix to 4.9.x and it was non-trivial, due to other changes.

High recovery lock latency isn't generally an issue.  It is simply the
amount of time it takes for CTDB to take the lock, which is (or
should be) rarely taken, so it doesn't impact normal performance.

That said, high recovery lock latency can certainly help to trigger the
bug. The most likely time for it to be triggered is if multiple nodes
have restarted CTDB at (about) the same time.  The window is where a
node has started running recovery and, while taking the lock, another
node joins the cluster and becomes active.  So, if "while taking the
lock" is a significant amount of time then the window is widened.
General slowness can also widen the window.

I've analysed situations where the bug was triggered ~5 times in total.
Several of these occurrences had high recovery lock latency logged.
Attempts to recreate it, even by artificially increasing the lock
latency (by wrapping the lock program in a script that sleeps for a few
seconds) were unsuccessful.  There might be a gap in my understanding
of how the problem is sometimes avoided (i.e. there might be a code path
I haven't considered).  However, the reason the problem occurs and how
to fix it were very clear to me.
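
For illustration only, a wrapper of the kind described above (sleep for a
few seconds, then run the real lock program) might look roughly like the
sketch below.  This is not the script that was actually used.  The
LOCK_WRAPPER_DELAY variable, the 4 second default and the file name are
all made up for the example, and the real lock helper and its arguments
are whatever the cluster is already configured to run.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: delay a lock helper to simulate high lock latency."""
import os
import sys
import time

# Assumption: the delay comes from an invented environment variable.
DELAY_ENV = "LOCK_WRAPPER_DELAY"
DEFAULT_DELAY = 4.0  # seconds, chosen only to match the 4s discussed above


def parse_delay(environ):
    """Return the delay in seconds, using the default on missing/bad input."""
    try:
        delay = float(environ.get(DELAY_ENV, DEFAULT_DELAY))
    except ValueError:
        return DEFAULT_DELAY
    return max(delay, 0.0)


def build_command(argv):
    """Split the wrapper's argv into (helper_path, helper_argv)."""
    if len(argv) < 2:
        raise ValueError("usage: lock_wrapper.py HELPER [ARGS...]")
    # The helper's own argv[0] is its path, followed by its arguments.
    return argv[1], argv[1:]


def main(argv, environ):
    helper, helper_argv = build_command(argv)
    time.sleep(parse_delay(environ))
    # exec replaces this process, so the helper keeps the same PID and the
    # standard descriptors inherited from the caller.
    os.execv(helper, helper_argv)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv, os.environ)
    else:
        print("usage: lock_wrapper.py HELPER [ARGS...]", file=sys.stderr)
```

The idea would be to point the recovery lock setting at this wrapper, with
the original helper command line passed through as its arguments.  Using
exec rather than spawning a child keeps the helper as the direct child of
its caller, which seemed the least intrusive way to add a delay.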

So, it looks like you need to be unlucky to hit this bug...  but
avoiding bad luck isn't known to be a reliable strategy.  ;-)

Sorry that I don't have a clear answer.  The bug is a race condition,
so the chances of hitting it are somewhat difficult to quantify.

peace & happiness,
martin