CTDB performance issue

Xavi Hernandez xhernandez at gmail.com
Mon May 26 09:46:05 UTC 2025


Hi Martin,

Thanks for your comments.

On Sun, May 25, 2025 at 6:52 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Xavi,
>
> On Wed, 21 May 2025 13:14:24 +0200, Xavi Hernandez
> <xhernandez at gmail.com> wrote:
>
> > I've been analyzing a problem where CTDB does many leader reelections
> > because it loses the leader broadcast messages.
> >
> > It's not always the same node that loses the broadcast, and it has been
> > confirmed with a network capture that the leader is actually sending the
> > broadcasts, and the nodes are receiving them.
> >
> > Based on the data we have collected, I think the problem is caused by the
> > lock helper processes that CTDB starts to wait for a mutex in case of
> > contention. The data shows that there's a lot of requests requiring TDB
> > access (mostly create and close), in the order of 1000 per second. However
> > we have seen that under normal circumstances, very few lock helpers are
> > started, which means that there's little contention and most of the mutexes
> > can be acquired immediately.
> >
> > The problem starts when a global operation, like vacuuming, is started. It
> > acquires the global lock of the TDB, causing all requests to contend. This
> > triggers the execution of many lock helpers, which get blocked. When 200
> > (default value for tunable LockProcessesPerDB) lock helpers are started,
> > CTDB stops creating new processes, but queues them in a list.
> >
> > Meanwhile the vacuuming operation is running. It could take a few seconds,
> > but every second ~1000 new requests are queued to run a lock helper. Once
> > vacuuming completes, the global lock is released and all 200 lock helpers
> > are unblocked at the same time.
> >
> > This seems to cause a lot of context switches. CTDB needs to handle the
> > termination of each process and starting a new one from the queue.
> > Additionally, there are many smbd processes doing work. During this time,
> > it looks like CTDB is not able to process the incoming queue fast enough,
> > and even though the leader broadcast message is present in the socket's
> > kernel buffer, CTDB doesn't see it for several seconds. This triggers the
> > timeout and forces a reelection. The reelection itself also takes the
> > global lock (TDBs are frozen), which could cause the problem to repeat.
> >
> > We are still collecting data to try to find more evidence, but right now
> > this seems to be what's happening.
> >
> > Does this make sense ?
> > Any recommendations to fix (or at least minimize) this problem in the
> > short-term ?
>
> It makes excellent sense.  Thanks for the excellent analysis!
>
> The best change I can think of would be to not start a recovery at the
> beginning of an election and to not bother with a recovery at the end
> of an election if the leader is unchanged.
>
> Basically, a recovery isn't needed unless the set of active nodes
> changes.  It is nice to have a leader, but in CTDB the leader is only
> really needed to do the things that a leader needs to do (recovery,
> failover, ...).
>
> I started some work in this area a few years ago but got spooked when I
> saw unexpected behaviour.  I need to take another look.
>

The recovery process can only make the problem worse, but it's not what
initially triggers it. Even without recovery in the picture, CTDB is
already running slowly and requests accumulate large delays (including the
leader broadcast, which then arrives too late). I also think that a
workload with a lot of lock contention (forcing CTDB to start lock helper
processes even in the normal case, without vacuuming or other global
operations) will make CTDB slower and accumulate latencies. In any case,
that's a separate issue and I don't have conclusive data yet.

Avoiding the recovery would probably help, but I think we should first try
to understand and fix the underlying issue.



> However, I'm not sure this is short-term enough for you.  :-(
>
> I'm interested in a couple of things:
>
> * What sort of workload hammers CTDB hard enough that this problem
>   occurs?  Can you please give me some idea of the scale here?
>

I can't tell for sure what the users are doing, but from a network traffic
analysis I can say that there are many open/close requests (on the order
of 1000 per second), plus mostly read and write operations. The open/close
requests need to check/update the share mode, which requires locked access
to the TDB.

This happens on a 3-node CTDB cluster.
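
To illustrate the mechanism with the plain tdb API (a minimal sketch, not
CTDB code; the file name and key are arbitrary): while another process
holds the allrecord lock, which is the "global lock" that vacuuming and
freezing take, a non-blocking per-record lock attempt fails and the
request has to wait:

/* Minimal illustration: while another process holds the TDB allrecord
 * lock, a non-blocking per-record lock attempt fails.
 * Build (paths may differ):
 *   gcc tdb_contention.c -o tdb_contention $(pkg-config --cflags --libs tdb)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <tdb.h>

int main(void)
{
    TDB_DATA key = {
        .dptr = (unsigned char *)"some-record-key",
        .dsize = strlen("some-record-key"),
    };
    pid_t child;

    /* Child: behave like a global operation (vacuuming/freeze) and hold
     * the allrecord lock for a couple of seconds. */
    child = fork();
    if (child == 0) {
        struct tdb_context *tdb = tdb_open("demo.tdb", 0, TDB_DEFAULT,
                                           O_RDWR | O_CREAT, 0600);
        if (tdb == NULL || tdb_lockall(tdb) != 0) {
            _exit(1);
        }
        sleep(2);
        tdb_unlockall(tdb);
        tdb_close(tdb);
        _exit(0);
    }

    sleep(1);   /* crude way to let the child grab the lock first */

    /* Parent: behave like a normal request needing the record lock. */
    struct tdb_context *tdb = tdb_open("demo.tdb", 0, TDB_DEFAULT,
                                       O_RDWR | O_CREAT, 0600);
    if (tdb == NULL) {
        perror("tdb_open");
        return 1;
    }

    if (tdb_chainlock_nonblock(tdb, key) != 0) {
        printf("record lock contends -> the request has to block/wait\n");
    } else {
        printf("record lock granted immediately (no contention)\n");
        tdb_chainunlock(tdb, key);
    }

    tdb_close(tdb);
    waitpid(child, NULL, 0);
    return 0;
}

In ctdbd terms, the second case is the cheap one; the first case is where
a lock helper has to be started to block on the lock.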



> * Apart from this, how stable is the cluster?  Is it a static group of
>   nodes that is generally rock solid?  In that case, you could just try
>   increasing the leader broadcast timeout.  However, if it were that
>   easy, I'm guessing you wouldn't have needed to do all of this
>   analysis.  ;-)
>

I would say that, apart from these issues, the nodes are pretty stable. We
thought about increasing the leader broadcast timeout, but without
understanding exactly why the broadcast was lost in the first place, it
wasn't clear that it would help (at that point we believed the broadcast
was actually lost, not just delayed).
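
For reference, if we do end up testing a larger timeout: I believe the
relevant tunable in recent CTDB versions is LeaderTimeoutSecs (please
correct me if the name differs elsewhere), which could be raised at
runtime with something like:

    onnode all ctdb setvar LeaderTimeoutSecs 10    # value only an example

That would only hide the latency rather than remove it, of course.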



>
> > Besides tweaking some parameters to reduce the frequency of operations that
> > require the global lock, could it help to reduce the LockProcessesPerDB ?
> > It looks like less processes would cause less context switches and less
> > overhead to ctdb, so it would be able to process the queue faster. Does
> > that make sense or this could cause slowness in other cases ?
>
> The current defaults are based on some performance work that was
> done >10 years ago.  If you have a test setup where you can generate
> similar load to your productions setup, then I would encourage you to
> try things out and report back.  Reducing the size of a thundering herd
> might be good.  However, I wonder if reducing LockProcessesPerDB means
> you might increase the latency for taking lock.
>

Yes, we will definitely try to reproduce it, but from past experience it's
sometimes hard to generate the same load in a limited test environment.
We'll try anyway and experiment with other values for the tunable.



>
> Sorry, I'm having trouble swapping all of this back into my brain
> today.  :-(
>
> > If this issue is really caused by the execution of the lock helpers, I'm
> > wondering if we couldn't get rid of them. I have an idea on that side, but
> > first I prefer to be sure that what we have seen is valid and I haven't
> > missed something else that could explain the problem.
>
> Just out of interest, do you have the "realtime scheduling" option
> switched off?  That would certainly impact on ctdbd's ability to stay
> ahead of other processes.  If this is the case, and you have a good
> reason for doing this, then perhaps we need to look at finishing and
> merging some patches to use regular scheduling priorities instead of
> real-time.  Those have been around for a long time - we were
> experimenting with this 10 years ago but never had the courage to
> listen to our test results and just go with it...
>

"realtime scheduling" is enabled, but note that even with this, all lock
helpers started by CTDB also have the same priority. So, even if running
with high priority, the main CTDB process is just one of many competing for
CPU.

I think it's also important to note that, in this particular case, lock
contention seems very low before the global lock is taken, so I expect that
most of the helpers will run without actually waiting for the mutex (it
will be granted immediately).
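
As a side note, it's easy to double-check which scheduling policy and
priority ctdbd and its lock helpers actually end up with; a minimal sketch
using the POSIX scheduler API (equivalent to running "chrt -p <pid>"):

/* Print the scheduling policy and priority of a given pid, e.g. ctdbd
 * itself or one of its lock helpers. */
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    pid_t pid;
    int policy;
    struct sched_param param;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid = (pid_t)atoi(argv[1]);

    policy = sched_getscheduler(pid);
    if (policy == -1 || sched_getparam(pid, &param) == -1) {
        perror("sched_getscheduler/sched_getparam");
        return 1;
    }

    printf("pid %d: policy=%s priority=%d\n", (int)pid,
           policy == SCHED_FIFO ? "SCHED_FIFO" :
           policy == SCHED_RR ? "SCHED_RR" : "SCHED_OTHER",
           param.sched_priority);
    return 0;
}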



>
> Also interested in your idea for getting rid of lock helpers...
>

My idea is a bit more radical. The motivation is that starting a new
process and communicating with it asynchronously takes on the order of a
few milliseconds (even tens or hundreds of milliseconds when there are
many context switches and lots of competition for the CPU). On the other
hand, trying to acquire a mutex takes on the order of a few nanoseconds.
That's a difference of at least 6 orders of magnitude. So my rough idea is
this:

1. Add support in tevent to manage locks.
2. Instead of spawning a lock helper, just send a lock attempt to tevent,
with a callback that will be executed when the lock is acquired.
3. If the lock cannot be acquired immediately, it will be added to the
pending list.
4. In each loop, and before handling immediates, tevent will check the list
of pending locks and will try to acquire them.
5. If a lock is acquired, post its callback as an immediate.
6. When tevent processes the immediates, the callbacks of all acquired
locks will be executed.

This is a very high-level idea. It will be faster as long as we don't end
up attempting each lock a million times (given the ~6 orders of magnitude
difference). I'm also thinking about how to avoid checking every single
pending lock on each iteration, which would greatly reduce the overhead,
and how to prevent starvation in the worst case. I think there are ways to
solve these problems, but first I'd like to know whether the idea makes
sense to you.
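
To make it a bit more concrete, below is a rough sketch of how the idea
could be approximated today with the existing tevent trace-callback hook,
before touching tevent itself. The hook fires just before the backend
waits rather than "before handling immediates", which is close enough for
a sketch of step 4. pthread_mutex_trylock stands in for the real TDB
record mutex (or tdb_chainlock_nonblock()), and all the names below
(lock_mgr, pending_lock, lock_mgr_lock, ...) are made up for illustration:

/*
 * Rough sketch only: keep a list of pending locks, retry each one with a
 * non-blocking trylock once per event loop iteration, and deliver the
 * callbacks of acquired locks as tevent immediates.
 */
#include <pthread.h>
#include <talloc.h>
#include <tevent.h>

typedef void (*lock_acquired_fn)(void *private_data);

struct pending_lock {
    struct pending_lock *next;
    pthread_mutex_t *mutex;     /* stand-in for the TDB chain mutex */
    lock_acquired_fn callback;
    void *private_data;
};

struct lock_mgr {
    struct tevent_context *ev;
    struct pending_lock *pending;   /* locks we could not acquire yet */
};

/* Immediate handler: the lock is already held, run the user callback. */
static void lock_mgr_im_handler(struct tevent_context *ev,
                                struct tevent_immediate *im,
                                void *private_data)
{
    struct pending_lock *pl = talloc_get_type_abort(private_data,
                                                    struct pending_lock);
    pl->callback(pl->private_data);
    talloc_free(pl);
}

/* Retry every pending lock; schedule callbacks for the ones acquired. */
static void lock_mgr_poll(struct lock_mgr *mgr)
{
    struct pending_lock **pp = &mgr->pending;

    while (*pp != NULL) {
        struct pending_lock *pl = *pp;

        if (pthread_mutex_trylock(pl->mutex) != 0) {
            pp = &pl->next;         /* still contended */
            continue;
        }
        *pp = pl->next;             /* unlink: lock acquired */
        tevent_schedule_immediate(tevent_create_immediate(pl),
                                  mgr->ev, lock_mgr_im_handler, pl);
    }
}

static void lock_mgr_trace(enum tevent_trace_point point, void *private_data)
{
    struct lock_mgr *mgr = talloc_get_type_abort(private_data,
                                                 struct lock_mgr);

    /* Retry just before the backend sleeps.  A real implementation would
     * need a fallback timer (to make progress when no fd events arrive)
     * and some fairness/starvation policy. */
    if (point == TEVENT_TRACE_BEFORE_WAIT) {
        lock_mgr_poll(mgr);
    }
}

struct lock_mgr *lock_mgr_create(TALLOC_CTX *mem_ctx,
                                 struct tevent_context *ev)
{
    struct lock_mgr *mgr = talloc_zero(mem_ctx, struct lock_mgr);

    mgr->ev = ev;
    tevent_set_trace_callback(ev, lock_mgr_trace, mgr);
    return mgr;
}

/* "Send a lock attempt to tevent": try once now, otherwise queue it. */
void lock_mgr_lock(struct lock_mgr *mgr, pthread_mutex_t *mutex,
                   lock_acquired_fn cb, void *private_data)
{
    struct pending_lock *pl;

    if (pthread_mutex_trylock(mutex) == 0) {
        cb(private_data);   /* fast path: no contention at all */
        return;
    }

    pl = talloc_zero(mgr, struct pending_lock);
    pl->mutex = mutex;
    pl->callback = cb;
    pl->private_data = private_data;
    pl->next = mgr->pending;
    mgr->pending = pl;
}

A native implementation inside tevent could drop the trace-callback
indirection and integrate the pending list with the existing immediate
handling, which is what point 4 above really asks for. It would also be
the right place to address the "don't scan every pending lock on every
iteration" and starvation concerns mentioned above.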

Best regards,

Xavi

