CTDB performance issue
Xavi Hernandez
xhernandez at gmail.com
Wed May 21 11:14:24 UTC 2025
Hello,
I've been analyzing a problem where CTDB triggers many leader re-elections
because it loses the leader broadcast messages.
It's not always the same node that misses the broadcast, and a network
capture has confirmed that the leader is actually sending the broadcasts
and that the nodes are receiving them.
Based on the data we have collected, I think the problem is caused by the
lock helper processes that CTDB starts to wait for a mutex in case of
contention. The data shows that there are a lot of requests requiring TDB
access (mostly create and close), on the order of 1000 per second. However,
we have seen that under normal circumstances very few lock helpers are
started, which means that there is little contention and most of the
mutexes can be acquired immediately.
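To make the pattern concrete, this is roughly what I mean, as a minimal
standalone sketch (names and structure simplified by me; it is not the
actual CTDB/TDB code, which uses robust mutexes inside the mapped TDB file
and the separate ctdb_lock_helper binary):

  /* Illustrative sketch only: trylock on the fast path, fork a blocking
   * helper process on contention.  Not the real CTDB/TDB implementation. */
  #include <pthread.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void fetch_lock(pthread_mutex_t *m)
  {
      if (pthread_mutex_trylock(m) == 0) {
          /* Fast path: no contention, no helper process needed. */
          pthread_mutex_unlock(m);
          return;
      }
      /* Slow path: a separate helper process blocks on the mutex. */
      pid_t helper = fork();
      if (helper == 0) {
          pthread_mutex_lock(m);
          pthread_mutex_unlock(m);
          _exit(0);
      }
      /* In ctdbd this wait is event-driven; blocking here just keeps the
       * sketch short.  Helper exiting means the lock became free. */
      waitpid(helper, NULL, 0);
  }

  int main(void)
  {
      /* Process-shared mutex, analogous to mutexes in the mmapped TDB. */
      pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      pthread_mutexattr_t a;
      pthread_mutexattr_init(&a);
      pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
      pthread_mutex_init(m, &a);

      pthread_mutex_lock(m);        /* simulate vacuuming holding the lock */
      if (fork() == 0) {
          fetch_lock(m);            /* contends, so it spawns a helper */
          _exit(0);
      }
      sleep(1);
      pthread_mutex_unlock(m);      /* release: the helper unblocks */
      wait(NULL);
      return 0;
  }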
The problem starts when a global operation, like vacuuming, is started. It
acquires the global lock of the TDB, causing all requests to contend. This
triggers the execution of many lock helpers, which get blocked. Once 200
lock helpers (the default value of the LockProcessesPerDB tunable) have
been started, CTDB stops creating new helper processes and instead queues
the pending lock requests in a list.
Meanwhile, the vacuuming operation is running. It can take a few seconds,
and every second ~1000 new requests are queued waiting to run a lock
helper. Once vacuuming completes, the global lock is released and all 200
lock helpers are unblocked at the same time.
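If it's useful to confirm this on a live node, the limit and the number of
running helpers can be watched while the issue is happening, for example
(assuming the helper binary is installed under its usual name,
ctdb_lock_helper):

  # ctdb getvar LockProcessesPerDB
  # pgrep -fc ctdb_lock_helper

Seeing the helper count pinned at the LockProcessesPerDB value for the
duration of a vacuuming run would match the behaviour described above.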
This seems to cause a lot of context switches: CTDB needs to handle the
termination of each helper and start a new one from the queue.
Additionally, there are many smbd processes doing work at the same time.
During this period it looks like CTDB is not able to process its incoming
queue fast enough, and even though the leader broadcast message is present
in the socket's kernel buffer, CTDB doesn't see it for several seconds.
This triggers the timeout and forces a re-election. The re-election itself
also takes the global lock (the TDBs are frozen), which can cause the
problem to repeat.
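For what it's worth, one way to check the "message is sitting in the kernel
buffer" part would be to look at the receive queue of the CTDB transport
sockets during one of these episodes, for example (assuming the default TCP
transport on port 4379):

  # ss -tn 'sport = :4379 or dport = :4379'

A persistently non-zero Recv-Q on ctdbd's connections while the daemon is
busy would support the theory that the broadcast arrives but isn't read in
time.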
We are still collecting data to try to find more evidence, but right now
this seems to be what's happening.
Does this make sense?
Any recommendations to fix (or at least minimize) this problem in the
short term?
Besides tweaking some parameters to reduce the frequency of operations that
require the global lock, could it help to reduce LockProcessesPerDB? It
looks like fewer processes would cause fewer context switches and less
overhead in ctdb, so it would be able to process the queue faster. Does
that make sense, or could this cause slowness in other cases?
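For reference, the kind of change I have in mind would be something like
this in the ctdb.tunables file (values purely illustrative, not
recommendations, and assuming VacuumInterval is the right knob to space out
vacuuming runs on this version):

  LockProcessesPerDB=64
  VacuumInterval=120

or the equivalent via "ctdb setvar" at runtime, to test the effect before
making it persistent.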
If this issue is really caused by the execution of the lock helpers, I'm
wondering whether we could get rid of them altogether. I have an idea on
that front, but first I'd prefer to be sure that what we have seen is valid
and that I haven't missed something else that could explain the problem.
Thanks and best regards,
Xavi