CTDB performance issue

Martin Schwenke martin at meltin.net
Thu May 29 05:32:18 UTC 2025


Hi Xavi,

On Mon, 26 May 2025 11:46:05 +0200, Xavi Hernandez
<xhernandez at gmail.com> wrote:

> The recovery process can only make the problem worse, but it's not what
> initially triggers the problem itself. Even without the recovery in the
> picture, CTDB is already running slowly and requests accumulate big delays
> (including the leader broadcast which arrives too late). I also think that
> a workload where there's a lot of lock contention (forcing CTDB to start
> the lock helper processes even in normal cases without vacuuming or other
> operations) will make CTDB go slower and accumulate latencies. In any case
> that's another issue and I don't have conclusive data yet.
> 
> Probably avoiding recovery would help, but I think we should try to
> understand and fix the initial issue.

Definitely true.

> I can't tell for sure what the users are doing, but from a network traffic
> analysis, I can say that there are many open/close requests (in the order
> of 1000 per second), and read and write operations mostly. The open/close
> requests need to check/update the share mode, which requires TDB locked
> access.
> 
> This happens on a 3 node CTDB cluster.

Is ctdbd logging any messages like the following?

  WARNING: CPU utilisation X% >= threshold (90%)

I know a lot of the load seems to be coming from lock helpers.
However, when there is a lot of lock contention, ctdbd itself
sometimes gets close to saturation.

Also, there is always the possibility that some directories, usually at
the top of a share, are generating a lot of lock contention, because
they are common to a lot of users.  One simple example is a share that
contains home directories under /home.  Although none of the users are
able to modify /home, there is still a lot of locking.tdb traffic
related to this directory.  To work around this you can break lock
coherency in this directory using features of:

  https://www.samba.org/samba/docs/current/man-html/vfs_fileid.8.html

In particular, see the fileid:nolock_paths option.  This can have a
surprising effect on overall lock contention.
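
Something like the following in smb.conf is what I have in mind (the
share name, algorithm and path are only illustrative - check the man
page above for the exact semantics of the options in your version):

  [homes]
      vfs objects = fileid
      fileid:algorithm = fsname
      fileid:nolock_paths = /home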

If admins use SMB clients to create directories for users then you can
always have them use an admin version of the share that has lock
coherency.

Yes, more workarounds, not really addressing the underlying problem.  :-)


> I would say that without these issues, nodes are pretty stable. We thought
> about increasing the leader broadcast timeout, but without understanding
> exactly why the broadcast was lost in the first place, it was not so clear
> it could help (we thought that the broadcast was actually lost, not just
> delayed).

Makes sense.

> [...]

> Yes, we will definitely try to reproduce it, but from past experience,
> sometimes it's hard to generate the same load from a limited test
> environment. We'll try it anyway and experiment with other values for the
> tunable.

> [...]

> "realtime scheduling" is enabled, but note that even with this, all lock
> helpers started by CTDB also have the same priority. So, even if running
> with high priority, the main CTDB process is just one of many competing for
> CPU.

The WIP patches include an option to run the lock helpers at a lower
priority than ctdbd.  Could help...
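
Roughly the kind of thing I mean (a sketch only, not the actual WIP
code, and the helper function name is made up): after forking, the lock
helper drops out of the realtime class it inherited from ctdbd and
lowers its nice value:

#include <sched.h>
#include <sys/resource.h>

/* Sketch only: a forked lock helper leaves the realtime scheduling
 * class it inherited from ctdbd and lowers its nice value, so it
 * cannot compete with the main daemon for CPU. */
static void lock_helper_drop_priority(void)
{
        struct sched_param sp = { .sched_priority = 0 };

        (void)sched_setscheduler(0, SCHED_OTHER, &sp); /* leave SCHED_FIFO/RR */
        (void)setpriority(PRIO_PROCESS, 0, 10);        /* and nice it down */
}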

> I think it's also important to note that, in this particular case, lock
> contention seems very low before the global lock is taken, so I expect that
> most of the helpers will run without actually waiting for the mutex (it
> will be granted immediately).

Hmmm... OK.

> My idea is a bit more radical. The motivation is that starting a new
> process and communicating with it asynchronously is in the order of a few
> milliseconds (even tens or hundreds of milliseconds if we have a high
> number of context switches and CPU competition). On the other hand, trying
> to acquire a mutex is in the order of a few nanoseconds. This means a
> difference of at least 6 orders of magnitude. So my raw idea is based on:
> 
> 1. Add support in tevent to manage locks.
> 2. Instead of spawning a lock helper, just send a lock attempt to tevent,
> with a callback that will be executed when the lock is acquired.
> 3. If the lock cannot be acquired immediately, it will be added to the
> pending list.
> 4. In each loop, and before handling immediates, tevent will check the list
> of pending locks and will try to acquire them.
> 5. If a lock is acquired, post its callback as an immediate.
> 6. When tevent processes the immediates, the callbacks of all acquired
> locks will be executed.
> 
> This is a very high level idea. This will be faster as long as we don't
> attempt to get each lock a million times. I'm also thinking about how to
> prevent having to check each single pending lock in each iteration, which
> will heavily reduce the overhead, and how to prevent starvation in the
> worst case. I think there are ways to avoid these problems, but first I
> need to know if this idea makes sense to you.

I doubt that support for locks will be added to tevent - ctdbd would be
the only user.

However, you could do something very similar by using tevent to process
a queue when you add to it and also on a timer.  The only problem is,
as you say, "as long as we don't attempt to get each lock a million
times".  You end up with a queue that you need to manage.  You might
place a limit on the number of locks you retry in each run.  Then you
need to decide whether you're going to be fair and always retry the
oldest queue members first (and they might be stubborn), or move the
ones that have already been tried to the end of the queue, since others
may be more likely to succeed.  So, it is likely to get complicated.
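
Just to make the shape concrete, this is roughly what I mean by a queue
managed from the event loop: a list of pending mutexes polled from a
one-shot tevent timer, with a cap on how many entries are retried per
run.  It is only a sketch - error handling is omitted, the
lock_queue/pending_lock names are made up, and nothing like this exists
in ctdb or tevent today:

#include <pthread.h>
#include <talloc.h>
#include <tevent.h>

struct pending_lock {
        struct pending_lock *next;
        pthread_mutex_t *mutex;               /* e.g. a chain mutex in the tdb mmap */
        void (*callback)(void *private_data); /* runs once the lock is held */
        void *private_data;
};

struct lock_queue {
        struct tevent_context *ev;
        struct pending_lock *head;            /* oldest entry first */
        unsigned int max_per_run;             /* retry at most this many per run */
};

static void lock_queue_run(struct tevent_context *ev, struct tevent_timer *te,
                           struct timeval now, void *private_data);

/* Poll again in 1ms while anything is still queued. */
static void lock_queue_schedule(struct lock_queue *q)
{
        if (q->head != NULL) {
                tevent_add_timer(q->ev, q, tevent_timeval_current_ofs(0, 1000),
                                 lock_queue_run, q);
        }
}

/* Queue a lock attempt; the callback fires from the event loop once the
 * mutex has actually been taken. */
static void lock_queue_add(struct lock_queue *q, pthread_mutex_t *mutex,
                           void (*callback)(void *), void *private_data)
{
        struct pending_lock *p = talloc_zero(q, struct pending_lock);
        struct pending_lock **pp;

        p->mutex = mutex;
        p->callback = callback;
        p->private_data = private_data;

        for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
                /* walk to the tail - good enough for a sketch */
        }
        *pp = p;

        if (q->head == p) {
                lock_queue_schedule(q);       /* first entry: start polling */
        }
}

/* Timer handler: try the oldest entries first, stop after max_per_run. */
static void lock_queue_run(struct tevent_context *ev, struct tevent_timer *te,
                           struct timeval now, void *private_data)
{
        struct lock_queue *q = talloc_get_type_abort(private_data,
                                                     struct lock_queue);
        struct pending_lock **pp = &q->head;
        unsigned int tried = 0;

        while (*pp != NULL && tried++ < q->max_per_run) {
                struct pending_lock *p = *pp;

                if (pthread_mutex_trylock(p->mutex) != 0) {
                        pp = &p->next;        /* still contended, leave it queued */
                        continue;
                }
                *pp = p->next;                /* got it: unlink, run the callback */
                p->callback(p->private_data);
                talloc_free(p);
        }

        lock_queue_schedule(q);               /* re-arm if work remains */
}

Even in this toy form the policy questions show up immediately: the
poll interval, the per-run cap, and whether stubborn entries stay at
the front or get rotated to the back.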

At the moment, by using blocking locks, we're delegating the queuing
to the kernel.

Quite a few years ago TDB switched from fcntl() locks to mutexes, due
to the thundering herd problem.  Now, the fcntl() lock thundering herd
problem seems to be elegantly solved in the Linux kernel.  I don't know
what else we would lose, but perhaps it is time to try fcntl() locks
again?
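
For reference, the fcntl() side of it is just this kind of blocking
byte-range lock (a standalone illustration, not tdb code - IIRC tdb's
fcntl backend locks one byte per hash chain in much this way).  With
F_SETLKW the process sleeps in the kernel until the lock is granted, so
queuing and waking the waiters is entirely the kernel's problem:

#include <fcntl.h>
#include <sys/types.h>

/* Blocking write lock on a single byte at the given offset.  F_SETLKW
 * sleeps in the kernel until the lock can be granted (or a signal
 * arrives), so user space never has to manage the wait queue. */
static int chain_lock(int fd, off_t offset)
{
        struct flock fl = {
                .l_type = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start = offset,
                .l_len = 1,
        };

        return fcntl(fd, F_SETLKW, &fl);   /* -1 with errno == EINTR on signal */
}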

Interesting problems... good times...  :-)

peace & happiness,
martin


