CTDB performance issue

Xavi Hernandez xhernandez at gmail.com
Thu May 29 10:06:54 UTC 2025


Hi Martin,

On Thu, May 29, 2025 at 7:32 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Xavi,
>
> On Mon, 26 May 2025 11:46:05 +0200, Xavi Hernandez
> <xhernandez at gmail.com> wrote:
>
> > The recovery process can only make the problem worse, but it's not what
> > initially triggers the problem itself. Even without the recovery in the
> > picture, CTDB is already running slowly and requests accumulate big
> delays
> > (including the leader broadcast which arrives too late). I also think
> that
> > a workload where there's a lot of lock contention (forcing CTDB to start
> > the lock helper processes even in normal cases without vacuuming or other
> > operations) will make CTDB go slower and accumulate latencies. In any
> case
> > that's another issue and I don't have conclusive data yet.
> >
> > Probably avoiding recovery would help, but I think we should try to
> > understand and fix the initial issue.
>
> Definitely true.
>
> > I can't tell for sure what the users are doing, but from a network
> traffic
> > analysis, I can say that there are many open/close requests (in the order
> > of 1000 per second), and read and write operations mostly. The open/close
> > requests need to check/update the share mode, which requires TDB locked
> > access.
> >
> > This happens on a 3 node CTDB cluster.
>
> Is ctdbd logging any messages like the following?
>
>   WARNING: CPU utilisation X% >= threshold (90%)
>
> I know a lot of the load seems to be coming from lock helpers.
> However, when there is a lot of lock contention then ctdbd sometimes
> gets close to saturation.
>

No. The overall CPU and memory utilization are monitored and they don't
seem to be near saturation, and those log messages don't appear either.
However, we have observed that ctdbd shows some 100% CPU spikes, but they
are probably not sustained long enough to be detected by ctdbd itself.



> Also, there is always the possibility that some directories, usually at
> the top of a share, are generating a lot of lock contention, because
> they are common to a lot of users.  One simple example is a share that
> contains home directories under /home.  Although none of the users are
> able to modify /home, there is still a lot of locking.tdb traffic
> related to this directory.  To work around this you can break lock
> coherency in this directory using features of:
>
>   https://www.samba.org/samba/docs/current/man-html/vfs_fileid.8.html
>
> In particular, see the fileid:nolock_paths option.  This can have a
> surprising effect on overall lock contention.
>

The support people working on this already have experience with this issue,
and they use fsname_norootdir, and even fsname_nodirs, when necessary.
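
For reference, a minimal configuration along those lines might look like
the snippet below (share name and path are just examples; the options are
the ones documented in vfs_fileid(8)):

  [userdata]
      path = /export/userdata
      vfs objects = fileid
      # Break lock coherency only for the root directory of the share
      fileid:algorithm = fsname_norootdir
      # or, when necessary, for all directories:
      # fileid:algorithm = fsname_nodirs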

In any case, lock contention between user operations is very low. Check the
attached graph: it shows the number of lock helpers started per second.
Before 17:28:22, 0 or 1 lock helpers are started per second, which means
very low contention. At 17:28:22 the global lock of a TDB is taken. This
causes "artificial" contention, but it's completely independent of the user
operations. CTDB creates 200 lock helpers in 2 seconds (which can't
progress because the global lock is taken), and then it just queues newer
incoming requests until a slot to run a new lock helper becomes free. At
17:28:27 the global lock is released and all lock helpers get unblocked.
This is when the problem starts and CTDB is almost unable to progress:
pending requests are executed sequentially, one each time a lock helper
completes.



> If admins use SMB clients to create directories for users then you can
> always have them use an admin version of the share that has lock
> coherency.
>
> Yes, more workarounds, not really addressing the underlying problem.  :-)
>
>
> > I would say that without these issues, nodes are pretty stable. We
> thought
> > about increasing the leader broadcast timeout, but without understanding
> > exactly why the broadcast was lost in the first place, it was not so
> clear
> > it could help (we thought that the broadcast was actually lost, not just
> > delayed).
>
> Makes sense.
>
> > [...]
>
> > Yes, we will definitely try to reproduce it, but from past experience,
> > sometimes it's hard to generate the same load from a limited test
> > environment. We'll try it anyway and experiment with other values for the
> > tunable.
>
> > [...]
>
> > "realtime scheduling" is enabled, but note that even with this, all lock
> > helpers started by CTDB also have the same priority. So, even if running
> > with high priority, the main CTDB process is just one of many competing
> for
> > CPU.
>
> The WIP patches include an option to run the lock helpers at a lower
> priority than ctdbd.  Could help...
>

I'm not sure it would help in this case. Note that when the problem
happens, we have 200 processes that have just been woken at the same time,
and probably hundreds or thousands more waiting to be started. We need one
of the lock helpers to complete before another request can start.
Decreasing the priority of the lock helpers could actually increase this
delay.

Another potential improvement would be to attempt the mutex lock for each
queued request before starting a lock helper.
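
A minimal sketch of that idea, with hypothetical names (this is not the
real ctdb code; the real chain mutexes are robust process-shared pthread
mutexes, so EOWNERDEAD has to be handled):

  #include <errno.h>
  #include <pthread.h>
  #include <stdbool.h>

  /* Hypothetical queued lock request -- not the real ctdb structure. */
  struct lock_request {
          pthread_mutex_t *chain_mutex;   /* robust mutex in the tdb mmap */
          void (*granted)(struct lock_request *req);
          struct lock_request *next;
  };

  /*
   * Non-blocking attempt on the chain mutex.  Returns true if the lock
   * was acquired (including recovery from a dead owner), in which case
   * no lock helper needs to be started for this request.
   */
  static bool lock_request_try_fast(struct lock_request *req)
  {
          int ret = pthread_mutex_trylock(req->chain_mutex);

          if (ret == EOWNERDEAD) {
                  /* Previous owner died while holding the mutex. */
                  pthread_mutex_consistent(req->chain_mutex);
                  ret = 0;
          }
          if (ret != 0) {
                  return false;       /* contended: fall back to a helper */
          }
          req->granted(req);           /* lock held: run the callback now */
          return true;
  }

Only the requests for which this fails would consume a helper slot.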



> > I think it's also important to note that, in this particular case, lock
> > contention seems very low before the global lock is taken, so I expect
> that
> > most of the helpers will run without actually waiting for the mutex (it
> > will be granted immediately).
>
> Hmmm... OK.
>
> > My idea is a bit more radical. The motivation is that starting a new
> > process and communicating with it asynchronously is in the order of a few
> > milliseconds (even tens or hundreds of milliseconds if we have a high
> > number of context switches and CPU competition). On the other hand,
> trying
> > to acquire a mutex is in the order of a few nanoseconds. This means a
> > difference of at least 6 orders of magnitude. So my raw idea is based on:
> >
> > 1. Add support in tevent to manage locks.
> > 2. Instead of spawning a lock helper, just send a lock attempt to tevent,
> > with a callback that will be executed when the lock is acquired.
> > 3. If the lock cannot be acquired immediately, it will be added to the
> > pending list.
> > 4. In each loop, and before handling immediates, tevent will check the
> list
> > of pending locks and will try to acquire them.
> > 5. If a lock is acquired, post its callback as an immediate.
> > 6. When tevent processes the immediates, the callbacks of all acquired
> > locks will be executed.
> >
> > This is a very high level idea. This will be faster as long as we don't
> > attempt to get each lock a million times. I'm also thinking about how to
> > prevent having to check each single pending lock in each iteration, which
> > will heavily reduce the overhead, and how to prevent starvation in the
> > worst case. I think there are ways to avoid these problems, but first I
> > need to know if this idea makes sense to you.
>
> I doubt that support for locks will be added to tevent - ctdbd would be
> the only user.
>

tevent is designed to run in single-threaded applications. In those
applications, taking contended locks without blocking the entire thread is
a very real problem, and launching a lock helper has an overhead of some
milliseconds, while taking a lock only takes a few nanoseconds (6 or more
orders of magnitude!). So I would say it's a common problem for all
applications using tevent.
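
To make the idea more concrete, here is a rough sketch on top of the
existing tevent API (tevent_create_immediate() and
tevent_schedule_immediate() are real; the pending-lock list, the hook that
would run this check before immediates are processed, and all the names
below are hypothetical):

  #include <errno.h>
  #include <pthread.h>
  #include <tevent.h>

  /* Hypothetical pending-lock record kept by the single-threaded daemon. */
  struct pending_lock {
          struct pending_lock *next;
          pthread_mutex_t *mutex;
          struct tevent_immediate *im;   /* from tevent_create_immediate() */
          void (*callback)(void *private_data);
          void *private_data;
  };

  static struct pending_lock *pending_locks;

  static void pending_lock_done(struct tevent_context *ev,
                                struct tevent_immediate *im,
                                void *private_data)
  {
          struct pending_lock *pl = private_data;

          pl->callback(pl->private_data);  /* the lock is already held here */
  }

  /* Would run once per loop iteration, before immediates are handled. */
  static void pending_locks_check(struct tevent_context *ev)
  {
          struct pending_lock **pp = &pending_locks;

          while (*pp != NULL) {
                  struct pending_lock *pl = *pp;
                  int ret = pthread_mutex_trylock(pl->mutex);

                  if (ret == EOWNERDEAD) {
                          pthread_mutex_consistent(pl->mutex);
                          ret = 0;
                  }
                  if (ret != 0) {
                          pp = &pl->next;  /* still contended, keep queued */
                          continue;
                  }

                  /* Acquired: unlink and post the callback as an immediate. */
                  *pp = pl->next;
                  tevent_schedule_immediate(pl->im, ev,
                                            pending_lock_done, pl);
          }
  }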



> However, you could do something very similar by using tevent to process
> a queue when you add to it and also on a timer.  The only problem is,
> as you say, "as long as we don't attempt to get each lock a million
> times".  You end up with a queue that you need to manage.  You might
> place a limit on the number of locks you retry in each run.  Then you
> need to decide whether you're going to be fair and always retry the
> oldest queue members first (and they might be stubborn), or move those
> that have been tried to the end of the queue, since others may be more
> likely to succeed.  So, it is likely to get complicated.
>

Depending on timers to retry seems very bad to me.

My idea to avoid checking all locks every time was to group locks in blocks
of 64, for example, and keep a bitmap indicating which ones may be
available. When a lock is released, the corresponding bit is set,
indicating that a lock attempt may succeed; while the bit is clear, no
attempt is made. We could also create blocks of blocks of mutexes, reducing
the number of checks even further. With this approach, only the mutexes
that have actually been released are retried, which reduces the overhead
significantly.

Of course there are still many details to work out, like the fact that
these bitmaps would need to live in shared memory, and how to deal with
dead owners. That's just an initial approach.
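
A rough sketch of the bitmap part, with hypothetical names and layout (in
practice the bitmap would have to sit next to the mutexes in the shared
tdb mmap):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdint.h>

  #define LOCKS_PER_BLOCK 64

  /* Hypothetical helper that retries one queued lock (see above). */
  static void retry_pending_lock(struct lock_block *blk, unsigned idx);

  /* One block of 64 chain mutexes plus a "maybe free" bitmap. */
  struct lock_block {
          _Atomic uint64_t maybe_free;    /* bit i set => mutex i released */
          pthread_mutex_t mutexes[LOCKS_PER_BLOCK];
  };

  /* Called by the releasing process right after pthread_mutex_unlock(). */
  static void lock_block_mark_released(struct lock_block *blk, unsigned idx)
  {
          atomic_fetch_or(&blk->maybe_free, UINT64_C(1) << idx);
  }

  /* Called by the waiter: only retry the mutexes whose bit is set. */
  static void lock_block_retry(struct lock_block *blk)
  {
          uint64_t bits = atomic_exchange(&blk->maybe_free, 0);

          while (bits != 0) {
                  unsigned idx = __builtin_ctzll(bits);  /* lowest set bit */

                  bits &= bits - 1;                      /* clear that bit */
                  retry_pending_lock(blk, idx);
          }
  }

A second-level bitmap over the blocks would work the same way.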



> At the moment, by using blocking locks, we're delegating the queuing
> to the kernel.
>
> Quite a few years ago TDB switched from fcntl() locks to mutexes, due
> to the thundering herd problem.  Now, the fcntl() lock thundering herd
> problem seems to be elegantly solved in the Linux kernel.  I don't know
> what else we would lose, but perhaps it is time to try fcntl() lock
> again?
>

I don't think it would be better. fcntl() requires at least two system
calls even when not contended (one to lock and one to unlock), while an
uncontended mutex doesn't require any system call. Also, how does fcntl()
avoid the lock helpers? We would still need them when there's contention,
right?
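
Just to illustrate the uncontended fast paths being compared (simplified;
the fcntl() example uses a 1-byte range lock, roughly as tdb's fcntl
backend does for chain locks):

  #include <fcntl.h>
  #include <pthread.h>
  #include <unistd.h>

  /* fcntl() byte-range lock: at least two system calls, even uncontended. */
  static void fcntl_lock_unlock(int fd, off_t off)
  {
          struct flock fl = {
                  .l_type = F_WRLCK,
                  .l_whence = SEEK_SET,
                  .l_start = off,
                  .l_len = 1,
          };

          fcntl(fd, F_SETLKW, &fl);       /* syscall #1: take the lock */
          fl.l_type = F_UNLCK;
          fcntl(fd, F_SETLKW, &fl);       /* syscall #2: release it */
  }

  /* Futex-backed mutex: both calls stay in user space when uncontended. */
  static void mutex_lock_unlock(pthread_mutex_t *m)
  {
          pthread_mutex_lock(m);          /* no syscall if not contended */
          pthread_mutex_unlock(m);        /* no syscall if nobody waits */
  }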

Best regards,

Xavi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lock-helpers.png
Type: image/png
Size: 63446 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20250529/89136940/lock-helpers-0001.png>

