[Samba] db_ctdb_fetch_locked: folder lock contention of over 4 million ms (over an hour)

Thu Oct 7 16:01:10 UTC 2021

Hey Martin, thanks for the information.

We did notice there were duplicates, such as `locking.tdb.2` being on nodes
2 and 3. This also happened during the EBS outage the week before. Could
this be a symptom of the high hopcount problem? records being moved between
nodes because one node had an outage or similar? Would there be a way to
recover without restarting ctdb?

The problem with the stop, delete, start method we have been using is that
it affects all users of the system, whereas the issues are usually isolated
to a single folder.

Cheers,
- isaac

On Thu, Oct 7, 2021 at 2:51 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Isaac,
>
> On Wed, 6 Oct 2021 11:57:07 -0700, Isaac Stone via samba
> <samba at lists.samba.org> wrote:
>
> > We have been noticing slowness and outages in samba.
> >
> > First time was the aws EBS outage of 9/27. We are hosting our ctdb `tdb`
> > files on aws nvme, and the outage caused issues that did not seem to get
> > recovered. The smbd logs had lots of this kind of message
> >
> > db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key
> > AECED2BC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1
> > attempts, 21471 milliseconds, chainlock: 7257.000000 ms, CTDB
> > 14214.887000 ms
>
> > We have samba+ctdb setup with multiple globally distributed nodes.
> >
> > The issue persisted well after the EBS outage, but seemed restricted to
> > accessing a single folder. Anyone who tried to access that folder would
> > have their machine hard lock and were forced to reboot.
> >
> > To fix and restore functionality to our users we stopped ctdb on all
> nodes,
> > wiped the `volatile` directory and then started ctdb again.
> >
> > This fixed the issue, but then yesterday we had the same thing again.
> This
> > time it was a different folder, and the wait times were much higher, past
> > 4,500,000 milliseconds.
> >
> > I attempted to run `ctdb wipedb locking.tdb` but that did not seem to
> have
> > any effect. In the end we had to stop, delete volatile, and restart
> again.
> >
> > questions:
> >
> > What could be the root cause? we see occasional db_ctdb_fetch_locked logs
> > but usually not more than 10 seconds. But this is the second big spike
> and
> > I want to know what caused it and how to avoid. I am looking at possible
> > changing fileid:algorithm from fsname to fsname_nodirs, any potential
> > problems with this?
>
> When I have seen this it has been due to an smbd holding a lock (in
> locking.tdb) and getting stuck in a system call to the cluster
> filesystem. This is usually due to a cluster filesystem bug or an
> underlying storage problem (an EBS outage might do that).
>
> You can diagnose this by hand by seeing if one of your nodes has an
> smbd in D state and then check /proc/<pid>/stack for the relevant pid.
>
> See also the "lock debug script" option in the [database] section of the
> ctdb.conf(5) manual page.  However, if your installation is using
> robust mutexes for locking records in TDBs (the modern default) then
> you would need the changes that went into 4.15 to be able to use that.
>
> > Durning around the same time as the locking logs in smbd we see ctdbd
> logs
> > like this
> >
> > High hopcount 198 dbid:locking.tdb
> >
> > Could be related?
>
> High hopcount is described here:
>
>
> https://wiki.samba.org/index.php/CTDB_database_design#Performance_problems_due_to_contention
>
> It is possible that they're related if there is a high enough latency
> when fetching records between nodes.  It might not be
> smbd-stuck-in-cluster-filesystem...  but 4,500 / 200 = 22.5, so you'd
> need a 22 second latency to get that number... and it seems unlikely.
>
> I hope some of that is helpful...
>
> peace & happiness,
> martin
>
>