Patch to support Scalable CTDB
martin at meltin.net
Tue May 1 01:37:58 UTC 2018
On Mon, 30 Apr 2018 16:12:01 -0700, Partha Sarathi via samba-technical
<samba-technical at lists.samba.org> wrote:
> On Mon, Apr 30, 2018 at 7:52 AM, Volker Lendecke <Volker.Lendecke at sernet.de>
> wrote:
> > On Mon, Apr 30, 2018 at 02:36:14PM +0000, Partha Sarathi via
> > samba-technical wrote:
> > > Ok. My concern is when you have common Ctdb running across cluster with
> > > different file spaces keeping the locking.tdb replicated for all file
> > > opens doesn't seem to be worthwhile.
> > locking.tdb is never actively replicated. The records are only ever
> > moved on demand to the nodes that actively request it. That's
> > different from secrets.tdb for example.
> Thanks, Volker and Ralph, but I see different behavior.
> I had three node cluster running a common ctdb with different filespace on
> each of the nodes as below
> pnn:1 fe80::ec4:7aff:fe34:ac0b OK
> pnn:2 fe80::ec4:7aff:fe34:ee47 OK
> pnn:3 fe80::ec4:7aff:fe34:b923 OK (THIS NODE)
> hash:0 lmaster:1
> hash:1 lmaster:2
> hash:2 lmaster:3
> Recovery mode:NORMAL (0)
> Recovery master:2
> 1) Opened a "1.pdf" on node 1 and noticed couple records updated in
> "locking.tdb.1" and also in node 2 "locking.tdb.2".
> 2) Opened a "3.pdf" on node 3 and noticed couple records updated in
> "locking.tdb.3" and also in both "locking.tdb.2" and locking.tdb.1
> Per your statement what I was expecting was unless any node specifically
> request for the records, it shouldn't have to get those records. but in the
> above example, even without asking all the records were available on all
> the nodes. Basically, one more understanding what I learned is, every node
> in the cluster try to update their open/close file records to Recovery
> master in the large cluster with different filespace it may be overwhelmed
> with all record updates unnecessarily.
> Below are the locking.tdb dumps on all three nodes for different file
> open/close operations; locking.tdb had all the records on all the nodes.
> The open records for file "3.pdf" was not necessary on node 1, but Recovery
> master had those records, so it updated/replicated to rest of the nodes in
> the cluster.
> So this kind of cluster-wide replications may slow down the overall
> performance when you are trying to open a large number of file with
> different file spaces in subclusters.
I don't think you're seeing records in a volatile database
being replicated. However, there is a simple explanation for what
you're seeing, especially on a 3 node cluster!
As others have said, the volatile databases are distributed.
Unfortunately the diagrams at:
are wrong. I have a new diagram but need to discuss with people
whether the above should be kept as a historical document or whether I
CTDB uses 2 (relatively :-) simple concepts for doing the distribution:
* DMASTER (or data master)
This is the node that has the most recent copy of a record.
The big question is: How can you find this DMASTER? The answer is...
* LMASTER (or location master)
This node always knows which node is DMASTER.
The LMASTER for a record is calculated by hashing the record key and
then doing a modulo of the number of active, LMASTER-capable nodes
and then mapping this to a node number via the VNNMAP.
Let's say you have 3 nodes (A, B, C) and node A wants a
particular record. Let's say that node B is the LMASTER for that
record.
There are 3 cases, depending on which node is DMASTER:
* DMASTER is A
smbd will find the record locally. No migration is necessary. The
LMASTER is not consulted.
* DMASTER is B
A will ask B for the record. B will notice that it is DMASTER and
will forward the record to A. The record will be updated on both A
and B because the change of DMASTER must be recorded.
* DMASTER is C
A will ask B for the record. B will notice that it is not DMASTER
and forward the request to C. C forwards the record to B, which
forwards it to A. The record will be updated on A, B and C because
the change of DMASTER must be recorded.
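The three cases above can be modelled as a small Python function.
This is a toy model of the walkthrough, not CTDB code: migrate and
its return values are hypothetical, and it assumes, as in the
example, that the requesting node is not itself the LMASTER.

```python
def migrate(requester: str, lmaster: str, dmaster: str) -> tuple[list[str], set[str]]:
    """Model one record fetch.

    Returns (path, touched): the sequence of nodes the request and the
    record pass through, and the set of nodes whose local copy of the
    record is updated because the DMASTER changed.
    """
    if requester == lmaster:
        # Not covered by the walkthrough above, so not modelled here.
        raise ValueError("this sketch assumes requester != lmaster")
    if dmaster == requester:
        # Record is already local; the LMASTER is not consulted.
        return [requester], set()
    path = [requester, lmaster]
    if dmaster != lmaster:
        # LMASTER is not DMASTER: forward to the DMASTER, record comes back.
        path += [dmaster, lmaster]
    path.append(requester)
    return path, {requester, lmaster, dmaster}


if __name__ == "__main__":
    for d in "ABC":
        path, touched = migrate("A", "B", d)
        print(f"DMASTER={d}: path={'->'.join(path)} updated={sorted(touched)}")
```

On a 3 node cluster the last case touches every node, which is why
the records can look replicated even though they were only migrated.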
You can now add nodes D, E, F, ... and they will not affect migration
of the record (if there is no contention for the record from those
nodes).
If there is heavy contention for a record then 2 different performance
issues can occur:
* High hop count
Before C gets the request from node B, C responds to a migration
request from another node and is no longer DMASTER for the record.
C must then forward the request back to the LMASTER. This can go on
for a while. CTDB logs this sort of behaviour and keeps statistics.
* Record migrated away before smbd gets it
The record is successfully migrated to node A and ctdbd informs the
requesting smbd that the record is there. However, before smbd can
grab the record, a request is processed to migrate the record to
another node. smbd looks, notices that node A is not DMASTER and
must once again ask ctdbd to migrate the record. smbd may log if
there are multiple attempts to migrate a record.
Try:
  git grep attempts source3/lib/dbwrap
to get an initial understanding of what is logged and what the
parameters are. :-)
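The high hop count case described above can be illustrated with a
toy counter in Python. This is a sketch under my own assumptions:
count_hops and its input format are hypothetical, and the way hops
are counted here is not claimed to match CTDB's own statistics.

```python
def count_hops(attempts: list[tuple[str, str]]) -> int:
    """Count message hops until a request reaches the current DMASTER.

    Each attempt is (target, dmaster_on_arrival): the node the LMASTER
    forwards the request to, and the node that really is DMASTER by
    the time the request arrives there.
    """
    hops = 1  # requester -> LMASTER
    for target, dmaster_on_arrival in attempts:
        hops += 1  # LMASTER -> node it believes is DMASTER
        if target == dmaster_on_arrival:
            return hops  # request caught up with the record
        hops += 1  # record already moved on: back to the LMASTER
    raise RuntimeError("record not reached within the given attempts")


if __name__ == "__main__":
    print(count_hops([("C", "C")]))                          # no contention
    print(count_hops([("C", "D"), ("D", "E"), ("E", "E")]))  # record keeps moving
```

Every time the record moves before the request arrives, the request
makes another round trip through the LMASTER, so the hop count grows
with the amount of contention for that record.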
Read-only and sticky records are existing features that may help to
peace & happiness,