Patch to support Scalable CTDB

Tue May 1 01:37:58 UTC 2018

Hi Partha,

On Mon, 30 Apr 2018 16:12:01 -0700, Partha Sarathi via samba-technical
<samba-technical at lists.samba.org> wrote:

> On Mon, Apr 30, 2018 at 7:52 AM, Volker Lendecke <Volker.Lendecke at sernet.de>
> wrote:
> 
> > On Mon, Apr 30, 2018 at 02:36:14PM +0000, Partha Sarathi via
> > samba-technical wrote:
> > > Ok. My concern is when you have common Ctdb running across cluster with
> > > different file spaces keeping the locking.tdb replicated for all file
> > > opens doesn’t seems to be worth.
> >
> > locking.tdb is never actively replicated. The records are only ever
> > moved on demand to the nodes that actively request it. That's
> > different from secrets.tdb for example.

> Thanks, Volker and Ralph, but I see different behavior.
> 
> I had three node cluster running a common ctdb with different filespace on
> each of the nodes as below
> 
> pnn:1 fe80::ec4:7aff:fe34:ac0b OK
> pnn:2 fe80::ec4:7aff:fe34:ee47 OK
> pnn:3 fe80::ec4:7aff:fe34:b923 OK (THIS NODE)
> Generation:1740147135
> Size:3
> hash:0 lmaster:1
> hash:1 lmaster:2
> hash:2 lmaster:3
> Recovery mode:NORMAL (0)
> Recovery master:2
> 
> 
> 1) Opened a "1.pdf" on node 1  and  noticed couple records updated in
> "locking.tdb.1" and also in node 2 "locking.tdb.2".
> 2) Opened a "3.pdf" on node 3 and  noticed couple records updated in
> "locking.tdb.3" and also in both "locking.tdb.2" and locking.tdb.1
> 
> Per your statement what I was expecting was unless any node specifically
> request for the records, it shouldn't have to get those records. but in the
> above example, even without asking all the records were available on all
> the nodes. Basically, one more understanding what I learned is, every node
> in the cluster try to update their open/close file records  to Recovery
> master in the large cluster with different filespace it may be overwhelmed
> with all record updates unnecessarily.
> 
> The below is the locking.tdb dumps on all the three nodes for different
> file open/close, but locking.tdb had all the records on all the nodes.
> The open records for file "3.pdf" was not necessary on node 1, but Recovery
> master had those records, so it updated/replicated to rest of the nodes in
> the cluster.
> 
> So this kind of cluster-wide replications may slow down the overall
> performance when you are trying to open a large number of file with
> different file spaces in subclusters.

I don't think you're seeing records in a volatile database
being replicated.  However, there is a simple explanation for what
you're seeing, especially on a 3-node cluster!

As others have said, the volatile databases are distributed.

Unfortunately the diagrams at:

  https://wiki.samba.org/index.php/Samba_%26_Clustering#Finding_the_DMASTER

are wrong.  I have a new diagram but need to discuss with people
whether the above should be kept as a historical document or whether I
should update.

CTDB uses 2 (relatively :-) simple concepts for doing the distribution:

* DMASTER (or data master)

  This is the node that has the most recent copy of a record.

  The big question is: How can you find this DMASTER?  The answer is...

* LMASTER (or location master)

  This node always knows which node is DMASTER.

  The LMASTER for a record is calculated by hashing the record key and
  then doing a modulo of the number of active, LMASTER-capable nodes
  and then mapping this to a node number via the VNNMAP.

Let's say you have 3 nodes (A, B, C) and node A wants a
particular record. Let's say that node B is the LMASTER for that
record.

There are 3 cases, depending on which node is DMASTER:

* DMASTER is A

  smbd will find the record locally.  No migration is necessary.  The
  LMASTER is not consulted.

* DMASTER is B

  A will ask B for the record.  B will notice that it is DMASTER and
  will forward the record to A.  The record will be updated on both A
  and B because the change of DMASTER must be recorded.

* DMASTER is C

  A will ask B for the record.  B will notice that it is not DMASTER
  and forward the request to C.  C forwards the record to B, which
  forwards it to A.  The record will be updated on A, B and C because
  the change of DMASTER must be recorded.

You can now add nodes D, E, F, ... and they will not affect migration
of the record (if there is no contention for the record from those
additional nodes).
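
The three cases can be written down as a small toy model.  This is
only a sketch of the description above, not CTDB code: the function
name and return values are invented for illustration, and it assumes
(as in the example) that the requesting node is not itself the LMASTER.

```python
def migrate(requester: str, lmaster: str, dmaster: str):
    """Model one record fetch as described in the three cases above.

    Returns (path, updated):
      path    -- nodes the request and record pass through, in order
      updated -- nodes whose local copy of the record is rewritten
    """
    if requester == lmaster:
        # Not covered by the example above, so this sketch refuses it.
        raise ValueError("sketch assumes requester is not the LMASTER")
    if dmaster == requester:
        # Record is already local: no migration, LMASTER not consulted.
        return [requester], set()
    if dmaster == lmaster:
        # LMASTER is also DMASTER: it forwards the record directly.
        return [requester, lmaster, requester], {requester, lmaster}
    # LMASTER forwards the request to the DMASTER, which sends the
    # record back via the LMASTER to the requester.
    path = [requester, lmaster, dmaster, lmaster, requester]
    return path, {requester, lmaster, dmaster}
```

For example, migrate("A", "B", "C") reports updates on A, B and C, and
that set stays the same no matter how many extra nodes (D, E, F, ...)
exist.  On a 3-node cluster those 3 nodes are simply the whole cluster.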

If there is heavy contention for a record then 2 different performance
issues can occur:

* High hop count

  Before C gets the request from node B, C responds to a migration
  request from another node and is no longer DMASTER for the record.
  C must then forward the request back to the LMASTER. This can go on
  for a while. CTDB logs this sort of behaviour and keeps statistics.

* Record migrated away before smbd gets it

  The record is successfully migrated to node A and ctdbd informs the
  requesting smbd that the record is there.  However, before smbd can
  grab the record, a request is processed to migrate the record to
  another node.  smbd looks, notices that node A is not DMASTER and
  must once again ask ctdbd to migrate the record.  smbd may log if
  there are multiple attempts to migrate a record.

  Try this:

    git grep attempts source3/lib/dbwrap

  to get an initial understanding of what is logged and what the
  parameters are.  :-)
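
The second issue is essentially a retry loop.  A toy model, assuming a
hypothetical list of "who is DMASTER when smbd looks" observations
(this is not the dbwrap code, and all names here are invented):

```python
def fetch_with_retries(local_node, dmaster_when_checked, max_attempts=10):
    """Count how many migration attempts are needed to grab a record.

    dmaster_when_checked -- for each attempt, the node that is DMASTER
                            at the moment smbd looks at the record
    Returns the attempt number that succeeded, or None if the record
    was never found locally within max_attempts.
    """
    for attempt, dmaster in enumerate(dmaster_when_checked, start=1):
        if attempt > max_attempts:
            return None
        if dmaster == local_node:
            # Record is still here when smbd looks: success.
            return attempt
        # Record was migrated away again: ask ctdbd once more.
    return None
```

With heavy contention the observation list looks like
["C", "B", "A"] for node A, so the fetch only succeeds on the third
attempt.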

Read-only and sticky records are existing features that may help to
counteract contention.

peace & happiness,
martin