CTDB - the 'Mainz' plan for clustered Samba

tridge at samba.org tridge at samba.org
Sat Sep 30 01:07:48 GMT 2006

A while ago I mentioned that a number of us have been working on a new
plan for clustered Samba. The document below is my write up of our
current thoughts on how to make clustered Samba scalable, fast and

This is the first public discussion of this plan, and I hope we'll get
some useful feedback. It is quite complex in places, and I expect that
it will need quite a few details filled in and quite a few things
fixed as problems are found, but I also think it does provide a good
basis for a great clustered Samba solution.

The document should probably move to our Wiki once we've had the
initial flurry of discussion on this list.

Cheers, Tridge


ctdb - Clustered trivial database

  Andrew Tridgell
  Samba Team

version 0.2
September 2006

tdb has been at the the heart of Samba for many years. It is an
extremely lightweight database originally modelled on the old-style
Berkeley database API, but with some additional features that make it
particularly suitable for Samba.

Over the last few years there has been a lot of discussion about the
possibility of using Samba on tightly connected clusters of computers
running a clustered filesystem, and inevitably those discussions
revolve around ways of providing the equivalent functionality to tdb
but for a clustered environment.

It was quickly discovered that a naive port of tdb to a cluster
performed very badly, often resulting in negative scaling, sometimes
by quite a large factor. These original porting attempts were based
around using the coherence provided by the clustered filesystem
through the fcntl() byte range locking interface as the core coherence
method for a clustered tdb, and using the shared storage of files
available in all clustered filesystems to store the tdb database

My own efforts in this area largely consisted of talking to the
developers of a number of clustered filesystems, and giving them
simple test tools which demonstrated the lack of scalability of the
operations we were relying upon. While this did lead to some
improvements in some of the clustered filesystems, the performance was
still a very long way from what we needed in really scalable clustered
database solution.


The first real step forward was the work done by Volker Lendecke and
others in the 'vl-messaging' branch of Samba3. That code attempts to
work around the performance limitations of clustered filesystems using
a variety of techniques. The most significant of those techniques are:

 1) changing the tdb API to use a fetch_locked() call paired with
    talloc destructor for unlock to reduce the number of round trips
    needed for the most common operation on the most contended

 2) The use of a 'dispatcher' process, separate from smbd, to handle
    communication between the database layers of the smbd daemons on
    each node. 

 3) The use of a 'file per key' design to minimise the contention on
    data within files, which especially helps on clustered filesystems
    which have the notion of a 'whole file lease' or token.

After some tuning the vl-messaging branch did show some degree of
positive scaling, which was an enormous step forward compared to
previous attempts. Unfortunately its baseline performance is still not
nearly at the level we want.

Mainz plan

At a meeting in Mainz in July we did some experiments on the
vl-messaging code, and came up with a new design that should alleviate
to a large extent the remaining scalability and baseline performance

The basis of the design is a significant departure from our previous
attempts, in that we will no longer attempt to utilise the existing
data and lock coherence mechanisms of the underlying clustered
filesystem, but will instead build our own locking and transport layer
independent of the filesystem.

Temporary databases

The design below is particularly aimed at the temporary databases in
Samba, which are the databases that get wiped and re-created each time
Samba is started. The most important of those databases are the
'brlock.tdb' byte range locking database and the 'locking.tdb' open
file database. There are a number of other databases that fall into
this class, such as 'connections.tdb' and 'sessionid.tdb', but they
are of less concern as they are accessed much less frequently.

Samba also uses a number of persistent databases, such as the password
database, which must be handled in a different manner from the
temporary databases. The modification to the 'Mainz design' for those
databases is discussed in a later section below.

What data can we lose?

One of the keys to understanding the new design that is outlined below
is to see that Samba has quite different requirements regarding data
integrity for the data in its temporary databases than most
traditional clustered databases provide.

In a traditional clustered database or clustered filesystem with N
nodes, the system tries to guarantee than even if N-1 of those nodes
suddenly go down that no data is lost. Guaranteeing this data
integrity is very expensive, and is a large part of why it is hard to
make clustered databases and clustered filesystems scale and perform

So when we were trying to build Samba on top of a clustered filesystem
we were automatically getting this data integrity guarantee. This
usually meant that when we wrote data to the filesystem that the data
had to go onto shared storage (a SAN or similar), or had to be
replicated to all nodes. For user data written by Samba clients this
is great, but for the meta data in our temporary databases it is
completely unnecessary.

The reason it is unnecessary is twofold:

 1) when a node goes down that holds the only copy of some meta data,
    its OK if that meta data only relates to open files or other
    resources opened by clients of that node. In fact, its preferable
    to lose that data, as when that node goes down those resources are
    implicitly released.

 2) data associated with open resources on node A that is stored on
    node B can be recreated by node A, as node A already keeps
    internal data structures relating to the open resources of its

Combining these two observations, we can see that we don't need, or in
fact want, a clustered version of tdb in Samba for the temporary
databases to provide the usual guarantees of data integrity.

In the extreme case where all nodes go down, the correct thing to
happen to these temporary databases is that they be completely lost. A
normal clustered database or clustered filesystem takes great steps to
ensure that doesn't happen.

Where this really comes into play is in the description of the "ctdb
recovery" below which handles the situation when one or more nodes go

Conditional Append 

Another crucial thing to understand about our temporary databases is
that the most common logical operation on the records is a
"conditional append". A conditional append is an operation that first
checks a condition (which can be a complex function) and then if that
condition holds true it adds some of its own data to the end of the

For example, in the open files database the "condition" is that the
new open does not conflict with any of the existing open handles on
the file. The "append" is to add a new bit of data at the end of the
record which represents the new open handle.

Along with this conditional append is a "remove and trigger"
operation, which removes just part of a record, and then potentially
triggers an operation. In the case of our open files database the
"remove" is to delete part of the record associated with the open file
(a part previously added by a conditional append) and then potentially
trigger a callback which could allow a pending open to continue.

This "conditional append" and "remove and trigger" pattern allows us
to greatly reduce the number of network round trips involved in
implementing the ctdb protocol. 

Local storage

The first step in the design is to abandon the use of the underlying
clustered filesystem for storing the tdb data in the temporary
databases. Instead, each node of the cluster will have a local,
old-style tdb stored in a fast local filesystem. Ideally this
filesystem will be in-memory, such as on a small ramdisk, but a fast
local disk will also suffice if that is more administratively

This local storage tdb will be referred to as the LTDB in the
discussion below. The contents of this database on each node will be a
subset of the records in the CTDB (clustered tdb).

Virtual Node Mapping

To facilitate automatic recovery when one or more nodes in the cluster
dies, the CTDB will make use of a node numbering scheme which maps a
'virtual node number' (VNN) onto a physical node address in the
cluster. This mapping will be stored in a file in a common location in
the clustered filesystem, but will not rely on fast coherence in the
filesystem. The table will be read only at startup, and when a
transmission error or error return from the CTDB protocol indicates
that the local copy of the node mapping may be out of date. The
contents and format of this virtual node mapping table will be
specific to the type of underlying clustering protocol in use by CTDB.

Extended Data Records

The records in the LTDB will be keyed in the same way they are for a
normal tdb in Samba, but the data portion of each record will be
augmented with an additional header. That header will contain the
following additional information:

  uint64 RSN       (record sequence number)
  uint32 DMASTER   (VNN of data master)
  uint32 LACCESSOR (VNN of last accessor)
  uint32 LACOUNT   (last accessor count)

This augmented data will not visible to the users of the CTDB API, and
will be stripped before the data is returned.

The purpose of the RSN (record sequence number) is to identify which
of the nodes in the cluster has the most recent copy of a particular
record in the database during a recovery after one or more nodes have
died. It is incremented whenever a record is updated in a LTDB by the
'DMASTER' node. See the section on "ctdb recovery" for more
information on the RSN.

The DMASTER ('data master') field is the virtual node number of the
node that 'owns the data' for a particular record. It is only
authoritative on the node which has the highest RSN for a particular
record. On other nodes it can be considered a hint only.

One of the design constraints of CTDB is that the node that has the
highest RSN for a particular record will also have its VNN equal to
the local DMASTER field of the record, and that no other node will
have its VNN equal to the DMASTER field. This allows a node to verify
that it is the 'owner' of a particular record by comparing its local
copy of the DMASTER field with its VNN. If and only if they are equal
then it knows that it is the current owner of that record.

The LACCESSOR and LACOUNT fields form the basis for a heuristic
mechanism to determine if the current data master should hand over
ownership of this record to another node. The LACCESSOR field holds
the VNN of the last node to request a copy of the record, and the
LACOUNT field holds a count of the number of consecutive requests by
that node. When LACOUNT goes above a configurable threshold then a
record transfer will be used in response to a record request (see the
sections on record request and record transfer below).

It is worth noting that this heuristic mechanism is very primitive,
and suffers from the problem that frequent remote reads of records
will require that the data master write frequently to update these
fields in the LTDB. The use of a ramdisk for the LTDB will reduce the
impact of these writes, but it still is likely that this heuristic
mechanism will need to be improved upon in future revisions of this

Location Master

In addition to the concept of a DMASTER (data master), each record
will have an associated LMASTER (location master). This is the VNN of
the node for each record that will be referred to when a node wishes
to contact the current DMASTER for a record. The LMASTER for a
particular record is determined solely by the number of virtual nodes
in the cluster and the key for the record.

Finding the DMASTER

When a node in the cluster wants to find out who the DMASTER is for a
record, it first contacts the LMASTER, which will reply with the VNN
of the DMASTER. The requesting node then contacts that DMASTER, but
must be prepared to receive a further redirect, because the value for
the DMASTER held by the LMASTER could have changed by the time the
node sends its message.

This step of returning a DMASTER reply from the LMASTER is skipped
when the LMASTER also happens to be the DMASTER for a record. In that
case the LMASTER can send a reply to the requesters query directly,
skipping the redirect stage.

It should also be noted that nodes should have a small cache of
DMASTER location replies, and can use this cache to avoid asking the
LMASTER every time for the location of a particular record. In that
case, just as in the case when they get a reply from the LMASTER, they
must be prepared for a further redirect, or errors. If they get an
error reply, then depending on the type of error reply they will
either go back to the LMASTER or will initiate a recovery process (see
the section on error handling below).

Clustered TDB API

The CTDB API is a small extension to the existing tdb API, and draws
on ideas from the vl-messaging code.

The main calls will be:

  /* initialise a ctdb context */
  struct ctdb_context *ctdb_init(TALLOC_CTX *mem_ctx);

  /* set the conditional function for a conditional append */
  int ctdb_set_conditional(struct ctdb_context *ctdb,
		           ctdb_conditional_fn fn, uint32_t condition_id, void *private);

  /* attach to the database */
  int ctdb_attach(struct ctdb_context *ctdb, const char *name,
	          int tdb_flags, int open_flags, mode_t mode);

  /* fetch a locked record, unlock via talloc_free() */
  struct ctdb_record  *ctdb_fetch_locked(struct ctdb_context *ctdb,
                                       TALLOC_CTX *mem_ctx,
				       TDB_DATA key);

  /* store a record fetched with ctdb_fetch_locked(), and release the lock */
  int ctdb_store_unlock(struct tdb_context *ctdb, struct ctdb_record *rec);

  /* fetch a record unlocked */
  TDB_DATA ctdb_fetch(struct ctdb_context *ctdb, TALLOC_CTX *mem_ctx, TDB_DATA key);
  /* delete a record without locking (use tdb_store_unlock with NULL
     data to delete a locked record) */
  int ctdb_delete(struct ctdb_context *tdb, TDB_DATA key);
  /* conditionally append data to a record */
  int ctdb_conditional_append(struct ctdb_context *ctdb, uint32_t condition_id,
			      TDB_DATA key, TDB_DATA data);

  /* remove a piece of a record, maybe triggering a function */
  int ctdb_remove_and_trigger(struct ctdb_context *ctdb, TDB_DATA key, TDB_DATA data);

  NOTE: above API still needs a lot more development

Dispatcher Daemon

Coordination between the nodes in the cluster will happen via a
'dispatcher daemon'. This daemon will listen for CTDB protocol
requests from other nodes, and from the local smbd via a unix domain
datagram socket.

The nature of the protocol requests described below mean that the
dispatcher daemon will need to be written in an event driven manner,
with all operations happening asynchronously.

LTDB Bypass

Experiments show that sending a request via the dispatcher daemon will
add about 4us on a otherwise idle multi-processor system (as measured
on a Linux 2.6 kernel with dual Xeon processors). On a busy system or
on a single core system this time could rise considerably, making it
preferable for the dispatcher daemon to be avoided in common cases.

This will be achieved by allowing smbd to directly attach to the LTDB,
and to perform CTDB_REQ_FETCH_LOCKED and CTDB_REQ_FETCH calls directly
on the LTDB once it determines (with the record lock held) that the
local node is in fact the DMASTER for the record.

This means that if a record is frequently accessed by one node then it
will migrate via the DMASTER heuristics to requesting node, and from
then on that client will have direct local access to the record.

CTDB Protocol

The CTDB protocol consists of the following message types:









additional message types will be used during node recovery after one
or more nodes have crashed. They will be dealt with separately below.

CTDB protocol header

Every CTDB_REQ_* packet contains the following header:

   uint32  OPERATION  (CTDB opcode)
   uint32  DESTNODE   (destination VNN)
   uint32  SRCNODE    (source VNN)
   uint32  REQID      (request id)
   uint32  REQTIMEOUT (request timeout, milliseconds)

Every CTDB_REPLY_* packet contains the following header:

   uint32  OPERATION (CTDB opcode)
   uint32  DESTNODE  (destination VNN)
   uint32  SRCNODE   (source VNN)
   uint32  REQID     (request id)

The DESTNODE and SRCNODE are used to determine if the virtual node
mapping table has been updated. If a node receives a message which is
not for one of its own VNN numbers then it will send back a reply with
a CTDB_ERR_VNN_MAP status.

The REQID is used to allow a node to have multiple outstanding
requests on an unordered transport. A reply will always contain the
same REQID as the corresponding request. It is up to each node to keep
track of what REQID values are pending.

The REQTIMEOUT field is used to determine how long the receiving node
should keep trying to complete the request if the record is currently
locked by another client. A value of zero means to fail immediately if
the record is locked.


   uint32  KEYLEN
   uint8   KEY[]

A CTDB_REQ_FETCH request is used when a node wishes to fetch the
contents of a record but does not need to lock or update the record. 

A server receiving a CTDB_REQ_FETCH must send one of the following

   1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
      record (as determined by the key)

   2) a CTDB_REPLY_FETCH if it is the DMASTER for the node, and wants
      to keep the DMASTER role for the time being.

   3) a CTDB_REQUEST_DMASTER if it wishes to hand over the DMASTER role
      for the record to the requesting node.

The following errors can be generated:


The 3rd type of reply is special, as it is sent not to the requesting
node, but to the LMASTER. It is a message that tells the LMASTER that
this node no longer wants to be the DMASTER for a record, and asks the
LMASTER to grant the DMASTER status to another node.

When the LMASTER receives a CTDB_REQUEST_DMASTER request from a node,
it will in turn send a CTDB_REPLY_DMASTER reply to the original
requesting node. Also see the section on "dmaster handover" below.


   uint32  DATALEN
   uint8   DATA[]

A CTDB_REPLY_FETCH contains the data for the record requested in a


   uint32  LCKTIMEOUT (milliseconds)
   uint32  KEYLEN
   uint8   KEY[]

A node sends a CTDB_REQ_FETCH_LOCKED when it wishes to lock and fetch
a record, for possible future update. The record that is wanted is
specified by the KEY[] field. The client must also specify a lock

A server receiving a CTDB_REQ_FETCH_LOCKED must send one of the
following replies:

   1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
      record (as determined by the key)

   2) a CTDB_REQUEST_DMASTER if it wishes to hand over the DMASTER
      role for the record to the requesting node. See "dmaster
      handover" below.

   3) a CTDB_REPLY_FETCH_LOCKED if it is the DMASTER for the node, and
      wants to keep the DMASTER role for the time being.

The following errors can be generated:


The server replying to a CTDB_REQ_FETCH_LOCKED request only needs to
keep state regarding the request in case (3) above. In that case the
server needs to setup a timer with the given LCKTIMEOUT, and needs to
keep track of who has the record locked. 

That state is kept until one of 3 things happens:

  1) the timer expires, in which case the record is unlocked. The
     client node is not notified that this has happened. The client
     finds out that its lock timed out when it tries to unlock the

  2) The client sends a CTDB_REQ_UNLOCK request

  3) The client sends a CTDB_REQ_STORE_UNLOCK request

I should note that CTDB_REQ_FETCH_LOCKED is quite a complex operation,
and it is entirely possible that we can do without it in CTDB. It may
be that we can do all the operations we need via the
CTDB_REQ_CONDITIONAL_APPEND operation instead, which would make things
both much faster and simpler. 


   uint32  DATALEN
   uint8   DATA[]

A CTDB_REPLY_FETCH_LOCKED contains the data for the record requested


   uint32  CONDITIONID
   uint32  KEYLEN
   uint8   KEY[]
   uint32  DATALEN
   uint8   DATA[]

A CTDB_REQ_CONDITIONAL_APPEND is used to conditionally append data to
a existing record. The CONDITIONID corresponds to the condition_id
given in a ctdb_set_conditional() call when the database was
opened. It identifies which conditional function should be used.

The DATA[] part of the request is the data to be conditionally
appended to the record.

A node receiving this request can optionally choose to send a
CTDB_REQUEST_DMASTER to the LMASTER to hand over control of the record
to the requesting node. This is done when the heuristics that use
LACCESSOR and LACOUNT determine that it would be better for the caller
to have direct access.


A CTDB_REPLY_CONDITIONAL_APPEND is sent on a successful conditional
append. It contains no additional data.


   uint32  CONDITIONID
   uint32  KEYLEN
   uint8   KEY[]
   uint32  DATALEN
   uint8   DATA[]

A CTDB_REQ_REMOVE_TRIGGER requests that a part of a record previously
added with a CTDB_REQ_CONDITIONAL_APPEND be removed. The conditional
function set at database open time does the work of finding the data
within the current record, and potentially triggering further events
due to the removal of this record.


   uint32  KEYLEN
   uint8   KEY[]

A CTDB_REQ_DELETE request is sent to delete a record. The server will
respond in one of the following ways:

   1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
      record (as determined by the key)

   3) a CTDB_REPLY_DELETE if it is the DMASTER for the node, and has
      deleted the node. In this case the DMASTER for the node reverts
      back to the LMASTER and the server sends CTDB_REQUEST_DMASTER to
      the LMASTER node.

The following errors can be generated:



A CTDB_REPLY_DELETE is sent on a successful delete. It contains no
additional data.


A CTDB_REQ_UNLOCK is sent when a node wishes to unlock a record
previously locked with a CTDB_REQ_FETCH_LOCKED request. It contains no
additional data, but the REQID of the request must be the same as the
REQID of the original CTDB_REQ_FETCH_LOCKED request.

The following error codes can be generated



A CTDB_REPLY_UNLOCK is sent on successful unlock of a record. It
contains no additional data.


   uint32  DMASTER (VNN of requested new DMASTER)
   uint32  KEYLEN
   uint8   KEY[]
   uint32  DATALEN
   uint8   DATA[]

The CTDB_REQUEST_DMASTER request is unusual in that it is always sent
to the LMASTER, and may be sent in response to a different request
from another node. It asks the LMASTER to hand over the DMASTER status
to another node, along with the current data for the record.

If LMASTER is equal to the requested new DMASTER then no further
packets need to be sent, as the LMASTER has now become the DMASTER. If
the requested DMASTER is not equal to the LMASTER then the LMASTER
will send a CTDB_REPLY_DMASTER to the new requested DMASTER.

This message will have the same REQID as the incoming
message that triggered it.


   uint32  DATALEN
   uint8   DATA[]

This message always comes from the LMASTER, and tells a node that it
it is now the DMASTER for a record.

This message will have the same REQID as the incoming
CTDB_REQUEST_DMASTER that triggered it.


  uint32 DMASTER

A CTDB_REPLY_REDIRECT is sent when a node receives a request for a
record for which it is not the current DMASTER. The DMASTER field in
the reply is a hint to the requester giving the next DMASTER it should
try. If no reasonable DMASTER value is known then the LMASTER is
specified in the reply.


  uint32 STATUS
  uint32 MSGLEN
  uint8  MSG[]

All errors from CTDB_REQ_* messages are sent using a CTDB_REPLY_ERROR
reply. The STATUS field indicates the type of error, from the
CTDB_ERR_* range of errors. The MSG field is an optional UTF8 encoded
string that can provide additional human readable information
regarding the error.

The possible error codes are:


CTDB Recovery

Database recovery when one or more nodes go down is crucial to the
robustness of the cluster. In CTDB, recovery is triggered when one of
the following things happens:

 1) a reply to a CTDB message is not received after a reasonable time. 

 2) an admin manually requests that one or more nodes be removed from
    the cluster, or added to the cluster

 3) a CTDB_ERR_VNN_MAP error is received, indicating that either the
    sending or receiving node has the wrong VNN map

A node that detects one of these conditions starts the recovery
process. It immediately stops processing normal CTDB messages and
sends a message to all nodes starting a global recovery. I have not
yet worked out the precise nature of these messages (that should
appear in a later version of this document), but some basics are

 1) the recovery process needs to assign every node a new VNN, and
    will choose VNNs that are different from all the VNNs currently in
    use. This is important to ensure that none of the old VNNs remain
    valid, so we can detect when a 'zombie' node that is
    non-responsive during recovery starts sending messages again. When
    such a node wakes up it will trigger a CTDB_ERR_VNN_MAP message as
    soon as it tries to send a CTDB message.

 2) There will be one 'master' node controlling the recovery
    process. We need to determine how this node is determined,
    probably using the lowest numbered physical node that is currently

 3) at the end of recovery there needs to be a global ACK that the
    recovery has concluded before normal CTDB messages start again.

 4) nodes will need to look through their data structures of open
    resources to recover the pieces of the data for each record. These
    will then be pieced together with a mechanism very similar to a

 5) there will need to be a callback function to allow the database to
    get at the data from (4).

In version 0.1 of this document it was envisioned that the recovery
would be based solely on the RSN, rolling back each record to the
record with the highest RSN for that record across the cluster. While
I think that mechanism would still work, it is harder to prove it is
correct in all cases than a mechanism based on nodes re-supplying
their own pieces of the data for each record.


Lots more to work out ...

 - Recovery protocol details

 - What if CTDB_REQ_DMASTER message is lost?

 - Should deleted records still be stored by the LMASTER?  If not,
   then how does recovery avoid undeleting a record?

 - Client based recovery - highest seqnum, plus any actively locked
   records versus structure driven recovery.

 - Need to work out the consequences of message loss for each of the
   message types. Especially replies.

 - What events system to use?

 - explain recovery model, with deliberate rollback of records

 - what transport API to use? Standardise on MPI? Use a lower level
   API? Use a transport abstraction? What is the latency cost of an

 - how to integrate event handling with the transport? Pull Samba4
   events system into tdb?


This work is the result of many long discussions and experiments by
Volker Lendecke, Sven Oehme, Alexander Bokovoy, Aleksey Fedoseev and
Andrew Tridgell

More information about the samba-technical mailing list