CTDB - the 'Mainz' plan for clustered Samba

Sun Oct 1 02:45:52 GMT 2006

On Sat, 2006-09-30 at 11:07 +1000, an unknown sender wrote:
[snip]
> Extended Data Records
> ---------------------
> 
> The records in the LTDB will be keyed in the same way they are for a
> normal tdb in Samba, but the data portion of each record will be
> augmented with an additional header. That header will contain the
> following additional information:
> 
>   uint64 RSN       (record sequence number)
>   uint32 DMASTER   (VNN of data master)
>   uint32 LACCESSOR (VNN of last accessor)
>   uint32 LACOUNT   (last accessor count)
> 
> This augmented data will not be visible to the users of the CTDB API, and
> will be stripped before the data is returned.
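
For what it's worth, here is how I picture the header and the stripping step. This is only a sketch: the struct name, the field order, the packing and the helper ltdb_strip_header() are my assumptions, not anything the plan specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-record header. Field names follow the list in the
   plan; the exact layout and packing are assumptions. */
struct ltdb_header {
	uint64_t rsn;        /* record sequence number */
	uint32_t dmaster;    /* VNN of data master */
	uint32_t laccessor;  /* VNN of last accessor */
	uint32_t lacount;    /* last accessor count */
};

/* Split a raw LTDB record into its header and the user-visible data.
   Returns 0 on success, -1 if the record is too short to hold a header. */
static int ltdb_strip_header(const uint8_t *raw, size_t raw_len,
			     struct ltdb_header *hdr,
			     const uint8_t **data, size_t *data_len)
{
	assert(raw != NULL && hdr != NULL && data != NULL && data_len != NULL);
	if (raw_len < sizeof(*hdr)) {
		return -1;
	}
	memcpy(hdr, raw, sizeof(*hdr));
	*data = raw + sizeof(*hdr);
	*data_len = raw_len - sizeof(*hdr);
	return 0;
}
```

The caller of the CTDB API would only ever see the data pointer and length, never the header.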
> 
> The purpose of the RSN (record sequence number) is to identify which
> of the nodes in the cluster has the most recent copy of a particular
> record in the database during a recovery after one or more nodes have
> died. It is incremented whenever a record is updated in a LTDB by the
> 'DMASTER' node. See the section on "ctdb recovery" for more
> information on the RSN.
> 
> The DMASTER ('data master') field is the virtual node number of the
> node that 'owns the data' for a particular record. It is only
> authoritative on the node which has the highest RSN for a particular
> record. On other nodes it can be considered a hint only.
> 
> One of the design constraints of CTDB is that the node that has the
> highest RSN for a particular record will also have its VNN equal to
> the local DMASTER field of the record, and that no other node will
> have its VNN equal to the DMASTER field. This allows a node to verify
> that it is the 'owner' of a particular record by comparing its local
> copy of the DMASTER field with its VNN. If and only if they are equal
> then it knows that it is the current owner of that record.
> 
> The LACCESSOR and LACOUNT fields form the basis for a heuristic
> mechanism to determine if the current data master should hand over
> ownership of this record to another node. The LACCESSOR field holds
> the VNN of the last node to request a copy of the record, and the
> LACOUNT field holds a count of the number of consecutive requests by
> that node. When LACOUNT goes above a configurable threshold then a
> record transfer will be used in response to a record request (see the
> sections on record request and record transfer below).
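
A minimal sketch of that heuristic as I read it. The function name note_request(), the choice to start the count at 1 on a change of accessor, and the strict "greater than" comparison are my assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the extended record header. */
struct accessor_state {
	uint32_t laccessor;  /* VNN of last node to request the record */
	uint32_t lacount;    /* consecutive requests by that node */
};

/* Record a remote request from `requester` and report whether the
   data master should hand the record over instead of just replying. */
static bool note_request(struct accessor_state *st, uint32_t requester,
			 uint32_t threshold)
{
	assert(st != NULL);
	if (st->laccessor == requester) {
		st->lacount++;
	} else {
		/* a different node asked: restart the run */
		st->laccessor = requester;
		st->lacount = 1;
	}
	return st->lacount > threshold;
}
```

Note that every call writes to the state, which is exactly the write-on-remote-read cost of this scheme.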
> 
> It is worth noting that this heuristic mechanism is very primitive,
> and suffers from the problem that frequent remote reads of records
> will require that the data master write frequently to update these
> fields in the LTDB. The use of a ramdisk for the LTDB will reduce the
> impact of these writes, but it still is likely that this heuristic
> mechanism will need to be improved upon in future revisions of this
> design.
> 
> 
> Location Master
> ---------------
> 
> In addition to the concept of a DMASTER (data master), each record
> will have an associated LMASTER (location master). This is the VNN of
> the node for each record that will be referred to when a node wishes
> to contact the current DMASTER for a record. The LMASTER for a
> particular record is determined solely by the number of virtual nodes
> in the cluster and the key for the record.

Can you elaborate more on the role of the LMASTER? If LMASTER assignment
is per-record, how do I figure out which LMASTER to contact to find the
DMASTER for a particular record? If an extended LTDB record contains a
DMASTER field, and a node will redirect if it is not the DMASTER, why is
the LMASTER necessary?
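
My guess is that "determined solely by the number of virtual nodes and the key" means something like a hash of the key modulo the node count, so any node can compute the LMASTER locally without asking anyone. A sketch of that guess follows. The FNV-1a hash, the function names, and the assumption that VNNs run from 0 to num_vnn-1 are all mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, used here only as a stand-in; the plan does not say which
   hash CTDB would use. */
static uint32_t key_hash(const uint8_t *key, size_t len)
{
	uint32_t h = 2166136261u;
	for (size_t i = 0; i < len; i++) {
		h ^= key[i];
		h *= 16777619u;
	}
	return h;
}

/* Map a key to the VNN of its LMASTER, given the number of virtual
   nodes. Assumes VNNs are numbered 0 .. num_vnn-1. */
static uint32_t lmaster_for_key(const uint8_t *key, size_t len,
				uint32_t num_vnn)
{
	assert(num_vnn > 0);
	return key_hash(key, len) % num_vnn;
}
```

If that is the idea, then the LMASTER is the one fixed place to ask when the local DMASTER hint is stale or missing.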

> Finding the DMASTER
> -------------------
> 
> When a node in the cluster wants to find out who the DMASTER is for a
> record, it first contacts the LMASTER, which will reply with the VNN
> of the DMASTER. The requesting node then contacts that DMASTER, but
> must be prepared to receive a further redirect, because the value for
> the DMASTER held by the LMASTER could have changed by the time the
> node sends its message.
> 
> This step of returning a DMASTER reply from the LMASTER is skipped
> when the LMASTER also happens to be the DMASTER for a record. In that
> case the LMASTER can send a reply to the requester's query directly,
> skipping the redirect stage.
> 
> It should also be noted that nodes should have a small cache of
> DMASTER location replies, and can use this cache to avoid asking the
> LMASTER every time for the location of a particular record. In that
> case, just as in the case when they get a reply from the LMASTER, they
> must be prepared for a further redirect, or errors. If they get an
> error reply, then depending on the type of error reply they will
> either go back to the LMASTER or will initiate a recovery process (see
> the section on error handling below).
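
The lookup with redirects could be modelled roughly like this. It is a toy model under my own assumptions: each node's view of the DMASTER is an array entry, a redirect is just following that entry, and MAX_HOPS is an arbitrary limit.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_HOPS 8  /* arbitrary limit before giving up */

/* Toy model: dmaster_view[n] is the DMASTER that node n believes owns
   the record. A node that names itself is the owner; any other node
   would answer with a redirect to dmaster_view[n].
   Returns the owner's VNN, or -1 if it could not be found. */
static int find_dmaster(const uint32_t *dmaster_view, uint32_t num_nodes,
			uint32_t lmaster)
{
	uint32_t node = lmaster;  /* the first question goes to the LMASTER */

	assert(lmaster < num_nodes);
	for (int hop = 0; hop < MAX_HOPS; hop++) {
		uint32_t next = dmaster_view[node];
		if (next >= num_nodes) {
			return -1;  /* bogus hint */
		}
		if (next == node) {
			return (int)node;  /* this node owns the record */
		}
		node = next;  /* follow the redirect */
	}
	return -1;  /* too many redirects */
}
```

A caller with a cached DMASTER location would start from the cached node instead of the LMASTER, and on a -1 result would go back to the LMASTER or start recovery.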
> 
> 
> Clustered TDB API
> -----------------
> 
> The CTDB API is a small extension to the existing tdb API, and draws
> on ideas from the vl-messaging code.
> 
> The main calls will be:
> 
>   /* initialise a ctdb context */
>   struct ctdb_context *ctdb_init(TALLOC_CTX *mem_ctx);
> 
>   /* set the conditional function for a conditional append */
>   int ctdb_set_conditional(struct ctdb_context *ctdb,
> 		           ctdb_conditional_fn fn, uint32_t condition_id, void *private);
> 
>   /* attach to the database */
>   int ctdb_attach(struct ctdb_context *ctdb, const char *name,
> 	          int tdb_flags, int open_flags, mode_t mode);
> 
>   
>   /* fetch a locked record, unlock via talloc_free() */
>   struct ctdb_record  *ctdb_fetch_locked(struct ctdb_context *ctdb,
>                                        TALLOC_CTX *mem_ctx,
> 				       TDB_DATA key);
> 
>   /* store a record fetched with ctdb_fetch_locked(), and release the lock */
>   int ctdb_store_unlock(struct ctdb_context *ctdb, struct ctdb_record *rec);
> 
>   /* fetch a record unlocked */
>   TDB_DATA ctdb_fetch(struct ctdb_context *ctdb, TALLOC_CTX *mem_ctx, TDB_DATA key);
>   
>   /* delete a record without locking (use ctdb_store_unlock with NULL
>      data to delete a locked record) */
>   int ctdb_delete(struct ctdb_context *ctdb, TDB_DATA key);
>   
>   /* conditionally append data to a record */
>   int ctdb_conditional_append(struct ctdb_context *ctdb, uint32_t condition_id,
> 			      TDB_DATA key, TDB_DATA data);
> 
>   /* remove a piece of a record, maybe triggering a function */
>   int ctdb_remove_and_trigger(struct ctdb_context *ctdb, TDB_DATA key, TDB_DATA data);
> 
>   NOTE: above API still needs a lot more development
> 
> 
> Dispatcher Daemon
> -----------------
> 
> Coordination between the nodes in the cluster will happen via a
> 'dispatcher daemon'. This daemon will listen for CTDB protocol
> requests from other nodes, and from the local smbd via a unix domain
> datagram socket.
> 
> The nature of the protocol requests described below means that the
> dispatcher daemon will need to be written in an event driven manner,
> with all operations happening asynchronously.
> 
> 
> LTDB Bypass
> -----------
> 
> Experiments show that sending a request via the dispatcher daemon will
> add about 4us on an otherwise idle multi-processor system (as measured
> on a Linux 2.6 kernel with dual Xeon processors). On a busy system or
> on a single core system this time could rise considerably, making it
> preferable for the dispatcher daemon to be avoided in common cases.
> 
> This will be achieved by allowing smbd to directly attach to the LTDB,
> and to perform CTDB_REQ_FETCH_LOCKED and CTDB_REQ_FETCH calls directly
> on the LTDB once it determines (with the record lock held) that the
> local node is in fact the DMASTER for the record.
> 
> This means that if a record is frequently accessed by one node then it
> will migrate via the DMASTER heuristics to the requesting node, and from
> then on that client will have direct local access to the record.
> 
> 
> CTDB Protocol
> -------------
> 
> The CTDB protocol consists of the following message types:
> 
>   CTDB_REQ_FETCH
>   CTDB_REPLY_FETCH
> 
>   CTDB_REQ_FETCH_LOCKED
>   CTDB_REPLY_FETCH_LOCKED
> 
>   CTDB_REQ_STORE_UNLOCK
>   CTDB_REPLY_STORE_UNLOCK
> 
>   CTDB_REQ_DELETE
>   CTDB_REPLY_DELETE
> 
>   CTDB_REQ_UNLOCK
>   CTDB_REPLY_UNLOCK
> 
>   CTDB_REQ_CONDITIONAL_APPEND
>   CTDB_REPLY_CONDITIONAL_APPEND
> 
>   CTDB_REQ_REMOVE_TRIGGER
>   CTDB_REPLY_REMOVE_TRIGGER
> 
>   CTDB_REPLY_REDIRECT
>   CTDB_REQUEST_DMASTER
>   CTDB_REPLY_DMASTER
>   CTDB_REPLY_ERROR
> 
> 
> Additional message types will be used during node recovery after one
> or more nodes have crashed. They will be dealt with separately below.

Is this a UDP or TCP transport?

> CTDB protocol header
> --------------------
> 
> Every CTDB_REQ_* packet contains the following header:
> 
>    uint32  OPERATION  (CTDB opcode)
>    uint32  DESTNODE   (destination VNN)
>    uint32  SRCNODE    (source VNN)
>    uint32  REQID      (request id)
>    uint32  REQTIMEOUT (request timeout, milliseconds)
> 
> Every CTDB_REPLY_* packet contains the following header:
> 
>    uint32  OPERATION (CTDB opcode)
>    uint32  DESTNODE  (destination VNN)
>    uint32  SRCNODE   (source VNN)
>    uint32  REQID     (request id)
> 
> The DESTNODE and SRCNODE are used to determine if the virtual node
> mapping table has been updated. If a node receives a message which is
> not for one of its own VNN numbers then it will send back a reply with
> a CTDB_ERR_VNN_MAP status.
> 
> The REQID is used to allow a node to have multiple outstanding
> requests on an unordered transport. A reply will always contain the
> same REQID as the corresponding request. It is up to each node to keep
> track of what REQID values are pending.
> 
> The REQTIMEOUT field is used to determine how long the receiving node
> should keep trying to complete the request if the record is currently
> locked by another client. A value of zero means to fail immediately if
> the record is locked.
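
To make the two headers concrete, here is a sketch of them as C structs, plus the two checks described above. The struct and function names are hypothetical, and I am ignoring wire byte order, which the plan does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ctdb_req_header {
	uint32_t operation;   /* CTDB opcode */
	uint32_t destnode;    /* destination VNN */
	uint32_t srcnode;     /* source VNN */
	uint32_t reqid;       /* request id */
	uint32_t reqtimeout;  /* request timeout, milliseconds */
};

struct ctdb_reply_header {
	uint32_t operation;   /* CTDB opcode */
	uint32_t destnode;    /* destination VNN */
	uint32_t srcnode;     /* source VNN */
	uint32_t reqid;       /* request id */
};

/* True if destnode is one of this node's VNNs. When false, the
   receiver would answer with a CTDB_ERR_VNN_MAP status. */
static bool dest_is_local(const struct ctdb_req_header *hdr,
			  const uint32_t *my_vnns, size_t n_vnns)
{
	assert(hdr != NULL);
	for (size_t i = 0; i < n_vnns; i++) {
		if (my_vnns[i] == hdr->destnode) {
			return true;
		}
	}
	return false;
}

/* Build the reply header for a request: swap source and destination,
   and copy the REQID so the sender can match it to its pending request. */
static struct ctdb_reply_header make_reply(const struct ctdb_req_header *req,
					   uint32_t reply_opcode)
{
	struct ctdb_reply_header rep;

	assert(req != NULL);
	rep.operation = reply_opcode;
	rep.destnode = req->srcnode;
	rep.srcnode = req->destnode;
	rep.reqid = req->reqid;
	return rep;
}
```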
> 
> 
> CTDB_REQ_FETCH
> --------------
> 
>    uint32  KEYLEN
>    uint8   KEY[]
> 
> A CTDB_REQ_FETCH request is used when a node wishes to fetch the
> contents of a record but does not need to lock or update the record. 
> 
> A server receiving a CTDB_REQ_FETCH must send one of the following
> replies:
> 
>    1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
>       record (as determined by the key)
> 
>    2) a CTDB_REPLY_FETCH if it is the DMASTER for the record, and wants
>       to keep the DMASTER role for the time being.
> 
>    3) a CTDB_REQUEST_DMASTER if it wishes to hand over the DMASTER role
>       for the record to the requesting node.
> 
> The following errors can be generated:
> 
>    CTDB_ERR_VNN_MAP
>    CTDB_ERR_TIMEOUT
> 
> The 3rd type of reply is special, as it is sent not to the requesting
> node, but to the LMASTER. It is a message that tells the LMASTER that
> this node no longer wants to be the DMASTER for a record, and asks the
> LMASTER to grant the DMASTER status to another node.
> 
> When the LMASTER receives a CTDB_REQUEST_DMASTER request from a node,
> it will in turn send a CTDB_REPLY_DMASTER reply to the original
> requesting node. Also see the section on "dmaster handover" below.
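
The three-way choice, including where each message is sent, might look like the sketch below. The enum, struct and function names are made up for illustration, and the hand_over flag stands in for whatever the LACCESSOR/LACOUNT heuristic decides.

```c
#include <stdbool.h>
#include <stdint.h>

enum fetch_msg { MSG_REPLY_REDIRECT, MSG_REPLY_FETCH, MSG_REQUEST_DMASTER };

struct fetch_decision {
	enum fetch_msg msg;  /* which message to send */
	uint32_t dest;       /* VNN the message goes to */
};

/* Decide how a node answers a CTDB_REQ_FETCH for one record. */
static struct fetch_decision decide_fetch(uint32_t my_vnn,
					  uint32_t local_dmaster,
					  uint32_t requester,
					  uint32_t lmaster,
					  bool hand_over)
{
	struct fetch_decision d;

	if (local_dmaster != my_vnn) {
		/* not the owner: point the requester elsewhere */
		d.msg = MSG_REPLY_REDIRECT;
		d.dest = requester;
	} else if (hand_over) {
		/* give up ownership: this goes to the LMASTER, not the requester */
		d.msg = MSG_REQUEST_DMASTER;
		d.dest = lmaster;
	} else {
		/* keep ownership and return the data */
		d.msg = MSG_REPLY_FETCH;
		d.dest = requester;
	}
	return d;
}
```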
> 
> 
> CTDB_REPLY_FETCH
> ----------------
> 
>    uint32  DATALEN
>    uint8   DATA[]
> 
> A CTDB_REPLY_FETCH contains the data for the record requested in a
> CTDB_REQ_FETCH. 
> 
> 
> CTDB_REQ_FETCH_LOCKED
> ---------------------
> 
>    uint32  LCKTIMEOUT (milliseconds)
>    uint32  KEYLEN
>    uint8   KEY[]
> 
> A node sends a CTDB_REQ_FETCH_LOCKED when it wishes to lock and fetch
> a record, for possible future update. The record that is wanted is
> specified by the KEY[] field. The client must also specify a lock
> timeout.
> 
> A server receiving a CTDB_REQ_FETCH_LOCKED must send one of the
> following replies:
> 
>    1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
>       record (as determined by the key)
> 
>    2) a CTDB_REQUEST_DMASTER if it wishes to hand over the DMASTER
>       role for the record to the requesting node. See "dmaster
>       handover" below.
> 
>    3) a CTDB_REPLY_FETCH_LOCKED if it is the DMASTER for the record, and
>       wants to keep the DMASTER role for the time being.
> 
> The following errors can be generated:
> 
>    CTDB_ERR_VNN_MAP
>    CTDB_ERR_TIMEOUT
> 
> The server replying to a CTDB_REQ_FETCH_LOCKED request only needs to
> keep state regarding the request in case (3) above. In that case the
> server needs to set up a timer with the given LCKTIMEOUT, and needs to
> keep track of who has the record locked. 
> 
> That state is kept until one of 3 things happens:
> 
>   1) the timer expires, in which case the record is unlocked. The
>      client node is not notified that this has happened. The client
>      finds out that its lock timed out when it tries to unlock the
>      record.
> 
>   2) The client sends a CTDB_REQ_UNLOCK request
> 
>   3) The client sends a CTDB_REQ_STORE_UNLOCK request
> 
> I should note that CTDB_REQ_FETCH_LOCKED is quite a complex operation,
> and it is entirely possible that we can do without it in CTDB. It may
> be that we can do all the operations we need via the
> CTDB_REQ_CONDITIONAL_APPEND operation instead, which would make things
> both much faster and simpler. 
> 
> 
> CTDB_REPLY_FETCH_LOCKED
> -----------------------
> 
>    uint32  DATALEN
>    uint8   DATA[]
> 
> A CTDB_REPLY_FETCH_LOCKED contains the data for the record requested
> in a CTDB_REQ_FETCH_LOCKED. 
> 
> 
> CTDB_REQ_CONDITIONAL_APPEND
> ---------------------------
> 
>    uint32  CONDITIONID
>    uint32  KEYLEN
>    uint8   KEY[]
>    uint32  DATALEN
>    uint8   DATA[]
> 
> A CTDB_REQ_CONDITIONAL_APPEND is used to conditionally append data to
> an existing record. The CONDITIONID corresponds to the condition_id
> given in a ctdb_set_conditional() call when the database was
> opened. It identifies which conditional function should be used.
> 
> The DATA[] part of the request is the data to be conditionally
> appended to the record.
> 
> A node receiving this request can optionally choose to send a
> CTDB_REQUEST_DMASTER to the LMASTER to hand over control of the record
> to the requesting node. This is done when the heuristics that use
> LACCESSOR and LACOUNT determine that it would be better for the caller
> to have direct access.
> 
> CTDB_REPLY_CONDITIONAL_APPEND
> -----------------------------
> 
> A CTDB_REPLY_CONDITIONAL_APPEND is sent on a successful conditional
> append. It contains no additional data.
> 
> 
> CTDB_REQ_REMOVE_TRIGGER
> -----------------------
> 
>    uint32  CONDITIONID
>    uint32  KEYLEN
>    uint8   KEY[]
>    uint32  DATALEN
>    uint8   DATA[]
> 
> A CTDB_REQ_REMOVE_TRIGGER requests that a part of a record previously
> added with a CTDB_REQ_CONDITIONAL_APPEND be removed. The conditional
> function set at database open time does the work of finding the data
> within the current record, and potentially triggering further events
> due to the removal of this record.
> 
> CTDB_REQ_DELETE
> ---------------
> 
>    uint32  KEYLEN
>    uint8   KEY[]
> 
> A CTDB_REQ_DELETE request is sent to delete a record. The server will
> respond in one of the following ways:
> 
>    1) a CTDB_REPLY_REDIRECT if it is not the DMASTER for the given
>       record (as determined by the key)
> 
>    2) a CTDB_REPLY_DELETE if it is the DMASTER for the record, and has
>       deleted the record. In this case the DMASTER for the record reverts
>       back to the LMASTER and the server sends CTDB_REQUEST_DMASTER to
>       the LMASTER node.
> 
> The following errors can be generated:
> 
>    CTDB_ERR_VNN_MAP
>    CTDB_ERR_TIMEOUT
> 
> 
> CTDB_REPLY_DELETE
> -----------------
> 
> A CTDB_REPLY_DELETE is sent on a successful delete. It contains no
> additional data.
> 
> 
> CTDB_REQ_UNLOCK
> ---------------
> 
> A CTDB_REQ_UNLOCK is sent when a node wishes to unlock a record
> previously locked with a CTDB_REQ_FETCH_LOCKED request. It contains no
> additional data, but the REQID of the request must be the same as the
> REQID of the original CTDB_REQ_FETCH_LOCKED request.
> 
> The following error codes can be generated:
> 
>    CTDB_ERR_VNN_MAP
>    CTDB_ERR_NOT_LOCKED
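
A sketch of the per-record lock state the server would have to keep, and of the REQID match on unlock. The names are hypothetical, the clock is passed in as a plain millisecond counter, and UNLOCK_NOT_LOCKED stands in for a CTDB_ERR_NOT_LOCKED reply. Expiry is checked lazily at unlock time here rather than with a real timer.

```c
#include <stdbool.h>
#include <stdint.h>

enum unlock_status { UNLOCK_OK, UNLOCK_NOT_LOCKED };

struct lock_state {
	bool held;
	uint32_t holder;      /* VNN of the node holding the lock */
	uint32_t reqid;       /* REQID of the CTDB_REQ_FETCH_LOCKED */
	uint64_t expires_ms;  /* absolute time at which the lock lapses */
};

/* Take the lock on behalf of `holder`, valid for lcktimeout_ms. */
static void lock_record(struct lock_state *ls, uint32_t holder,
			uint32_t reqid, uint64_t now_ms,
			uint32_t lcktimeout_ms)
{
	ls->held = true;
	ls->holder = holder;
	ls->reqid = reqid;
	ls->expires_ms = now_ms + lcktimeout_ms;
}

/* Handle an unlock request. The client only learns here that its
   lock timed out, since it is not notified when the timer expires. */
static enum unlock_status unlock_record(struct lock_state *ls,
					uint32_t holder, uint32_t reqid,
					uint64_t now_ms)
{
	if (ls->held && now_ms >= ls->expires_ms) {
		ls->held = false;  /* lock timed out earlier */
	}
	if (!ls->held || ls->holder != holder || ls->reqid != reqid) {
		return UNLOCK_NOT_LOCKED;
	}
	ls->held = false;
	return UNLOCK_OK;
}
```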
> 
> 
> CTDB_REPLY_UNLOCK
> -----------------
> 
> A CTDB_REPLY_UNLOCK is sent on successful unlock of a record. It
> contains no additional data.
> 
> CTDB_REQUEST_DMASTER
> --------------------
> 
>    uint32  DMASTER (VNN of requested new DMASTER)
>    uint32  KEYLEN
>    uint8   KEY[]
>    uint32  DATALEN
>    uint8   DATA[]
> 
> The CTDB_REQUEST_DMASTER request is unusual in that it is always sent
> to the LMASTER, and may be sent in response to a different request
> from another node. It asks the LMASTER to hand over the DMASTER status
> to another node, along with the current data for the record.
> 
> If LMASTER is equal to the requested new DMASTER then no further
> packets need to be sent, as the LMASTER has now become the DMASTER. If
> the requested DMASTER is not equal to the LMASTER then the LMASTER
> will send a CTDB_REPLY_DMASTER to the new requested DMASTER.
> 
> This message will have the same REQID as the incoming
> message that triggered it.
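
The LMASTER side of this is small enough to sketch. The function name and the idea of the LMASTER recording the new DMASTER in a local variable are my assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Handle a CTDB_REQUEST_DMASTER at the LMASTER. Updates the LMASTER's
   record of who the DMASTER is, and returns true if a
   CTDB_REPLY_DMASTER still has to be sent to new_dmaster. */
static bool lmaster_handle_request_dmaster(uint32_t lmaster_vnn,
					   uint32_t new_dmaster,
					   uint32_t *recorded_dmaster)
{
	assert(recorded_dmaster != NULL);
	*recorded_dmaster = new_dmaster;
	/* if the LMASTER itself is the new owner, nothing more to send */
	return new_dmaster != lmaster_vnn;
}
```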
> 
> CTDB_REPLY_DMASTER
> ------------------
> 
>    uint32  DATALEN
>    uint8   DATA[]
> 
> This message always comes from the LMASTER, and tells a node that it
> is now the DMASTER for a record.
> 
> This message will have the same REQID as the incoming
> CTDB_REQUEST_DMASTER that triggered it.
> 
> 
> CTDB_REPLY_REDIRECT
> -------------------
> 
>   uint32 DMASTER
> 
> A CTDB_REPLY_REDIRECT is sent when a node receives a request for a
> record for which it is not the current DMASTER. The DMASTER field in
> the reply is a hint to the requester giving the next DMASTER it should
> try. If no reasonable DMASTER value is known then the LMASTER is
> specified in the reply.
> 
> 
> CTDB_REPLY_ERROR
> ----------------
> 
>   uint32 STATUS
>   uint32 MSGLEN
>   uint8  MSG[]
> 
> All errors from CTDB_REQ_* messages are sent using a CTDB_REPLY_ERROR
> reply. The STATUS field indicates the type of error, from the
> CTDB_ERR_* range of errors. The MSG field is an optional UTF8 encoded
> string that can provide additional human readable information
> regarding the error.
> 
> The possible error codes are:
> 
>  CTDB_ERR_NOT_DMASTER
>  CTDB_ERR_VNN_MAP
>  CTDB_ERR_INTERNAL_ERROR
>  CTDB_ERR_NO_MEMORY
> 
> 
> CTDB Recovery
> -------------
> 
> Database recovery when one or more nodes go down is crucial to the
> robustness of the cluster. In CTDB, recovery is triggered when one of
> the following things happens:
> 
>  1) a reply to a CTDB message is not received after a reasonable time.

As an optimisation for this case, in many clusters you can tell when a
node is gone by querying membership status, in which case you can start
any recovery before waiting for a timeout.

Alternatively, if you have a cluster deadlock (usually resulting in a
process being permanently stuck in a system call), you might want to
proactively kick that node out of the CTDB group.

>  2) an admin manually requests that one or more nodes be removed from
>     the cluster, or added to the cluster
> 
>  3) a CTDB_ERR_VNN_MAP error is received, indicating that either the
>     sending or receiving node has the wrong VNN map
> 
> A node that detects one of these conditions starts the recovery
> process. It immediately stops processing normal CTDB messages and
> sends a message to all nodes starting a global recovery. I have not
> yet worked out the precise nature of these messages (that should
> appear in a later version of this document), but some basics are
> clear:

So node A cares if B goes away iff B is the DMASTER of a record it wants
to access, right? 

>  1) the recovery process needs to assign every node a new VNN, and
>     will choose VNNs that are different from all the VNNs currently in
>     use. This is important to ensure that none of the old VNNs remain
>     valid, so we can detect when a 'zombie' node that is
>     non-responsive during recovery starts sending messages again. When
>     such a node wakes up it will trigger a CTDB_ERR_VNN_MAP message as
>     soon as it tries to send a CTDB message.

The VNN map is stored in the cluster filesystem. If we started recovery
because the cluster itself started recovery, we aren't going to get very
far by depending on data in the shared filesystem.

>  2) There will be one 'master' node controlling the recovery
>     process. We need to decide how this node is chosen,
>     probably by taking the lowest numbered physical node that is
>     currently operational.
> 
>  3) at the end of recovery there needs to be a global ACK that the
>     recovery has concluded before normal CTDB messages start again.
> 
>  4) nodes will need to look through their data structures of open
>     resources to recover the pieces of the data for each record. These
>     will then be pieced together with a mechanism very similar to a
>     CTDB_REQ_CONDITIONAL_APPEND.
> 
>  5) there will need to be a callback function to allow the database to
>     get at the data from (4).
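
On point (1), one simple way to guarantee that no old VNN stays valid is to start the new numbering above the highest VNN currently in use. A sketch under that assumption follows. The function name is hypothetical, and I ignore wrap-around of the 32-bit VNN space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assign each of the n surviving nodes a new VNN that is different
   from every VNN in old_vnns. Wrap-around of uint32_t is not handled. */
static void assign_new_vnns(const uint32_t *old_vnns, uint32_t *new_vnns,
			    size_t n)
{
	uint32_t base = 0;

	assert(old_vnns != NULL && new_vnns != NULL);
	/* find the first VNN above all the old ones */
	for (size_t i = 0; i < n; i++) {
		if (old_vnns[i] >= base) {
			base = old_vnns[i] + 1;
		}
	}
	for (size_t i = 0; i < n; i++) {
		new_vnns[i] = base + (uint32_t)i;
	}
}
```

A zombie node that wakes up still using one of the old VNNs would then fail the DESTNODE/SRCNODE check on its first message.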
> 
> In version 0.1 of this document it was envisioned that the recovery
> would be based solely on the RSN, rolling back each record to the
> record with the highest RSN for that record across the cluster. While
> I think that mechanism would still work, it is harder to prove it is
> correct in all cases than a mechanism based on nodes re-supplying
> their own pieces of the data for each record.
> 
> 
> TODO
> ----
> 
> Lots more to work out ...
> 
>  - Recovery protocol details
> 
>  - What if CTDB_REQUEST_DMASTER message is lost?
> 
>  - Should deleted records still be stored by the LMASTER?  If not,
>    then how does recovery avoid undeleting a record?
> 
>  - Client based recovery - highest seqnum, plus any actively locked
>    records versus structure driven recovery.
> 
>  - Need to work out the consequences of message loss for each of the
>    message types. Especially replies.
> 
>  - What events system to use?
> 
>  - explain recovery model, with deliberate rollback of records
> 
>  - what transport API to use? Standardise on MPI? Use a lower level
>    API? Use a transport abstraction? What is the latency cost of an
>    abstraction? 
> 
>  - how to integrate event handling with the transport? Pull Samba4
>    events system into tdb?

Have you considered how management information would be exchanged? e.g.
you probably want to make sure that all nodes have the same smb.conf
configuration.

-- 
James Peach | jpeach at samba.org