using virtual synchrony for CTDB

Steven Dake sdake at redhat.com
Fri Oct 6 07:04:29 GMT 2006


The CTDB proposal was pointed out to me by a colleague.  Good work guys.

I have a suggestion to use virtual synchrony for the transport
mechanism of CTDB.  I think you will find something like plain TCP/IP
unsuitable for a variety of reasons.

Virtual synchrony is a group communication (aka clustering) messaging
model that provides several guarantees.  The first is that every
message is delivered in agreed order; another is that messages are
self-delivered.  It also provides guarantees about which messages are
delivered when a node fails (every message that some surviving node
has a copy of will be delivered).

What does all that mean?
A group communication messaging model allows a group of processes to
communicate with each other.  Any node may send a message.  That
message is then delivered to every node using multicast (or, in some
circumstances, unicast, depending on the protocol).

With agreed ordering, all nodes within the cluster agree upon the
order of the messages before any message is delivered.  Consider an
example: four nodes each send a message, m1 (from node 1), m2 (from
node 2), m3 (from node 3), and m4 (from node 4).  In this case, even
if all the messages are sent at about the same time, all nodes will
receive them in the same order.

i.e.:
n1: m3 m2 m1 m4
n2: m3 m2 m1 m4
n3: m3 m2 m1 m4
n4: m3 m2 m1 m4

It is not possible for one node to receive a message in a different
order than the other nodes.  This is often called "total ordering" or
"agreed ordering".

I.e., the following is not possible:
n1: m1 m2 m3 m4
n2: m3 m2 m1 m4

Since n1 and n2 do not agree on the order, this would violate the
ordering guarantees of virtual synchrony.

Self-delivery means that all messages are also delivered to the
processor that originated them.  In the first example above, n1
receives a copy of m1, which it originated; n2 receives a copy of m2,
which it originated; and so on.

How is this useful?  Consider a lock service example:

Nodes A, B, and C all want to lock a resource R1 at about the same
time.  They all send messages: A sends m_lr1A (lock resource 1 from
A), B sends m_lr1B, and C sends m_lr1C.

In this case a variety of scenarios could occur:
A receives m_lr1A, m_lr1B, m_lr1C
B receives m_lr1A, m_lr1B, m_lr1C
C receives m_lr1A, m_lr1B, m_lr1C

A, B, and C agree on the order, so the lock can be granted to A
immediately without any round-trip replies.  B and C are queued.

or

A receives m_lr1B, m_lr1A, m_lr1C
B receives m_lr1B, m_lr1A, m_lr1C
C receives m_lr1B, m_lr1A, m_lr1C

A, B, and C agree on the order, so the lock can be granted to B
immediately without any round-trip replies.  A and C are queued.

but never

A receives m_lr1B, m_lr1A, m_lr1C
B receives m_lr1A, m_lr1B, m_lr1C
C receives m_lr1B, m_lr1A, m_lr1C

In this last example A and C agree on the order but B does not, so it
violates virtual synchrony.

As can be seen from the above examples, if the order is always agreed
upon, every node in the cluster can come to the same conclusion about
which node should be granted the lock for R1.  This totally eliminates
the need for round-trip replies or any single point of failure.

OK, sounds handy, but is there any code that does all of this?  There
is a protocol implemented in C called the Totem Single Ring Protocol.
It is available in the development versions of most distributions.

http://developer.osdl.org/dev/openais

There are a variety of ways the transport could be used.  A service
handler "plugin" could be written for openais to provide the CTDB API.
This is fairly simple to do: both the API and the plugin are written
in C.  openais provides several key features for plugins, such as
timers, virtual synchrony messaging, IPC services, and anything else
of use in developing a "daemon"-style app.

Another option is to write a daemon which uses an API exposed by
openais called CPG (closed process groups).  This is a bit more work,
as the daemon would have to implement its own IPC mechanism.

Finally, Totem is available as a shared, linkable library for use
directly in other applications.  It requires the use of the poll
syscall, so a poll abstraction with timers is provided in this minimal
configuration option.

Please feel free to contact me on the list if you have any questions.

Regards
-steve



More information about the samba-technical mailing list