using virtual synchrony for CTDB

Fri Oct 6 18:22:37 GMT 2006

On Fri, 2006-10-06 at 10:37 -0600, David Boreham wrote:
> Steven Dake wrote:
> 
> >I have a suggestion to use virtual synchrony for the transport mechanism
> >of CTDB.  I think you will find using something like TCP/IP unsuitable
> >for a variety of reasons.
> >  
> >
> I'm very far from being a VS expert, but when I looked into a few
> of the open source implementations available a while back it became
> clear (to me at least) that they have a kind of 'snake oil' property
> in that they appear to deliver magical services but do so only by
> using quite inefficient methods underneath the covers. For example
> it appears that one is avoiding network round-trips but in fact to
> implement its delivery guarantees the message middleware layer needs
> to propagate a token around the set of participating nodes which of
> course involves many sends and receives.
> 

In fact totem does rotate a token around a virtual ring of processors.
Whenever a site has a token, it may then multicast messages.  Within the
token is a global sequence id which is used to then order the messages
later on delivery.
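
The ordering idea above can be sketched as a small simulation.  This is
only an illustration, not the totem implementation: the Node class, the
rotate_token function and the max_per_visit limit are hypothetical names
chosen for the example, and real totem also handles retransmission,
membership and flow control, which are left out here.

```python
import random
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.outbox = deque()   # payloads waiting for the token
        self.received = []      # (seq, sender, payload) in arrival order

    def deliver(self):
        # Deliver in sequence-id order, regardless of arrival order.
        return sorted(self.received)


def rotate_token(nodes, rotations, max_per_visit=2, seed=0):
    """Pass the token around the ring; only the holder may multicast."""
    rng = random.Random(seed)
    seq = 0                     # global sequence id carried in the token
    in_flight = []
    for _ in range(rotations):
        for holder in nodes:
            for _ in range(min(max_per_visit, len(holder.outbox))):
                seq += 1
                in_flight.append((seq, holder.name, holder.outbox.popleft()))
    for node in nodes:
        arrivals = in_flight[:]
        rng.shuffle(arrivals)   # simulate a different arrival order per node
        node.received.extend(arrivals)
    return seq


nodes = [Node("a"), Node("b"), Node("c")]
for node in nodes:
    node.outbox.extend(f"{node.name}-lock-{i}" for i in range(3))
last_seq = rotate_token(nodes, rotations=2)
```

Because every message is stamped with the sequence id from the token,
each node ends up with the same delivery order even though the
multicasts arrive in a different order at each node.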

In our first implementation the token would race around the ring over
and over even if there were no messages to be sent.  This resulted in
terrible performance of the system.  The totem implementation in openais
now implements token braking which was not published in the original
totem paper.  If no messages are sent on a ring within a configurable
number of token rotations, the token rotation slows until processors
must send messages, at which point the token picks back up.  This
reduces CPU utilization on unused rings to less than 0.1% as measured by
top.
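
A rough sketch of what such braking logic might look like follows.  The
actual algorithm in openais is not described here beyond the paragraph
above, so the threshold, step and maximum delay values, as well as the
function names, are assumptions made purely for illustration.

```python
def update_idle(idle_rotations, messages_sent_this_rotation):
    """Count consecutive rotations in which nothing was multicast."""
    return 0 if messages_sent_this_rotation else idle_rotations + 1


def next_token_delay(idle_rotations, threshold=4, step_ms=1.0, max_ms=50.0):
    """Return how long (in ms) to hold the token before forwarding it.

    All parameters are hypothetical: below the threshold the token is
    forwarded immediately; above it the hold time grows linearly up to
    a cap, so an idle ring stops spinning at full speed.
    """
    if idle_rotations < threshold:
        return 0.0
    return min(max_ms, (idle_rotations - threshold + 1) * step_ms)
```

As soon as any processor sends a message, update_idle resets the counter
to zero and the token is forwarded without delay again.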

The guarantees are not free.  There is a cost.  But consider the ever
popular lock service example:

Implementation A:
Sends a 64 byte request to a server over TCP/IP and receives a response
from a node.
This is limited by the physical performance of Ethernet to about 1600
exchanges per second.  Some time ago I wrote a program which actually
tests the performance of such an operation.  If you're interested in
seeing it, I can see if I can find it.

Implementation B:
Send a 64 byte request to all nodes in the cluster.  The operation is
executed as soon as delivered by totem.

My machines are all about 2.8 GHz with the aisexec messaging process
mlocked and running at sched_rr:1 priority without any load except for
the aisexec executive daemon and the cpgbench program.

In this example, I have a benchmark program called "cpgbench" which uses
the closed process groups API and writes 64 byte message contents (the
message header adds some overhead, which is transmitted but not counted
in the benchmark results) and counts the number of messages delivered in
agreed order in a 10 second window.

In a cluster with 3 processors with one node sending, the results are:
[root@bigbits test]# ./cpgbench
200521 messages received    64 bytes per write  10.000 Seconds runtime
20051.643 TP/s   1.283 MB/s.

In a cluster with 3 processors (I work at home and only have 3 nodes at
the moment) with each node sending, the results are:
[root@slickdeal test]# ./cpgbench
415137 messages received    64 bytes per write   9.993 Seconds runtime
41540.959 TP/s   2.659 MB/s.

41540 > 1600 possible lock or unlock operations per second at the cost
of the token rotating around the ring.  In most "high performance"
applications the token rotation cost is minimal compared to the sendmsg
and recvmsg operations of the multicast messages themselves (since there
are many multicast messages at each node, but only one unicast message
is sent to transmit the token).
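
The figures quoted above can be checked with a few lines of arithmetic.
This sketch assumes that cpgbench reports MB as 10^6 bytes of payload
(that assumption reproduces the printed MB/s values) and takes the 1600
exchanges per second figure for request/response as given.

```python
def throughput(messages, payload_bytes, seconds):
    """Return (messages per second, payload MB per second)."""
    tps = messages / seconds
    mb_per_s = tps * payload_bytes / 1_000_000  # MB taken as 10^6 bytes
    return tps, mb_per_s


one_sender = throughput(200521, 64, 10.000)
all_senders = throughput(415137, 64, 9.993)

request_response_rate = 1600                   # exchanges per second
round_trip_us = 1_000_000 / request_response_rate
speedup = all_senders[0] / request_response_rate

print(f"one sender:  {one_sender[0]:.0f} msg/s, {one_sender[1]:.3f} MB/s")
print(f"all senders: {all_senders[0]:.0f} msg/s, {all_senders[1]:.3f} MB/s")
print(f"request/response round trip: {round_trip_us:.0f} us")
print(f"speedup over request/response: {speedup:.1f}x")
```

With all three nodes sending, the message rate works out to roughly 26
times the request/response rate, and 1600 exchanges per second
corresponds to about 625 microseconds per round trip.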

I am certain that if I had an application that was using samba and the
file system performed at 20x speed, I'd be happy to spend a few extra
CPU cycles on sending the token.  I don't know if using TDB in this way
would improve performance 20x, but if its main purpose is to lock certain
records, it will perform vastly better than request/response.

Also openais's version of totem implements the totem redundant ring
protocol which allows using multiple hardware interfaces to improve
performance and availability.

Regards
-steve
