using virtual synchrony for CTDB

Fri Oct 6 18:50:08 GMT 2006

Tridge, comments inline.

On Fri, 2006-10-06 at 18:40 +1000, tridge at samba.org wrote:
> Steven,
> 
> Thanks for your comments!
> 
>  > I have a suggestion to use virtual synchrony for the transport mechanism
>  > of CTDB.  I think you will find using something like TCPIP unsuitable
>  > for a variety of reasons.
> 
> Can you point me at any performance numbers for this? Transport
> mechanisms that provide nice guarantees like message ordering and
> group delivery also tend to pay a significant performance cost. I
> don't expect we will need either of those features in CTDB, so I'd
> very much like to avoid any performance cost that might come along
> with it.
> 
> The sort of performance numbers I'm interested in are like the ones
> you can get from 'netpipe', showing round trip times for varying
> message sizes. For CTDB we don't need any multicast capabilities, so
> that side of the performance problem isn't interesting.
> 

I posted some performance numbers in another message to the list.

The point of virtual synchrony is to entirely avoid any kind of round
trip operation.  In the example below, we see that we never have to
respond with an "I got the lock" message because every node knows who
got the lock, so why bother responding?  The only reasons an app waits
for a response are:
1) to proceed with operations (not needed with virtual synchrony)
2) to ensure the target got the message in the first place
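
As a rough sketch of reason #1 going away (Python, purely illustrative,
not CTDB or openais code): if every node is delivered the same ordered
stream of lock requests, each node can work out the lock owner locally.
The "first request in the agreed order wins" rule and the names
lock_owner and agreed_order are assumptions made for this sketch.

```python
def lock_owner(delivered):
    """Map each resource to the node whose request was delivered first.

    `delivered` is the agreed, ordered list of (sender, resource) lock
    requests. The sketch assumes the transport hands the identical list
    to every node.
    """
    owners = {}
    for sender, resource in delivered:
        # Assumed rule: the first request in the agreed order wins.
        owners.setdefault(resource, sender)
    return owners


# Every node sees the same order, so every node computes the same owner
# without anyone sending a reply.
agreed_order = [("A", "R1"), ("B", "R1"), ("C", "R1")]
views = {node: lock_owner(agreed_order) for node in ("A", "B", "C")}
```

Each entry in views is computed independently per node, yet all three
agree, which is why no acknowledgement is needed to proceed.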

This point #2 brings up some interesting capabilities of virtual
synchrony which I did not mention in my first message.

Consider an example where processors A, B and C send messages m1, m2,
m3 and m4.  At some point during sending these
messages processor C fails.  This failure is marked by a configuration
change c1 (which comes to the application as a callback operation).

Example of something that can happen:
A: m1 m2 m3 c1 m4
B: m1 m2 m3 c1 m4
C: m1 m2 m3 (failed)

Now each surviving process knows that after m3 there was a failure of
node C and can do whatever it chooses to do about it.  Because each node
has the same stream of messages in its new configuration, they can all
make the same decisions about how to recover from the failure of node C.
Further, there is no worry about a lost message, i.e. it is not possible
for this to happen:
A: m1 m2 m3 c1 m4
B: m1 m2 c1 m4
C: m1 m2 m3 (failed)

In the above example message m3 appears lost to processor B.  But in
fact processor A has a copy of the message and can give it back to
processor B.  So virtual synchrony requires that processor B recover any
lost messages it can (which it can from A) before delivering any
configuration changes.

Virtual synchrony requires that configuration changes come in the same
order relative to messages on all nodes.  So for example, this wouldn't
be possible:

A: m1 m2 m3 c1 m4
B: m1 m2 c1 m3 m4
C: m1 m2 m3 (failed)

So this solves issue #2, the need for round trip replies.
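
The three delivery examples above can be checked mechanically.  This is
only a sketch in Python with made-up helper names (survivors_agree,
recover_missing); it models the property described here, not the totem
implementation.

```python
def survivors_agree(histories, survivors):
    """True if all surviving nodes delivered the identical sequence of
    messages and configuration changes."""
    reference = histories[survivors[0]]
    return all(histories[node] == reference for node in survivors)


def recover_missing(local, peer):
    """Fill in messages a node missed, using a peer's copy.

    Assumes `local` is a prefix of `peer` (the node simply stopped
    short), as in the m3 example above.
    """
    return local + peer[len(local):]


A = ["m1", "m2", "m3", "c1", "m4"]

# Allowed: A and B see the same stream, with c1 at the same position.
ok = survivors_agree({"A": A, "B": list(A)}, ["A", "B"])

# Not allowed: B never delivers m3.
lost = survivors_agree({"A": A, "B": ["m1", "m2", "c1", "m4"]}, ["A", "B"])

# Not allowed: B delivers c1 before m3.
reordered = survivors_agree(
    {"A": A, "B": ["m1", "m2", "c1", "m3", "m4"]}, ["A", "B"]
)

# Before delivering c1, B recovers m3 from A's copy.
b_before_c1 = recover_missing(["m1", "m2"], ["m1", "m2", "m3"])
```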

>  > node A B C all want to lock a resource R1 at about the same time.  They
>  > all send messages A sends m_lr1A (lock resource 1), B sends m_lr1B, C
>  > sends m_lr1C
>  > 
>  > In this case it would be possible for a variety of scenarios to occur
>  > A receives m_lr1A, m_lr1B, m_lr1C
>  > B receives m_lr1A, m_lr1B, m_lr1C
>  > C receives m_lr1A, m_lr1B, m_lr1C
> 
> I need to explain a bit more in the CTDB document about locking. I
> actually expect we will end up with no remote locking at all, avoiding
> it by using CTDB_REQ_CONDITIONAL_APPEND calls. This changes the
> characteristics of the communication rather a lot :)
> 

I'm not sure how it would be possible to lock a global resource without
communicating with all nodes that the resource should be locked.  One
way used in the current kernel DLM is to have "resource masters" where
control of locking/unlocking a resource is located on one particular
node that has an affinity for the resource.  This works well in the
common case of no lock contention (because there are no round trip
replies) but does not work well in the case of lock contention (because
off-affinity nodes must request the lock from the resource master).
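
A toy model of the resource master idea, in Python.  This is a
simplification for illustration, not the kernel DLM protocol: the class
name ResourceMaster and the message counting (zero messages for the
master itself, one request plus one reply for any other node) are
assumptions of the sketch.

```python
class ResourceMaster:
    """Toy lock master: one node holds the lock state for a resource."""

    def __init__(self, master_node):
        self.master_node = master_node
        self.holder = None   # node currently holding the lock, if any
        self.messages = 0    # network messages exchanged so far

    def lock(self, node):
        if node != self.master_node:
            # Off-affinity node: request to the master plus its reply,
            # i.e. one round trip (assumed cost model).
            self.messages += 2
        if self.holder is None:
            self.holder = node
            return True
        return False


r1 = ResourceMaster(master_node="A")
granted_a = r1.lock("A")   # master locks locally, no messages
granted_b = r1.lock("B")   # contended: B pays a round trip and is refused
```

In this model the uncontended case on the master costs nothing on the
wire, while every attempt from another node costs a round trip.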

>  > Finally totem is available as a shared linkable library for use directly
>  > in other applications.  It requires the use of the poll syscall so a
>  > poll abstraction is provided with timers in this minimal configuration
>  > option.
> 
> I presume it could also use epoll or select? I'd like to use epoll
> where available.
> 

It does not use epoll at the moment, although that is something the
project has looked at.  With no users of the totem library other than
openais, there is little need for us to improve on poll, since poll
performs well for a small set of descriptors, which fits the openais
model well.
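
For what a poll abstraction with timers looks like in general, here is
a minimal sketch in Python.  It is not totem's API; PollLoop and its
methods are hypothetical names.  Python's selectors.DefaultSelector
picks the most efficient mechanism the platform offers (epoll on Linux,
falling back to poll or select elsewhere).

```python
import heapq
import selectors
import socket
import time


class PollLoop:
    """Minimal poll abstraction with one-shot timers (hypothetical)."""

    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.timers = []   # heap of (deadline, sequence, callback)
        self.seq = 0       # tie-breaker so callbacks are never compared

    def add_fd(self, fileobj, callback):
        self.selector.register(fileobj, selectors.EVENT_READ, callback)

    def add_timer(self, delay, callback):
        self.seq += 1
        deadline = time.monotonic() + delay
        heapq.heappush(self.timers, (deadline, self.seq, callback))

    def run_once(self):
        # Sleep no longer than the time left until the earliest timer.
        timeout = None
        if self.timers:
            timeout = max(0.0, self.timers[0][0] - time.monotonic())
        for key, _events in self.selector.select(timeout):
            key.data(key.fileobj)
        # Fire every timer whose deadline has passed.
        now = time.monotonic()
        while self.timers and self.timers[0][0] <= now:
            _deadline, _seq, callback = heapq.heappop(self.timers)
            callback()


events = []
left, right = socket.socketpair()
loop = PollLoop()
loop.add_fd(right, lambda sock: events.append(sock.recv(16)))
loop.add_timer(0.01, lambda: events.append("timer"))
left.send(b"m1")
while "timer" not in events:
    loop.run_once()
left.close()
right.close()
```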

If you are interested in virtual synchrony as a transport, I would
definitely suggest implementing the "dispatcher" portion of the
architecture as a service handler plugin to openais.

Regards
-steve

> Cheers, Tridge