using virtual synchrony for CTDB

Fri Oct 6 20:47:13 GMT 2006

On Fri, 2006-10-06 at 13:15 -0700, Tracy Camp wrote:
> I'm not sure we are talking about the same thing here at all... Wouldn't 
> some form of response be required within the GC layer to guarantee 
> delivery?  I'm not familiar with the GC layer that you mentioned, but that 
> certainly seems to be the logical conclusion and implemented fact in 
> spread (which requires _two_ token rounds to deliver a guaranteed 
> message).  Now if you don't care about delivery guarantees just message 
> ordering, certainly no response is necessary.
> 

To achieve agreed ordering, no response is required.  All
acknowledgements are implicit.  So if a token rotates to a new node with
a sequence number higher than that node has, that node knows it is
missing some messages and requests a retransmission.  So in the data
loss case, a retransmit is required, although this is rare on modern
Ethernet hardware.
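
As a rough illustration of the implicit acknowledgement described above,
here is a minimal Python sketch.  It is not taken from any real group
communication implementation; the names Token, Node, seq and rtr are
hypothetical and the ring is reduced to two nodes in one process.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    seq: int = 0  # highest sequence number assigned so far (assumed field)
    rtr: list = field(default_factory=list)  # sequence numbers to be resent


@dataclass
class Node:
    name: str
    received: dict = field(default_factory=dict)  # seq -> payload
    delivered: list = field(default_factory=list)  # payloads in agreed order
    next_to_deliver: int = 1

    def on_multicast(self, seq, payload):
        self.received[seq] = payload
        # Deliver only while there is no gap, which keeps agreed order.
        while self.next_to_deliver in self.received:
            self.delivered.append(self.received[self.next_to_deliver])
            self.next_to_deliver += 1

    def on_token(self, token):
        # No reply is sent for any message.  The node only compares the
        # token's sequence number with what it already holds.
        missing = [s for s in range(1, token.seq + 1) if s not in self.received]
        for s in missing:
            if s not in token.rtr:
                token.rtr.append(s)
        return missing


a, b = Node("a"), Node("b")
token = Token()
for payload in ["m1", "m2", "m3"]:
    token.seq += 1
    a.on_multicast(token.seq, payload)
    if payload != "m2":  # simulate b losing m2 on the wire
        b.on_multicast(token.seq, payload)

missing = b.on_token(token)  # b notices the gap when the token arrives
for s in missing:
    b.on_multicast(s, a.received[s])  # a retransmits from its copy
    token.rtr.remove(s)
```

In the common case with no loss, on_token() finds nothing missing and the
token simply moves on, so no extra traffic is generated.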

To achieve safe ordering (which means that every processor has a copy of
the message before it is delivered and you are guaranteed that the
message is delivered or the processor fails), the token must rotate
twice as you indicate.  There still is no "response".  The same implicit
acknowledgement is used.  For safe messages, we can ensure a message is
safe when all processors contain a copy of the message.  This "low water
mark" for message sequence id is kept in the token to work as a garbage
collector for messages that can be discarded by the group communication
system.
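
The low water mark can be sketched the same way.  This is a simplified
model under stated assumptions, not the actual protocol: SafeToken,
Member, collecting, low_water and rotate() are hypothetical names, and
each member is assumed to already know the highest sequence number it
holds without gaps.

```python
from dataclasses import dataclass, field


@dataclass
class SafeToken:
    collecting: int  # minimum being gathered during the current rotation
    low_water: int = 0  # minimum published by the previous full rotation


@dataclass
class Member:
    contiguous: int  # highest seq held with no gaps below it
    buffer: dict = field(default_factory=dict)  # seq -> payload kept for resend
    safe_delivered: list = field(default_factory=list)

    def on_token(self, token):
        # Everything at or below the published low water mark is held by
        # all members, so it can be delivered as safe and discarded.
        for seq in sorted(s for s in self.buffer if s <= token.low_water):
            self.safe_delivered.append(self.buffer.pop(seq))
        # Contribute to the minimum for this rotation; no reply is sent.
        token.collecting = min(token.collecting, self.contiguous)


def rotate(token, ring, highest_seq):
    for member in ring:
        member.on_token(token)
    # End of rotation: publish the minimum and start a fresh collection.
    token.low_water = token.collecting
    token.collecting = highest_seq


def msgs(upto):
    return {s: f"m{s}" for s in range(1, upto + 1)}


ring = [Member(3, msgs(3)), Member(3, msgs(3)), Member(2, msgs(2))]
token = SafeToken(collecting=3)

rotate(token, ring, 3)  # first rotation only gathers the minimum
after_first = [list(m.safe_delivered) for m in ring]

rotate(token, ring, 3)  # second rotation delivers what is known safe
```

After the first rotation nothing has been delivered; after the second,
every member has delivered m1 and m2, while m3 stays buffered because
one member does not hold it yet.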

So in short, there is no response.

> It is not clear that CTDB cares about message ordering at all (including 
> the reqid seems sufficient), so I'm not entirely sure what VS would 
> provide here except a way to know if a message was sent in the current 
> view of the cluster or not.
> 

The current design doesn't consider the idea of avoiding the
request/response latency using virtual synchrony.  

> CTDB appears to want a semantic where every node is free to asynchronously 
> issue and handle messages without regard to other nodes issuing or 
> handling messages.  The whole point of the DMASTER, LMASTER concepts

The DMASTER/LMASTER concepts then require a request/response which is
limited in performance to the physical transmission framing rate of
Ethernet.  This means that it would be possible to transfer a record
1600 times per second in the cluster.
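
Taking the 1600 figure above at face value, the time budget per record
transfer works out as follows.  This is only the arithmetic on that one
number, not a measurement.

```python
TRANSFERS_PER_SECOND = 1600  # figure quoted in this message

# Time available for one request plus its response, in microseconds.
budget_us = 1_000_000 / TRANSFERS_PER_SECOND
print(f"{budget_us:.0f} microseconds per record transfer")
```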

>  
> appears to avoid needing to broadcast state to every cluster member.  I 
> can assure you that in large clusters this works out much better.  VS 
> might provide a more elegant way to initiate recoveries than simply 
> relying on message timeouts (and broadcast message semantics in a recovery 
> are handy, though not necessary, since recovery is not unlike RAID-5 
> reconstruction).
> 

Yes, recovery is a huge advantage of VS, and broadcast does make it a lot
easier.

Regards
-steve

> Tracy Camp
> 
> On Fri, 6 Oct 2006, Steven Dake wrote:
> 
> > Tracy,
> >
> > If latency is an issue in messages, any message that has a round-trip
> > response time over an Ethernet medium will have higher latency than those
> > messages that do not have round-trip responses.
> >
> > If no response is required over Ethernet from the server before
> > proceeding to do new operations, then ptp will have less latency than
> > virtual synchrony.
> >
> > The performance problem that vs solves is removing the round trip
> > response time, since every node has a copy of the data and can
> > immediately handle the request and may continue processing as soon as
> > the lock request is delivered (self-delivered) instead of waiting for a
> > response from a server over TCP/IP or some other PTP protocol running
> > over Ethernet.
> >
> > Regards
> > -steve
> >
> > On Fri, 2006-10-06 at 11:09 -0700, Tracy Camp wrote:
> >> Snake oil aside, 'DLM' like clustering schemes, which the CTDB proposal
> >> seems like it could be grouped with, are best implemented with p-t-p
> >> messages for the latency concerns already expressed.  However also using a
> >> VS group communications layer to provide a generation number that can then
> >> be embedded in each P-T-P message provides P-T-P w/o the overhead of VS
> >> for the latency sensitive messages.  Sort of a scheme that breaks the
> >> 'control' apart from the 'data' transports.
> >>
> >> Tracy Camp
> >>
> >> On Fri, 6 Oct 2006, David Boreham wrote:
> >>
> >>> Steven Dake wrote:
> >>>
> >>>> I have a suggestion to use virtual synchrony for the transport mechanism
> >>>> of CTDB.  I think you will find using something like TCP/IP unsuitable
> >>>> for a variety of reasons.
> >>>>
> >>> I'm very far from being a VS expert, but when I looked into a few
> >>> of the open source implementations available a while back it became
> >>> clear (to me at least) that they have a kind of 'snake oil' property
> >>> in that they appear to deliver magical services but do so only by
> >>> using quite inefficient methods underneath the covers. For example
> >>> it appears that one is avoiding network round-trips but in fact to
> >>> implement its delivery guarantees the message middleware layer needs
> >>> to propagate a token around the set of participating nodes which of
> >>> course involves many sends and receives.
> >>>