CTDB - the 'Mainz' plan for clustered Samba

Sun Oct 1 06:48:10 GMT 2006

James,

 > Can you elaborate more on the role of the LMASTER? If LMASTER assignment
 > is per-record, how do I figure out which LMASTER to contact to find the
 > DMASTER for a particular record?

It's just a hash on the key for the record. That's what I mean by
this:

  "The LMASTER for a particular record is determined solely by the
  number of virtual nodes in the cluster and the key for the record."

so the simplest algorithm would be

   LMASTER = hash(key) % numnodes
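
As a rough sketch of that in C: the rec_key struct and the choice of
FNV-1a as the hash are assumptions for illustration only. The design
does not name a hash function, and any hash that all nodes agree on
would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key type: a byte buffer plus a length. */
struct rec_key {
    const unsigned char *data;
    size_t len;
};

/* 32-bit FNV-1a. Chosen only for illustration; not part of the design. */
static uint32_t key_hash(struct rec_key key)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < key.len; i++) {
        h ^= key.data[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* LMASTER = hash(key) % numnodes, where numnodes is the number of
 * virtual nodes in the cluster. */
static uint32_t lmaster_for_key(struct rec_key key, uint32_t numnodes)
{
    assert(numnodes > 0);
    return key_hash(key) % numnodes;
}
```

Every node computes the same answer from the key and the node count
alone, so no lookup message is needed to find the LMASTER.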

 > If an extended LTDB record contains a DMASTER field, and a node
 > will redirect if it is not the DMASTER, why is the LMASTER
 > necessary?

The LMASTER for a record always knows who the current DMASTER is for
that record. Without it you could be chasing around the cluster for a
long time looking for the record.

Also note that any redirect will only be to something other than
the LMASTER if the node doing the redirection has some good reason to
think it knows who the DMASTER is for the record (like for example if
it just handed over control of the record a few milliseconds ago to
that node).
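
The redirect rule could be sketched like this in C. All the names here
(record_view, route_request, NO_HINT) are hypothetical, and what counts
as a "good reason" to trust a hint is left as policy; this only shows
the order of the decisions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_HINT UINT32_MAX   /* hypothetical "no idea who the DMASTER is" */

/* What this node knows locally about one record (hypothetical layout). */
struct record_view {
    bool     is_dmaster;     /* this node currently holds the record */
    uint32_t dmaster_hint;   /* node we believe is DMASTER, or NO_HINT */
};

enum action { SERVE_LOCALLY, REDIRECT };

struct decision {
    enum action action;
    uint32_t    target;      /* node that should handle the request */
};

static struct decision route_request(uint32_t self, uint32_t lmaster,
                                     struct record_view v)
{
    struct decision d;

    if (v.is_dmaster) {
        /* We hold the record, so handle the request here. */
        d.action = SERVE_LOCALLY;
        d.target = self;
    } else if (v.dmaster_hint != NO_HINT) {
        /* We have a reason to think we know the DMASTER, for example
         * we just handed the record over to that node. */
        d.action = REDIRECT;
        d.target = v.dmaster_hint;
    } else {
        /* Otherwise fall back to the LMASTER. The LMASTER itself
         * always knows the DMASTER, so it never reaches this branch. */
        assert(self != lmaster);
        d.action = REDIRECT;
        d.target = lmaster;
    }
    return d;
}
```

Since the fallback is always the LMASTER, a request with no useful hint
reaches the DMASTER in at most two hops.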

 > Is this a UDP or TCP transport?

Deliberately not specified :)

On GPFS we will probably be using a low level hardware messaging
transport, aiming for maybe 2-5 usec latency. On gigabit you could use
either TCP or UDP. I haven't yet specified if the transport needs to
be reliable or not. I would like it to support unreliable transports,
but I need to work through all the possible messages that could be
lost and see how well that will work.

I will be prototyping this with either TCP or UDP, or possibly even
MPI, but I expect to allow people to plug in other transports.
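
One way a pluggable transport might look in C is a small table of
function pointers. This is only a sketch under my own assumptions:
transport_ops, its fields and the loopback backend are invented for
illustration and are not an interface the design has settled on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transport interface. A TCP, UDP, MPI or hardware
 * messaging backend would each fill in one of these. */
struct transport_ops {
    const char *name;
    int reliable;   /* nonzero if delivery is guaranteed */
    int (*send)(void *priv, uint32_t dest_node,
                const void *buf, size_t len);
};

/* Trivial in-memory backend that records the last message sent. */
struct loopback_state {
    uint32_t      last_dest;
    unsigned char buf[256];
    size_t        len;
};

static int loopback_send(void *priv, uint32_t dest_node,
                         const void *buf, size_t len)
{
    struct loopback_state *st = priv;

    assert(st != NULL);
    if (len > sizeof(st->buf)) {
        return -1;              /* message too large for this backend */
    }
    memcpy(st->buf, buf, len);
    st->len = len;
    st->last_dest = dest_node;
    return 0;
}

static const struct transport_ops loopback_ops = {
    "loopback", 1, loopback_send
};
```

The reliable flag is there because the question of whether unreliable
transports can be supported is still open; the layer above could use it
to decide whether it needs its own retransmission.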

 > > A node that detects one of these conditions starts the recovery
 > > process. It immediately stops processing normal CTDB messages and
 > > sends a message to all nodes starting a global recovery. I have not
 > > yet worked out the precise nature of these messages (that should
 > > appear in a later version of this document), but some basics are
 > > clear:
 > 
 > So node A cares if B goes away iff B is the DMASTER of a record it wants
 > to access, right? 

or if B is the LMASTER and A tries to send it a message. Or if A gets
a message from some other node indicating that a global recovery
process has started.

 > The VNN map is stored in the cluster filesystem. If the reason we
 > started recovery is because the cluster started recovery we aren't going
 > to get very far by depending on data in the shared filesystem.

Storing the VNN map in the shared filesystem is just a convenience. It
makes sense for the setups we envisage, but for other setups it might
be better kept on something like an LDAP server. It isn't a critical
feature of the design.

 > Have you considered how management information would be exchanged? eg.
 > you probably want to make sure that all nodes have the same smb.conf
 > configuration.

I'm deliberately leaving that sort of thing out. Obviously the
prototype would need to implement those sorts of things, but I don't
want to tie this design to one method of configuration.

That's part of the whole "do it at the tdb level" approach. This will
be a library on top of tdb which has absolutely nothing to do with
Samba itself. Samba will then use this library, and will call
ctdb_set_XXX() functions to set up things like where to get the VNN
map, how to work out what nodes are in the cluster, etc.

Cheers, Tridge