CTDB in Samba4

Mon Mar 12 09:09:12 GMT 2007

Abhi,

 > The last time we talked about $SUBJECT, you mentioned that there are 
 > still a few subsystems that need to be ported to the new CTDB design. Is 
 > the code in a position to be tested with a cluster? If yes, can you 
 > point me to that, please?

The code as it is now in Samba4 should work on any cluster with the
tcp backend. Only two subsystems use the main ctdb API so far (byte
range locking and messaging), the rest just use the fact that the code
now separates out the "server id" to include a node specific
component.

The main job I would like to get done before SambaXP is to convert the
rest of the database parts of Samba4 to use the ctdb API. Next step is
opendb, followed by notify.

None of the code that has been done so far depends on any specific
cluster filesystem capabilities. As long as the cluster filesystem is
data coherent and the nodes can communicate over TCP/IP, it should
work. Further down the track we will get cluster specific plugins to
ctdb which will know about things like the cluster management APIs, so
ctdb knows when new nodes appear. That doesn't matter for the moment
as ctdb can't yet dynamically change what nodes are in the
cluster. That's an important part of the plan, but it's not done yet.

To test the existing code, see source/cluster/ctdb/example/ in the
samba4 sources and adapt to your local network setup. Right now you
should not expect the performance to be great, as so many things
(particularly opendb) still rely on the "tdb in a shared directory"
model, which scales really badly. That will improve as more subsystems
use the ctdb API.

 > I'm going to work on adapting GFS and RedHat's cluster suite to be
 > compatible with the new clustered samba design.  Do you have a spec
 > for what kind of support will be needed from the cluster filesystem
 > and the cluster management framework?

There is not much to adapt at this stage. If you like, you could work
on developing the parts of ctdb that will need the cluster management
APIs. We don't have a spec for that part of ctdb yet, and I planned on
keeping it fairly simple by using external tools.

My initial plan was to assume there would be an external process that
would interact with whatever cluster management system the cluster
has. That external process would write the node list to a file (much
like the nlist.txt in the current example). The ctdb code would watch
that file via change notify if available, and otherwise would rely on
getting a signal to tell it to reload the file.

Perhaps we will need something more complex later, but I think the
above will get us a long way.
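
As a rough sketch of the reload logic described above (this is not
ctdb code), the consumer side could look something like the following.
The NodeList class, the "one address per line" file format and the use
of mtime polling in place of real change notify are all assumptions
made for illustration.

```python
import os
import signal


class NodeList:
    """Tracks a node list file and reloads it when asked or when it changes."""

    def __init__(self, path):
        self.path = path
        self.nodes = []
        self.mtime = None
        self.reload_requested = False
        self.reload()

    def reload(self):
        # Assumed format: one node address per line; blank lines and
        # lines starting with '#' are ignored.
        with open(self.path) as f:
            self.mtime = os.fstat(f.fileno()).st_mtime_ns
            lines = [line.strip() for line in f]
        self.nodes = [l for l in lines if l and not l.startswith("#")]
        self.reload_requested = False

    def request_reload(self, signum=None, frame=None):
        # Usable as a signal handler: only set a flag here and do the
        # real work later from the main loop.
        self.reload_requested = True

    def poll(self):
        """Reload if signalled or if the file's mtime changed.

        Returns True if the set of nodes is different after reloading.
        """
        changed_on_disk = os.stat(self.path).st_mtime_ns != self.mtime
        if not (self.reload_requested or changed_on_disk):
            return False
        old = self.nodes
        self.reload()
        return self.nodes != old
```

A main loop would register the handler with
signal.signal(signal.SIGHUP, nodelist.request_reload) and call
nodelist.poll() periodically. Where change notify is available the
mtime check could be replaced by a notification on the file.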

 > Are you aware of other cluster filesystem developers who are
 > working to support this?

I've mostly been working with the GPFS people. The main topics of
discussion don't really relate to CTDB, but instead concentrate on
things like statlite(), to try to reduce the impact clusters have on
common filesystem operations. We've also spent a lot of time
discussing how the cluster filesystem could directly handle the
Windows semantic mapping that Samba currently does via mechanisms like
xattr. If GFS already has fast xattr support then that won't be a
concern for you, but if xattr is slow (try a recent dbench with xattr
enabled to test that) then I'd be happy to discuss the alternatives
we've been looking at.

 > You mention an interface to obtain the node-membership. I'm guessing 
 > this is the membership of nodes currently mounting gfs and not the whole 
 > cluster's membership.

yes

 > We don't have this support currently.

ok, I didn't realise that, but it shouldn't really matter. With the
design above we can have a little daemon (perhaps just a python
script?) which acts as the go-between. That daemon would use
/etc/fstab, /proc/mounts or whatever it needs to use to work out what
filesystems are mounted. If there truly is no way to find that out
then it will just have to be a static config, but that would be quite
limiting.
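
A minimal sketch of what the local half of such a go-between script
might do, assuming the standard /proc/mounts layout (device, mount
point, filesystem type, ...). The function names and the "gfs" type
string in the usage note are hypothetical, and how the daemon learns
about mounts on other nodes is deliberately left out. The file is
written via a temporary file and a rename so that a reader never sees
a half-written node list.

```python
import os
import tempfile


def mounted_paths(mounts_text, fstype):
    """Return mount points of the given type from /proc/mounts-style text."""
    paths = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mountpoint fstype options dump pass.
        # Octal escapes such as \040 in mount points are not decoded here.
        if len(fields) >= 3 and fields[2] == fstype:
            paths.append(fields[1])
    return paths


def write_node_list(path, nodes):
    """Replace the node list file atomically (write temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".nlist.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("".join(node + "\n" for node in nodes))
        # os.replace is atomic when source and target are on one filesystem.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

For example, the script could call
mounted_paths(open("/proc/mounts").read(), "gfs") to decide whether
the local node should be listed, then call write_node_list() with the
current membership.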

 > Also, a notification mechanism for nodes joining and leaving is a
 > bit tricky.  I'm hoping to design these with your input.

maybe you could comment on the "daemon modifying a file" approach I
outline above and point out any problems it might have? Perhaps I've
missed something :-)

Cheers, Tridge