[Linux-ha-dev] CTDB and clustered Samba

Tue Jun 12 16:15:16 GMT 2007

On 2007-06-12T20:42:31, tridge at samba.org wrote:

>  > CTDB is built on top of a cluster filesystem layer.
> not quite - CTDB relies on one being there, but doesn't actually use
> it much itself. CTDB opens exactly 1 file on the cluster fs, that file
> never has any data in it, and CTDB holds exactly 1 byte of lock on
> that file on one of the nodes. That's it. That file is the key to CTDB
> ensuring that it has the same view of the topology of the cluster as
> the filesystem does.

OK. So you're using the CFS as a DLM. Expensive, but possible. ;-)

(That also means that with OCFS2, you can't do that; it doesn't have
cluster-wide flock yet, _but_ it allows you to access its DLM via
dlmfs, so that could be used.)
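
For reference, the single-byte lock described above could look roughly
like the sketch below. It assumes a POSIX fcntl byte-range lock (the
mail does not say which locking call CTDB actually uses), and the
function name and the path argument are made up for illustration.

```python
import fcntl
import os


def take_one_byte_lock(path):
    """Open (creating if needed) an empty file and try to lock its first byte.

    Returns the open file descriptor on success, or None if another
    process already holds the lock. Hypothetical helper, not CTDB code.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Exclusive, non-blocking byte-range lock: 1 byte starting at offset 0.
        # The file itself stays empty; locking past EOF is allowed.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    except OSError:
        os.close(fd)
        return None
    return fd
```

Whether such a lock is honoured cluster-wide depends entirely on the
filesystem underneath, which is the point of the OCFS2 remark.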

> This all means that Samba on a CTDB cluster runs just as fast as
> non-clustered Samba for most operations. That's very hard to achieve
> with the normal approach to clustering.

No, not at all. DLMs usually also cache locks mastered locally.

(I know your lock semantics are very complex, but I'm just sayin' ;-)

> Then I realised _why_ they all are so bad. When you write some data to
> a cluster fs, or a cluster database, and the node you wrote to dies
> immediately after it replies, then the developers of that cluster fs
> want to guarantee that the data you wrote is not lost. That's what
> they mean by a reliable clustering system. 

After an fsync, or with O_DIRECT etc, sure, but otherwise, no.
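
To make the distinction concrete, here is a minimal sketch of a write
that only returns once fsync has completed. The helper name is
hypothetical; a plain write() without the fsync carries no such
promise from the application's point of view.

```python
import os


def durable_write(path, data):
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data to stable storage before returning.
        os.fsync(fd)
    finally:
        os.close(fd)
```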

> How do they provide this guarantee? They either have to ensure the
> data is committed to stable shared storage (pieces of metal or very
> slow shared nv-ram) or they need to replicate the data to all nodes
> before they reply. That is the only way to provide that guarantee.

Well, attaching the page to the DLM reply makes that replication
basically free, at least in terms of latency.

> The thing is, Samba on CTDB does not need that guarantee. Samba needs
> to guarantee that user data is not lost. Samba is quite happy to lose
> certain well defined pieces of the meta-data associated with its
> connections. Knowing exactly what you can safely lose is the key to
> CTDB. It turns out that the data that CTDB can lose is the data that
> is most frequently updated and written, so by designing a system where
> we can lose that data, we remove the biggest bottleneck.

Sure, that makes sense. Can you elaborate on this a bit more?

> you are welcome to try. I spent years trying to get these services to
> get good enough for Samba to use. The problems that cluster
> filesystems try to solve and the problems that Samba+CTDB try to solve
> are different enough that trying to build one on the other doesn't
> work.

OK, I'm not questioning your judgement and experience, I'm just trying
to understand why, and what we might be able to accommodate (as we're
looking at switching commlayers anyway). Or, if you're right, what we
could reuse - as the Linux HA v2 code has a "CIB" (cluster information
base) which is a replicated/distributed db thingy too, maybe we could
reuse parts of the CTDB or something ;-)

> Basically the perfect transport for CTDB is TCP or a similar stream
> transport, and luckily enough that's pretty widely available, so I'm
> happy to use it :-)

TCP in a controlled LAN environment likely performs pretty well, yes.
But you will open N:N connections in your cluster as well, no? Don't you
need some broadcasts?

Or easier - is there a document I can read which outlines the CTDB
requirements?
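
To put a number on the N:N question: if every node kept one TCP
connection to every other node (an assumption for the sake of the
arithmetic, not a statement about what CTDB does), the cluster-wide
connection count grows quadratically. The function name is made up.

```python
def full_mesh_connections(nodes):
    """Number of point-to-point links in a full mesh of `nodes` nodes."""
    if nodes < 0:
        raise ValueError("node count must be non-negative")
    # Each of the N nodes talks to N-1 peers; every link is shared by two nodes.
    return nodes * (nodes - 1) // 2
```

For example, 4 nodes need 6 connections and 8 nodes need 28, with each
node holding N-1 of them.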

> I guess you could write a CTDB backend that uses openAIS messaging
> (the backends are pluggable in CTDB). If someone tries that then I'd
> be interested in hearing how it goes. I'll be quite surprised if it
> does any better than what we do now, and I actually expect it will do
> worse, as it means two layers of event driven programming rather than
> one, and extra layers of marshalling, extra layers of error handling
> etc.

Right, just like TCP on TCP is a bad idea I expect this would be as
well. (Just like the misdesigned approach to try running heartbeat's
native comm layer on top of openAIS; it's the same stacking issue.)

Stacking CTDB's own messaging on top of openAIS is guaranteed to yield
exceptional performance, unfortunately of the negative kind. The
interesting bit might be to instead directly use openAIS as the comm
layer.

Even if that isn't possible, I'd like to understand why, as that would
be valuable experience when assessing different comm layers. Have you
written up your criticism of existing clustering stacks somewhere?

If not, that's a conference paper whose presentation I'd fly a few
miles to attend ;-)

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde