Patch to support Scalable CTDB

Amitay Isaacs amitay at gmail.com
Fri Apr 27 08:37:36 UTC 2018


Hi Partha,

On Fri, Apr 27, 2018 at 10:14 AM, Partha Sarathi via samba-technical
<samba-technical at lists.samba.org> wrote:
> Hi,
>
> We have a requirement to support a large cluster, i.e. scaling from a
> 20-node cluster to a 50-node cluster, and the current CTDB design may
> not support linear scaling of nodes in the cluster, as the replication
> of all lock-related TDBs and the TDB traversal for every record may
> slow down performance.

There are many areas in CTDB that require work to improve scalability
to a large number of nodes.  Many of those improvements are on the
roadmap.

<shameless promo>
One of the major improvements is to split the monolithic daemon into
separate daemons.  Martin and I have been doing lots of groundwork to
get to a point where we can start introducing separate daemons.  There
will be lots of patches appearing on the mailing list soon to that
effect.  This will eventually get us to leaner database daemon(s).
</shameless promo>

>  The product requirement is to create a large number of nodes, say 50, in a
> cluster, subgrouping them into multiple protocol heads of
> three to five nodes each.  Each of these protocol-head groups will host
> a specific set of shares, not all of them.  So we took the approach of
> creating two instances of ctdbd on each node.
>
> 1) The primary ctdbd (the persistent ctdbd) is responsible just for
> replicating the persistent TDBs across the whole cluster (in our case 50
> nodes), to maintain the AD-registered account details and to support a
> single global namespace across the large cluster.
>
> 2) The secondary instance (the locking ctdbd) is responsible for
> replicating and traversing the lock-related TDBs within the protocol-head
> group, in that way reducing the latency of TDB transactions (expensive
> when the number of nodes is large) by communicating only within the
> limited group of nodes.
>
> 3) smbd has been changed so that it communicates with these two
> instances of ctdbd over different ctdbd sockets.  The messaging
> initialization and the connection handling have been taken care of.
>
> To have the above ctdbd instances running independently, they are
> configured separately and listen on different CTDB ports (4379 and 4380,
> respectively).
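
As an illustrative sketch of the setup described above, the following C
fragment shows what holding two independent ctdbd connections looks like
at the plain socket level.  The socket paths and the connect_ctdbd()
helper are hypothetical and stand in for the real smbd plumbing; the
actual socket locations depend on how each ctdbd instance is configured.

/*
 * Minimal sketch: one Unix-domain socket connection per ctdbd instance.
 * The socket paths are hypothetical examples.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static int connect_ctdbd(const char *sockpath)
{
        struct sockaddr_un addr;
        int fd;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1) {
                return -1;
        }

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sockpath, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
                close(fd);
                return -1;
        }
        return fd;
}

int main(void)
{
        /* Hypothetical socket paths, one per ctdbd instance. */
        int persistent_fd = connect_ctdbd("/var/run/ctdb/ctdbd.socket");
        int locking_fd = connect_ctdbd("/var/run/ctdb/ctdbd-locking.socket");

        if (persistent_fd == -1 || locking_fd == -1) {
                fprintf(stderr, "failed to connect to a ctdbd instance\n");
                return 1;
        }

        /* ... issue database requests on the appropriate connection ... */

        close(persistent_fd);
        close(locking_fd);
        return 0;
}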

It's an interesting hack.  But I would not recommend running multiple
instances of the ctdb daemon.  Among the many reasons is that the ctdb
daemon still runs at real-time priority, and you definitely don't want
multiple user-space daemons running at real-time priority.  Additionally,
two ctdb instances create unnecessary network overhead by doubling the
communication, one set for each of the two separate ctdb clusters.

One approach for solving this problem would be VNNMAP groups.

VNNMAP is a collection of nodes which participate in database
activity.  Even though it's applicable to both the persistent and the
volatile databases, it has more effect on the volatile databases.
Volatile databases are the distributed temporary databases (e.g.
locking.tdb).  Currently all the nodes are in a single VNNMAP.
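
To make the role of the VNNMAP concrete, here is a rough sketch (not the
actual CTDB source) of how the location master for a volatile record is
derived from it; the struct layout and record_hash() are simplified
stand-ins for CTDB's internal types and key hash.

#include <stddef.h>
#include <stdint.h>

struct vnn_map {
        uint32_t generation;   /* bumped whenever the map changes */
        uint32_t size;         /* number of nodes in the map */
        uint32_t *map;         /* PNNs of the participating nodes */
};

static uint32_t record_hash(const uint8_t *key, size_t len)
{
        /* FNV-1a style placeholder for the real key hash. */
        uint32_t h = 2166136261u;
        size_t i;

        for (i = 0; i < len; i++) {
                h = (h ^ key[i]) * 16777619u;
        }
        return h;
}

/* Return the PNN of the node acting as location master for this key. */
static uint32_t lmaster_for_key(const struct vnn_map *vnnmap,
                                const uint8_t *key, size_t len)
{
        uint32_t idx = record_hash(key, len) % vnnmap->size;
        return vnnmap->map[idx];
}

Every node consults the same map, so they all agree on which node owns a
given record without any extra communication.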

With VNNMAP groups, we can partition the nodes into groups.  Each group
then maintains the volatile databases independently of the other
groups.  Of course, the samba configuration (share definitions) must be
identical for all the nodes in a group.  Also, samba shares across two
different groups cannot have overlapping file system directories
(unless they are read-only shares).  This should effectively reproduce
the same behaviour you have achieved with two ctdb instances, but
without needing any change in samba.
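
A rough, hypothetical illustration of the VNNMAP-groups idea (this does
not exist in CTDB today): each group keeps its own map, and the location
master for a volatile record is chosen only from the group the local node
belongs to, reusing record_hash() from the sketch above.

#include <stddef.h>
#include <stdint.h>

/* record_hash() as defined in the previous sketch. */
uint32_t record_hash(const uint8_t *key, size_t len);

struct vnn_map_group {
        uint32_t group_id;
        uint32_t size;      /* number of nodes in this group */
        uint32_t *map;      /* PNNs of the group members */
};

static uint32_t lmaster_for_key_in_group(const struct vnn_map_group *group,
                                         const uint8_t *key, size_t len)
{
        uint32_t idx = record_hash(key, len) % group->size;
        return group->map[idx];
}

Record ownership and traversal then stay confined to the group, which is
what limits the cross-node traffic for the volatile databases.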

Amitay.


