[Samba] CTDB question about "shared file system"

Robert Buck robert.buck at som.com
Sat Aug 8 11:07:58 UTC 2020

On Sat, Aug 8, 2020 at 2:52 AM Martin Schwenke <martin at meltin.net> wrote:

> Hi Bob,
> On Thu, 6 Aug 2020 06:55:31 -0400, Robert Buck <robert.buck at som.com>
> wrote:
> > And so we've been rereading the doc on the public addresses file. So it
> > may be we have gravely misunderstood the *public_addresses* file; we
> > never read that part of the documentation carefully. The *nodes* file
> > made perfect sense, and the point we missed is that CTDB is using
> > floating (unreserved/unused) addresses and assigning them to a SECOND
> > public interface (aliases). We did not plan a private subnet for the
> > node traffic and a separate public subnet for the client traffic.
> > [...]
> > Here is our mistake... The initial *public_addresses* file had the same
> > addresses as the *nodes* file, containing the private IP addresses
> > assigned by AWS. Not good, right? The error messages shown above were
> > the result.
> Yep, that would definitely cause chaos.  ;-)
> CTDB is really designed to have the node traffic go over a private
> network.  There is no authentication between nodes (other than checking
> that a connecting node is listed in the nodes file) and there is no
> encryption between nodes.  Contents of files will not be transferred
> between nodes by CTDB, but if filenames are sensitive then they could be
> exposed if the nodes are not on a private network.
> In the future we plan to have some authentication between nodes when
> they connect.  Most likely a shared secret used to generate something
> from the nodes file.
> > [...]
> >
> > And after these changes the logs simply have these messages periodically:
> >
> > Disabling takeover runs for 60 seconds
> > Reenabling takeover runs
> >
> > *Is this normal?*
> How frequently are these messages logged?  They should occur as nodes
> join, but should stop after that.  If they continue, are there any clues
> indicating why takeover runs occur?  A takeover run is just what CTDB
> currently calls a recalculation of the floating IP addresses for
> fail-over.
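
For anyone following along, the split described above can be sketched with
two example files. All addresses, the interface name, and the file paths
below are hypothetical (paths vary by install), and the comments are for
illustration only. The *nodes* file holds each node's fixed private address
for inter-node traffic; *public_addresses* holds the floating, client-facing
IPs that CTDB assigns as interface aliases:

```
# /etc/ctdb/nodes -- one fixed private address per node (hypothetical)
10.0.0.11
10.0.0.12
10.0.0.13

# /etc/ctdb/public_addresses -- floating client-facing IPs (hypothetical)
# format: address/netmask interface
192.0.2.101/24 eth1
192.0.2.102/24 eth1
192.0.2.103/24 eth1
```

The key point is that the two files must not overlap: the nodes addresses
stay fixed on the private subnet, while the public addresses are unused
addresses on the client-facing subnet that CTDB is free to move between
nodes.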

Hi Martin, thank you for your helpful feedback; this is great.

Yes, those log messages were occurring exactly once per second.

Then, after several hours, they stopped following these messages in the log:

ctdbd[1220]: node is dead: 0 connected
ctdbd[1220]: Tearing down connection to dead node :0
ctdb-recoverd[1236]: Current recmaster node 0 does not have CAP_RECMASTER,
but we (node 1) have - force an election
ctdbd[1220]: Recovery mode set to ACTIVE
ctdbd[1220]: This node (1) is now the recovery master
ctdb-recoverd[1236]: Election period ended
ctdb-recoverd[1236]: Node:1 was in recovery mode. Start recovery process
ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1347 Starting
ctdb-recoverd[1236]: Attempting to take recovery lock
(!/usr/local/bin/lockctl elect --endpoints REDACTED:2379 SM
ctdbd[1220]: High RECLOCK latency 4.268180s for operation recd reclock
ctdb-recoverd[1236]: Recovery lock taken successfully
ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1422 Recovery
initiated due to problem with node 0
ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1447 Recovery -
created remote databases
ctdb-recoverd[1236]: ../../ctdb/server/ctdb_recoverd.c:1476 Recovery -
updated flags
ctdb-recoverd[1236]: Set recovery_helper to
recover database 0x2ca251cf
Thaw db: smbXsrv_client_global.tdb generation 999520140
Release freeze handle for db smbXsrv_client_global.tdb
19 of 19 databases recovered
Recovery mode set to NORMAL
No nodes available to host public IPs yet
Reenabling recoveries after timeout

Then it's a clean syslog after that.
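
As a sanity check once the cluster settles, the state can be confirmed from
any node with the standard ctdb query commands (a sketch; exact output
varies by CTDB version):

```
ctdb status   # per-node state; healthy nodes show OK
ctdb ip       # which node currently hosts each public address
```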

Thank you!

> peace & happiness,
> martin



NEW YORK, NY 10007
T  (212) 298-9624
