connect(/var/lib/ctdb/ctdb.socket) failed: Connection refused

steve steve at steve-ss.com
Sun Jul 27 05:26:56 MDT 2014


On Sun, 2014-07-27 at 20:17 +1000, Martin Schwenke wrote:
> On Thu, 24 Jul 2014 17:47:16 +0200, steve <steve at steve-ss.com> wrote:
> 
> > On Thu, 2014-07-24 at 10:16 +1000, Martin Schwenke wrote:
> 
> > > Just wondering if you worked out what the problem was here or whether
> > > there is something we need to fix...  :-)
> 
> > Yes, thanks. We're now OK on openSUSE: just disable apparmor. 
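> > 
> > (For the record, on the openSUSE nodes that was something like:
> >   sudo systemctl stop apparmor
> >   sudo systemctl disable apparmor
> > though the exact unit name may differ between releases.)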
> 
> That's excellent news!  Thanks for letting us know...
> 
> > But struggling on Ubuntu:
> > 
> > 2 node ctdb 2.5.3 on Ubuntu 14.04 nodes
> > 
> > apparmor torn down and the firewall stopped dead.
> > 
> > The IP takeover is working fine between the nodes:
> > Jul 21 14:12:03 uc1 ctdbd: recoverd:Trigger takeoverrun
> > Jul 21 14:12:03 uc1 ctdbd: recoverd:Takeover run starting
> > Jul 21 14:12:04 uc1 ctdbd: Takeover of IP 192.168.1.81/24 on interface
> > bond0
> > Jul 21 14:12:04 uc1 ctdbd: Takeover of IP 192.168.1.80/24 on interface
> > bond0
> > Jul 21 14:12:05 uc1 ctdbd: Monitoring event was cancelled
> > Jul 21 14:12:05 uc1 ctdbd: recoverd:Takeover run completed successfully
> > Jul 21 14:12:06 uc1 ntpd[3759]: Listen normally on 10 bond0 192.168.1.81
> > UDP 123
> > Jul 21 14:12:06 uc1 ntpd[3759]: Listen normally on 11 bond0 192.168.1.80
> > UDP 123
> > Jul 21 14:12:06 uc1 ntpd[3759]: peers refreshed
> > Jul 21 14:12:06 uc1 ntpd[3759]: new interface(s) found: waking up
> > resolver
> > Jul 21 14:12:08 uc1 ctdbd: monitor event OK - node re-enabled
> > Jul 21 14:12:08 uc1 ctdbd: Node became HEALTHY. Ask recovery master 0 to
> > perform ip reallocation
> > Jul 21 14:12:08 uc1 ctdbd: recoverd:Node 0 has changed flags - now 0x0
> > was 0x2
> > Jul 21 14:12:08 uc1 ctdbd: recoverd:Takeover run starting
> > Jul 21 14:12:09 uc1 ctdbd: recoverd:Takeover run completed successfully
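> > 
> > (We confirmed that from either node with something like:
> >   ctdb ip              # shows which node hosts each public IP
> >   ip addr show bond0   # the takeover addresses appear here)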
> > 
> > but on joining node 1 to the domain, no secrets.tdb is created:
> > 
> > sudo net ads join -UAdministrator
> > Enter Administrator's password:
> > Using short domain name -- ALTEA
> > Joined 'SMBCLUSTER' to dns domain 'altea.site'
> > Not doing automatic DNS update in a clustered setup.
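> > 
> > (The join itself looks fine; something like this should verify it:
> >   sudo net ads testjoin   # checks the machine account against AD
> >   wbinfo -t               # checks the trust secret via winbindd
> > though wbinfo -t can't work here yet, since winbindd never starts.)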
> >  
> > The persistent folder contains only:
> > /usr/local/var/lib/ctdb/persistent
> > -rw------- 1 root root 1310720 jul 21 14:11 ctdb.tdb.0
> > (with ctdb.tdb.1 of the same size on node 2)
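> > 
> > (An easy way to see whether secrets.tdb was ever attached is
> >   ctdb getdbmap
> > which lists every database ctdbd knows about, with its path and a
> > PERSISTENT flag where relevant.)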
> > 
> > /etc/samba/smb.conf
> > [global]
> > workgroup = ALTEA
> > realm = ALTEA.SITE
> > security = ADS
> > kerberos method = secrets only
> > netbios name = SMBCLUSTER
> > winbind enum users = Yes
> > winbind enum groups = Yes
> > winbind use default domain = Yes
> > winbind nss info = rfc2307
> > idmap config * : backend = tdb
> > idmap config * : range = 19900-19999
> > idmap config ALTEA : backend = ad
> > idmap config ALTEA : range = 20000-4000000
> > idmap config ALTEA : schema_mode = rfc2307
> > clustering = Yes
> > ctdbd socket = /usr/local/var/run/ctdb/ctdbd.socket
> > [users]
> > path = /cluster/users
> > read only = No
> > [profiles]
> > path = /cluster/profiles
> > read only = No
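> > 
> > (Given the subject line, something like this should confirm that smbd
> > and ctdbd agree on the socket path:
> >   testparm -sv | grep -i 'ctdbd socket'
> >   ls -l /usr/local/var/run/ctdb/ctdbd.socket
> > A build falling back to a compiled-in default such as
> > /var/lib/ctdb/ctdb.socket could produce errors like the one in the
> > subject.)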
> > 
> > We've tried the stock Ubuntu ctdb 2.5.1, the upstream 2.5.3 and now a
> > 2.5.3 that we built ourselves.
> > 
> > The socket appears fine in the specified location.
> > Why do we get no secrets.tdb created?
> > 
> > In fact, why doesn't Ubuntu produce the same myriad of databases that
> > the working openSUSE version does?
> 
> Are smbd and winbindd running?  If not, please compare the value of
> CTDB_MANAGES_SAMBA and CTDB_MANAGES_WINBIND.  Perhaps these are not set
> to "yes" and the database directory has been cleaned up?  Probably also
> take a look in /var/lib/ctdb/ and make sure the database location
> isn't inconsistent.
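> 
> Something like this should show at a glance whether they were started
> and what the eventscripts did:
> 
>   ps -ef | egrep '[s]mbd|[w]inbindd'
>   grep MANAGES /etc/default/ctdb
>   ctdb scriptstatus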
> 
> peace & happiness,
> martin

Hi
Yeah, we have that:
/etc/default/ctdb

 CTDB_NODES=/etc/ctdb/nodes

# List of public addresses for providing NAS services.  No default.
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses

# What services should CTDB manage?  Default is none.
 CTDB_MANAGES_SAMBA=yes
 CTDB_MANAGES_WINBIND=yes
# CTDB_MANAGES_NFS=yes

# Raise the file descriptor limit for CTDB?
# ulimit -n 10000

# Default is to use the log file below instead of syslog.
# CTDB_LOGFILE=/var/log/log.ctdb
 CTDB_SYSLOG=yes

# Default log level is ERR.  NOTICE is a little more verbose.
CTDB_DEBUGLEVEL=NOTICE

The strange thing is that ctdb seems to be working in two folders:
/var/ctdb and /var/lib/lib/ctdb.
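
(To see which directories the running daemon actually has open,
something like this should work, assuming pidof returns a single
ctdbd PID:

  sudo lsof -p "$(pidof ctdbd)" | grep -i tdb

Every attached database should show up with its full path.)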

Neither smbd nor winbindd is started:

Jul 27 13:20:29 uc1 ctdbd: CTDB starting on node
Jul 27 13:20:29 uc1 ctdbd: Recovery lock file set to "". Disabling
recovery lock checking
Jul 27 13:20:30 uc1 ctdbd: Starting CTDBD (Version 2.5.1) as PID: 5495
Jul 27 13:20:30 uc1 ctdbd: Created PID file /var/run/ctdb/ctdbd.pid
Jul 27 13:20:30 uc1 ctdbd: Set scheduler to SCHED_FIFO
Jul 27 13:20:30 uc1 ctdbd: Starting SYSLOG child process with pid:5496
Jul 27 13:20:30 uc1 ctdbd: Starting SYSLOG daemon with pid:5496
Jul 27 13:20:30 uc1 ctdbd: Set runstate to INIT (1)
Jul 27 13:20:30 uc1 ctdbd: 00.ctdb: awk: line 2: function gensub never
defined
Jul 27 13:20:30 uc1 ctdbd: 00.ctdb: awk: line 2: function gensub never
defined
Jul 27 13:20:30 uc1 ctdbd: 00.ctdb.dpkg-dist: awk: line 2: function
gensub never defined
Jul 27 13:20:30 uc1 ctdbd: 00.ctdb.dpkg-dist: awk: line 2: function
gensub never defined
Jul 27 13:20:32 uc1 ctdbd: Vacuuming is disabled for persistent database
ctdb.tdb
Jul 27 13:20:32 uc1 ctdbd: Attached to database
'/var/lib/lib/ctdb/persistent/ctdb.tdb.0' with flags 0x400
Jul 27 13:20:32 uc1 ctdbd: Freeze priority 1
Jul 27 13:20:32 uc1 ctdbd: Freeze priority 2
Jul 27 13:20:32 uc1 ctdbd: Freeze priority 3
Jul 27 13:20:32 uc1 ctdbd: server/ctdb_takeover.c:3243 Released 0 public
IPs
Jul 27 13:20:32 uc1 ctdbd: Set runstate to SETUP (2)
Jul 27 13:20:33 uc1 ctdbd: Set runstate to FIRST_RECOVERY (3)
Jul 27 13:20:33 uc1 ctdbd: Keepalive monitoring has been started
Jul 27 13:20:33 uc1 ctdbd: Monitoring has been started
Jul 27 13:20:33 uc1 ctdbd: recoverd:monitor_cluster starting
Jul 27 13:20:33 uc1 ctdbd: recoverd:server/ctdb_recoverd.c:3677 Initial
recovery master set - forcing election
Jul 27 13:20:33 uc1 ctdbd: Freeze priority 1
Jul 27 13:20:33 uc1 ctdbd: Freeze priority 2
Jul 27 13:20:33 uc1 ctdbd: Freeze priority 3
Jul 27 13:20:33 uc1 ctdbd: This node (0) is now the recovery master
Jul 27 13:20:34 uc1 ctdbd: This node (0) is no longer the recovery
master
Jul 27 13:20:34 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:36 uc1 ctdbd: message repeated 2 times:
[ CTDB_WAIT_UNTIL_RECOVERED]
Jul 27 13:20:37 uc1 ctdbd: recoverd:server/ctdb_recoverd.c:1139 Election
timed out
Jul 27 13:20:37 uc1 ctdbd: recoverd:server/ctdb_recoverd.c:3699 Current
recmaster node 1 does not have CAP_RECMASTER, but we (node 0) have -
force an election
Jul 27 13:20:37 uc1 ctdbd: Freeze priority 1
Jul 27 13:20:37 uc1 ctdbd: Freeze priority 2
Jul 27 13:20:37 uc1 ctdbd: Freeze priority 3
Jul 27 13:20:37 uc1 ctdbd: This node (0) is now the recovery master
Jul 27 13:20:37 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:37 uc1 ctdbd: This node (0) is no longer the recovery
master
Jul 27 13:20:38 uc1 ctdbd: 192.168.0.10:4379: connected to
192.168.0.11:4379 - 1 connected
Jul 27 13:20:38 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:40 uc1 ctdbd: message repeated 2 times:
[ CTDB_WAIT_UNTIL_RECOVERED]
Jul 27 13:20:40 uc1 ctdbd: recoverd:server/ctdb_recoverd.c:1139 Election
timed out
Jul 27 13:20:40 uc1 ctdbd: recoverd:Initial interface fetched
Jul 27 13:20:40 uc1 ctdbd: recoverd:The interfaces status has changed on
local node 0 - force takeover run
Jul 27 13:20:40 uc1 ctdbd: recoverd:Trigger takeoverrun
Jul 27 13:20:41 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:41 uc1 ctdbd: Freeze priority 1
Jul 27 13:20:41 uc1 ctdbd: Freeze priority 2
Jul 27 13:20:41 uc1 ctdbd: Freeze priority 3
Jul 27 13:20:41 uc1 ctdbd: server/ctdb_recover.c:989 startrecovery
eventscript has been invoked
Jul 27 13:20:42 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:43 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:44 uc1 ctdbd: server/ctdb_monitor.c:485 Node 1 became
healthy - force recovery for startup
Jul 27 13:20:44 uc1 ctdbd: recoverd:Node 1 has changed flags - now 0x0
was 0x2
Jul 27 13:20:44 uc1 ctdbd: server/ctdb_recover.c:612 Recovery mode set
to NORMAL
Jul 27 13:20:44 uc1 ctdbd: Thawing priority 1
Jul 27 13:20:44 uc1 ctdbd: Release freeze handler for prio 1
Jul 27 13:20:44 uc1 ctdbd: Thawing priority 2
Jul 27 13:20:44 uc1 ctdbd: Release freeze handler for prio 2
Jul 27 13:20:44 uc1 ctdbd: Thawing priority 3
Jul 27 13:20:44 uc1 ctdbd: Release freeze handler for prio 3
Jul 27 13:20:44 uc1 ctdbd: recoverd:Disabling takeover runs for 60
seconds
Jul 27 13:20:44 uc1 ctdbd: CTDB_WAIT_UNTIL_RECOVERED
Jul 27 13:20:44 uc1 ctdbd: ctdb_recheck_presistent_health: OK[1] FAIL[0]
Jul 27 13:20:44 uc1 ctdbd: Not yet in startup runstate. Wait one more
second
Jul 27 13:20:45 uc1 ctdbd: Not yet in startup runstate. Wait one more
second
Jul 27 13:20:46 uc1 ctdbd: Recovery has finished
Jul 27 13:20:46 uc1 ctdbd: recoverd:Reenabling takeover runs
Jul 27 13:20:46 uc1 ctdbd: Not yet in startup runstate. Wait one more
second
Jul 27 13:20:47 uc1 ctdbd: Not yet in startup runstate. Wait one more
second
Jul 27 13:20:48 uc1 ctdbd: Set runstate to STARTUP (4)
Jul 27 13:20:48 uc1 ctdbd: Recoveries finished. Running the "startup"
event.
Jul 27 13:20:49 uc1 ctdbd: iface[bond0] has changed it's link status
down => up
Jul 27 13:20:50 uc1 ctdbd: recoverd:Interface bond0 changed state: 0 =>
1
Jul 27 13:20:50 uc1 ctdbd: recoverd:The interfaces status has changed on
local node 0 - force takeover run
Jul 27 13:20:50 uc1 ctdbd: recoverd:Trigger takeoverrun
Jul 27 13:20:52 uc1 ctdbd: 49.winbind: Failed to start winbind
Jul 27 13:20:52 uc1 ctdbd: startup event failed
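
The "function gensub never defined" errors suggest the eventscripts are
running under mawk (Ubuntu's default awk), which lacks gawk's gensub().
Untested, but something along these lines may get the startup event
further:

  sudo apt-get install gawk      # gensub() is a gawk extension
  ctdb scriptstatus              # shows which eventscript failed and why
  CTDB_BASE=/etc/ctdb sh -x /etc/ctdb/events.d/49.winbind startup

The 00.ctdb.dpkg-dist in the log also looks like a leftover package
conffile that ctdbd is executing as an eventscript; moving it out of
events.d should silence the duplicated awk errors.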



