[Samba] Clustered Samba: Every 24 hours "There are Currently No Logon Servers Available"

erik bergsma ebergsma1 at gmail.com
Fri Apr 8 04:21:22 MDT 2011


I have a very weird and annoying problem in my clustered setup: every ~24
hours the Vista clients can't log in, or even unlock their screens anymore.
The error they receive is "There are currently no logon servers available".
This is very odd, because I have 2 Samba 3.5.8 servers available, running
and configured to handle logon requests.

In the meantime, the people who are already logged in can use shares etc.;
the same goes for the Mac users. So my guess is it's a WINS/nmbd/NetBIOS issue:
the clients not being able to resolve my domain name into an IP address.

It is a clustered (CTDB) setup with 2 nodes, based on Gentoo, Samba 3.5.8,
LDAP and GlusterFS.
The setup uses these addresses: the static maintenance IP of node0, the static
IP of node1, the floating/CTDB IP of node1, and the floating/CTDB IP of node0.

node0 has domain master = no, preferred master = no, wins server =
node1 has domain master = auto, preferred master = yes, wins support = yes
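
Spelled out as smb.conf fragments, that corresponds roughly to this (a sketch,
not a verbatim copy; the WINS server address on node0 is left out, as above):

```ini
# node0 ([global] excerpt, sketch)
[global]
    domain master = no
    preferred master = no
    wins support = no
    ; wins server = (address omitted, as in the description above)

# node1 ([global] excerpt, sketch)
[global]
    domain master = auto
    preferred master = yes
    wins support = yes
```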

In the 192.168.100.* subnet there are:
- some other non-Samba Gentoo machines
- a Windows 2k3 server for printing, with no WINS support installed; smbclient
reports it is the master of another domain (there used to be a Gentoo & Samba
3.0 master, but that is switched off now)
- a Windows 2k8 server used for PXE (it is the domain master of an AD domain
used only for the PXE setup, not using any resources of the other 2 domains;
no WINS support installed, no clients)

In the 192.168.9.* to 192.168.14.* subnets there are ~60 Windows Vista/
Windows 7 clients, all statically configured to use and as WINS servers.

What I do to resolve this issue is:
- turn off ctdb & samba on node0
- reboot node0 (because samba deadlocks; see the other discussion)
- start ctdb & samba on node0
- turn off ctdb & samba on node1
- reboot node1 (because samba deadlocks; see the other discussion)
- start ctdb & samba on node1

Only then is the issue resolved and the clients can log in again;
just powering down node0 does not work, even if you restart nmbd on node1
and the log file says it is the master browser and domain master for all the IPs.

I hate doing the reboot thing again and again, because it screws up the
GlusterFS replication, and it is just dirty.

In the past week I had this setting: node0: domain master = auto, preferred
master = auto. I then sometimes saw node1 and node0 arguing over who is the
master of one of the 4 IPs; otherwise the loglevel 1 log files stay pretty
clean. I've now blocked all incoming and outgoing traffic on ports
137, 138 and 139 to and from the 2 Windows machines, just to be safe (and
also because I have become a little desperate :( )
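
For reference, the blocking looks roughly like this (a sketch; the two server
addresses are hypothetical placeholders, substitute the real ones):

```shell
# Hypothetical addresses for the Windows 2k3 and 2k8 servers
W2K3=192.168.100.3
W2K8=192.168.100.8

for host in "$W2K3" "$W2K8"; do
    # NetBIOS name + datagram services (UDP 137,138), session service (TCP 139)
    iptables -A INPUT  -s "$host" -p udp --dport 137:138 -j DROP
    iptables -A INPUT  -s "$host" -p tcp --dport 139     -j DROP
    iptables -A OUTPUT -d "$host" -p udp --dport 137:138 -j DROP
    iptables -A OUTPUT -d "$host" -p tcp --dport 139     -j DROP
done
```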

The other weird thing is that node0 starts 1 nmbd process, while node1
starts 2 of them... but this may be by design.

I have a hunch that there is some rogue WINS server somewhere that likes to
claim it is the domain master of my domain. Does this make sense? Can I
debug this?
Or does somebody have another suggestion for resolving this issue?
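
One way to chase the rogue-WINS hunch is with nmblookup (a sketch; MYDOM
stands in for the actual domain name, and 192.168.100.1 is a placeholder for
the WINS server's address):

```shell
# List the master browsers answering a broadcast on the local subnet
nmblookup -M -- -

# Who claims the domain master browser (<1b>) name for the domain?
nmblookup 'MYDOM#1b'

# Who is registered under the logon server group name (<1c>)?
# This is the record the Vista clients query to find a logon server.
nmblookup 'MYDOM#1c'

# The same queries, sent directly to the WINS server
nmblookup -R -U 192.168.100.1 'MYDOM#1b'
nmblookup -R -U 192.168.100.1 'MYDOM#1c'
```

If anything other than node1 answers for the 1b name, or the 1c answer is
empty or stale around the time the logons start failing, that would point at
a rogue registration.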

Thanks in advance!
