Samba, nmbd, HA

Sun Feb 21 10:35:03 GMT 1999

Nicolas,

On Sat, 30 Jan 1999 10:28:03 +1100, Nicolas Williams wrote:

> - When one server takes over the other, it takes over the failed host's
>   disks, IP addresses and other necessary resources, then starts or
>   updates various services so as to present, to the clients, the
>   illusion that the failed hosts is still there.

That's exactly what we do in our HA cluster.

Each machine can be in one of three states: O, B and OB, meaning "Original",
"Backup" and "Original+Backup". I'm not so happy with the terms Original and
Backup because they imply that one machine only acts as a kind of warm standby,
which is not the case. Instead, O has one set of functions, B has another set,
and in the backup case the surviving machine has to provide both sets of
functionality.
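
A minimal sketch of that state logic, for illustration only. The function
name ha_state and its "up"/"down" argument are made up here; how a machine
actually detects that its peer has failed is not covered in this mail.

```shell
# Hypothetical helper: print the HA state this machine should be in.
#   $1 = this machine's own role, "O" or "B"
#   $2 = "up" if the peer is alive, "down" if it has failed
ha_state() {
    role=$1
    peer=$2
    case "$role:$peer" in
        O:up|B:up)
            # Peer is alive: provide only our own set of functions.
            echo "$role"
            ;;
        O:down|B:down)
            # Peer has failed: provide both sets of functions.
            echo "OB"
            ;;
        *)
            echo "usage: ha_state O|B up|down" >&2
            return 1
            ;;
    esac
}
```

For example, "ha_state B down" prints OB, and "ha_state O up" prints O.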

We did not bother with all this nmbd and interface stuff.

Instead we did as follows:

file smb.conf.machine1:
	...
	all the shares on machine1
	...
file smb.conf.machine2:
	...
	all the shares on machine2
	...

file smb.conf.O:
	include = /path/smb.conf.machine1
file smb.conf.B:
	include = /path/smb.conf.machine2
file smb.conf.OB:
	include = /path/smb.conf.machine1
	include = /path/smb.conf.machine2

And now for the trick:

file smb.conf:
	...
	include = /path/smb.conf.HA.state
	...

In our HA state-change procedure we do something like:

	echo "include = /path/smb.conf."`get_HA_status` >/path/smb.conf.HA.state
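
A slightly more defensive variant of the same idea, as a sketch only. The
function name write_ha_state and the CONF_DIR variable are assumptions for
illustration; get_HA_status is whatever command reports the current state,
as in the one-liner above. The file is written under a temporary name and
then renamed, so an smbd starting at that moment does not read a
half-written include line.

```shell
# Hypothetical helper: write the include line for the given HA state.
#   $1 = HA state: O, B or OB
#   $2 = path of the state file (e.g. /path/smb.conf.HA.state)
# CONF_DIR (assumed variable) is the directory holding smb.conf.O|B|OB;
# it defaults to the placeholder /path used in this mail.
write_ha_state() {
    state=$1
    file=$2
    case "$state" in
        O|B|OB) ;;
        *)
            echo "unknown HA state: $state" >&2
            return 1
            ;;
    esac
    tmp="$file.tmp.$$"
    # Write to a temporary file in the same directory, then rename it
    # into place; rename within one filesystem replaces the old file
    # in a single step.
    printf 'include = %s/smb.conf.%s\n' "${CONF_DIR:-/path}" "$state" > "$tmp" &&
        mv "$tmp" "$file"
}
```

It would be called as: write_ha_state "`get_HA_status`" /path/smb.conf.HA.state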

When one machine fails and the other one takes over (state transition from O or B
to OB), the file smb.conf.HA.state gets rewritten.

When O or B goes down, all connections between clients and the failed machine
break (naturally :-). The clients will have to reconnect (which they normally do
silently). OB brings up the interface of the now dead machine and sends an
"ARP reply broadcast" from its new interface so everyone in the segment learns the
new MAC address.
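
One way to send such an announcement, as a sketch. It assumes the arping
tool from iputils is installed (its -A option sends unsolicited ARP
replies); this mail does not say which tool is used on the cluster. The
function name announce_ip and the ARPING override variable are made up for
illustration.

```shell
# Hypothetical helper: announce a taken-over IP address on the segment.
#   $1 = interface that now carries the address
#   $2 = the IP address taken over from the failed machine
# Assumes iputils arping: -A = send ARP replies, -c = packet count,
# -I = outgoing interface. ARPING can be set to use another binary.
announce_ip() {
    iface=$1
    ip=$2
    ${ARPING:-arping} -A -c 3 -I "$iface" "$ip"
}
```

For example: announce_ip eth0 192.0.2.10 (interface name and address are
placeholders). This usually has to run as root.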

From OB's point of view, clients (re)connecting to the failed machine's address
simply open new connections. In this case smbd will re-read smb.conf, stumble over
smb.conf.HA.state and have the shares of the failed machine available.

Now how to get back from OB to O and B on different machines?

We decided not to hand functionality back during production hours. The failed
machine gets repaired and then waits to get its functions back. We wait until
the users have left the office and then switch the state back.

There are issues like "who is local browse master". We handle this one the easy
way: just let them fight for it, running one machine with os level = 65 and the
other with os level = 64. Other issues are handled with the same smb.conf.O|B|OB
mechanism.

Regards,
        Robert

-- 
---------------------------------------------------------------
Robert.Dahlem at gmx.net
Radio Bornheim - 2:2461/332 at fidonet +49-69-4930830  (ZyX, V34)
                 2:2461/326 at fidonet +49-69-94414444 (ISDN X.75)
---------------------------------------------------------------