[CTDB] recovery problems
Stefan (metze) Metzmacher
metze at samba.org
Mon Sep 6 01:04:08 MDT 2010
My fixes for the following split brain problem seem to work:
If the recovery master loses its internal network connection (CTDB and
GPFS), it triggers a takeover run and assigns all public IPs to itself.
Because the public interfaces are still connected, the public addresses
end up assigned twice.
There are a few bugs that trigger this behavior...
The new behavior is that the recovery process does the takeover run only
after verifying the reclock, and unhealthy nodes propagate their unhealthy
status before asking for a takeover run.
If the recovery process can't get the recovery lock, it bans itself.
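The ordering described above can be illustrated with a small sketch. This is not the CTDB code: the struct, the enum and every sketch_* name are hypothetical stand-ins, and the reclock check is reduced to a boolean. It only shows that the takeover run happens after the lock has been verified, and that a node which cannot get the lock bans itself instead of taking over addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one recovery attempt; not a CTDB type. */
enum sketch_recovery_outcome {
    SKETCH_TAKEOVER_DONE,
    SKETCH_SELF_BANNED
};

/* Hypothetical node state, used only for this illustration. */
struct sketch_node {
    bool can_take_reclock; /* would the recovery lock be granted? */
    bool banned;           /* set when the node bans itself */
    bool takeover_ran;     /* set when a takeover run was performed */
};

/* Stand-in for the real reclock verification. */
static bool sketch_verify_reclock(const struct sketch_node *n)
{
    return n->can_take_reclock;
}

/* Verify the reclock first; only then do the takeover run. */
enum sketch_recovery_outcome sketch_do_recovery(struct sketch_node *n)
{
    assert(n != NULL);

    if (!sketch_verify_reclock(n)) {
        /* No lock: assume we are the isolated side and ban ourselves. */
        n->banned = true;
        return SKETCH_SELF_BANNED;
    }

    /* Lock verified: it is now safe to reassign public addresses. */
    n->takeover_ran = true;
    return SKETCH_TAKEOVER_DONE;
}
```

With this ordering, a recovery master that has lost its internal network never reaches the takeover step, so it cannot claim addresses that the rest of the cluster is still serving.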
We had two code paths where we notice that the local node becomes banned:
1. We're banned by the recovery master
2. We banned ourselves
Now we call ctdb_start_freeze() and ctdb_release_all_ips()
consistently in both cases.
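A minimal sketch of that idea, assuming both ban paths can be routed through one handler. The sketch_* functions below are hypothetical stand-ins for ctdb_start_freeze() and ctdb_release_all_ips(); they do not use the real signatures and only record state, so the example shows the control flow and nothing more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Who issued the ban; hypothetical enum for this illustration. */
enum sketch_ban_source {
    SKETCH_BANNED_BY_RECMASTER,
    SKETCH_BANNED_BY_SELF
};

/* Hypothetical local node state. */
struct sketch_ban_node {
    bool banned;
    bool frozen;         /* databases frozen */
    int public_ips_held; /* number of public addresses currently held */
};

/* Stand-in for ctdb_start_freeze(): only records the freeze. */
static void sketch_start_freeze(struct sketch_ban_node *n)
{
    n->frozen = true;
}

/* Stand-in for ctdb_release_all_ips(): only drops the counter. */
static void sketch_release_all_ips(struct sketch_ban_node *n)
{
    n->public_ips_held = 0;
}

/* Single handler used by both ban paths, so the cleanup cannot diverge. */
void sketch_local_node_got_banned(struct sketch_ban_node *n,
                                  enum sketch_ban_source why)
{
    assert(n != NULL);
    (void)why; /* the cleanup is the same regardless of who banned us */

    n->banned = true;
    sketch_start_freeze(n);
    sketch_release_all_ips(n);
}
```

Keeping the cleanup in one place means a self-imposed ban releases the public addresses just as a ban from the recovery master does.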
git pull git://git.samba.org/metze/ctdb/wip.git master-reclock-fix
git pull git://git.samba.org/metze/ctdb/wip.git ctdb-1.2-reclock-fix
git pull git://git.samba.org/metze/ctdb/wip.git ctdb-1.0.112-reclock-fix
While debugging this I noticed that the other nodes will stay in
recovery for a very long time.
If GPFS was configured with failureDetectionTime = 10 (the minimum) I
server/ctdb_recover.c:637 Been in recovery mode for too long. Dropping
High RECLOCK latency 93.909031s for operation recd reclock
Do you have any idea how this is fixable?