[CTDB] recovery problems

Mon Sep 6 01:04:08 MDT 2010

Hi Ronnie,

My fixes for the following split brain problem seem to work:
If the recovery master loses its internal network connection (CTDB and
GPFS), it triggers a takeover run and assigns all public IPs to itself.
As the public interfaces are still connected, the public addresses
are then assigned twice.

There are a few bugs that trigger this behavior...

The new behavior is that the recovery process does the takeover run only
after verifying the reclock, and unhealthy nodes propagate their unhealthy
status before asking for a takeover run.

If the recovery process can't get the recovery lock, it bans itself
directly.

We had two code paths where we notice that the local node has become banned:
1. We're banned by the recovery master
2. We banned ourselves

Now we call ctdb_start_freeze() and ctdb_release_all_ips()
consistently in both cases.

http://gitweb.samba.org/?p=metze/ctdb/wip.git;a=shortlog;h=refs/heads/master-reclock-fix
git pull git://git.samba.org/metze/ctdb/wip.git master-reclock-fix

http://gitweb.samba.org/?p=metze/ctdb/wip.git;a=shortlog;h=refs/heads/ctdb-1.2-reclock-fix
git pull git://git.samba.org/metze/ctdb/wip.git ctdb-1.2-reclock-fix

http://gitweb.samba.org/?p=metze/ctdb/wip.git;a=shortlog;h=refs/heads/ctdb-1.0.112-reclock-fix
git pull git://git.samba.org/metze/ctdb/wip.git ctdb-1.0.112-reclock-fix

While debugging this I noticed that the other nodes will stay in
recovery for a very long time.
Even when GPFS was configured with failureDetectionTime = 10 (the minimum), I
still got:

server/ctdb_recover.c:637 Been in recovery mode for too long. Dropping all IPS
...
High RECLOCK latency 93.909031s for operation recd reclock

Do you have any idea how this is fixable?

metze
