CTDB: Split brain and banning

Michel Buijsman michelb at bit.nl
Wed Oct 31 15:55:30 UTC 2018

On Wed, Oct 31, 2018 at 04:37:14PM +0100, David Disseldorp via samba-technical wrote:
> It appears that ctdbd doesn't gracefully handle cases where the recovery
> master goes down holding the reclock and standby nodes can't immediately
> obtain the reclock following election. Your reclock helper lock_duration
> setting of "30" means that the standby nodes may need to wait up to 30
> seconds before obtaining the recovery lock.

Yeah, I'd just found that out as well and was just about to mail you.

> If you specify a lock_duration of "5" and set RecoveryBanPeriod=5, does
> your cluster return to OK ~5 seconds after master outage?

How does ElectionTimeout play into this? I've set that higher on the 
reasoning that it should be at least as long as the mutex helper
timeout. Both are at 10s now and I haven't been able to trigger the
bans yet today.

I've set RecoveryBanPeriod short as well, but I'd rather not get bans
in the first place, to avoid mass VIP migrations.
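
For reference, the tunables discussed above could be written roughly as
follows. This is only a sketch: it assumes a CTDB version that reads
/etc/ctdb/ctdb.tunables, and the RecoveryBanPeriod value of 5 is the one
suggested in the quoted text, not necessarily the value in use on this
cluster. The reclock helper's lock_duration is passed as an argument on
the recovery lock line and is not shown here.

```
# Assumed location: /etc/ctdb/ctdb.tunables
# Kept at least as long as the mutex helper's lock_duration (10s here)
ElectionTimeout=10
# Illustrative value taken from the suggestion quoted above
RecoveryBanPeriod=5
```

The same values can be inspected on a running node with
"ctdb getvar ElectionTimeout" or "ctdb listvars", and changed at runtime
with "ctdb setvar". Runtime changes do not persist across a restart.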

Michel Buijsman
BIT BV | Unix beheer | michelb at bit.nl | 08B90948
