CTDB: Split brain and banning
michelb at bit.nl
Wed Oct 31 15:55:30 UTC 2018
On Wed, Oct 31, 2018 at 04:37:14PM +0100, David Disseldorp via samba-technical wrote:
> It appears that ctdbd doesn't gracefully handle cases where the recovery
> master goes down holding the reclock and standby nodes can't immediately
> obtain the reclock following election. Your reclock helper lock_duration
> setting of "30" means that the standby nodes may need to wait up to 30
> seconds before obtaining the recovery lock.
Yeah, I'd just found that out as well and was just about to mail you.
> If you specify a lock_duration of "5" and set RecoveryBanPeriod=5, does
> your cluster return to OK ~5 seconds after master outage?
How does ElectionTimeout play into this? I've raised it on the
reasoning that it should be at least as long as the mutex helper's
lock_duration. Both are at 10s now, and I haven't been able to trigger
the bans yet today.
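
A minimal sketch of that reasoning, assuming only what is stated in
this thread: a standby may have to wait up to lock_duration seconds
before it can take the recovery lock, so the election timeout should be
no shorter than that. The function names are hypothetical and nothing
here is part of ctdb itself; it only compares two numbers.

```python
def worst_case_reclock_wait(lock_duration_s: float) -> float:
    """Upper bound (seconds) a standby may wait for the old lock to expire.

    Assumption taken from the thread: the wait is bounded by the reclock
    helper's lock_duration.
    """
    return lock_duration_s


def election_covers_lock_expiry(lock_duration_s: float,
                                election_timeout_s: float) -> bool:
    """True if ElectionTimeout is at least as long as the worst-case wait."""
    return election_timeout_s >= worst_case_reclock_wait(lock_duration_s)


if __name__ == "__main__":
    # Values mentioned in the thread: lock_duration of 30 originally,
    # then both lock_duration and ElectionTimeout set to 10.
    # The ElectionTimeout of 5 below is illustrative only.
    for lock_duration, election_timeout in [(30, 10), (10, 10), (10, 5)]:
        ok = election_covers_lock_expiry(lock_duration, election_timeout)
        print(f"lock_duration={lock_duration}s "
              f"ElectionTimeout={election_timeout}s covered={ok}")
```

With both values at 10s the check passes; with a lock_duration of 30
and an ElectionTimeout of 10 it does not.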
I've set RecoveryBanPeriod short as well, but I'd rather avoid bans
in the first place, since they trigger mass VIP migrations.
BIT BV | Unix beheer | michelb at bit.nl | 08B90948