CTDB: Split brain and banning

David Disseldorp ddiss at samba.org
Wed Oct 31 16:32:20 UTC 2018


On Wed, 31 Oct 2018 16:55:30 +0100, Michel Buijsman via samba-technical wrote:

> On Wed, Oct 31, 2018 at 04:37:14PM +0100, David Disseldorp via samba-technical wrote:
> > It appears that ctdbd doesn't gracefully handle cases where the recovery
> > master goes down holding the reclock and standby nodes can't immediately
> > obtain the reclock following election. Your reclock helper lock_duration
> > setting of "30" means that the standby nodes may need to wait up to 30
> > seconds before obtaining the recovery lock.  
> 
> Yeah I'd just found that out as well, was just about to mail you.
> 
> > If you specify a lock_duration of "5" and set RecoveryBanPeriod=5, does
> > your cluster return to OK ~5 seconds after master outage?  
> 
> How does ElectionTimeout play into this?

I don't think ElectionTimeout has any influence in this case, as the
elections are proceeding without delay. It's the newly elected
recmaster that runs into problems when it can't immediately obtain the
recovery lock.
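
To put rough numbers on the reclock wait, here is a minimal Python
sketch. It is a toy timing model, not CTDB code: it assumes the stale
lock expires lock_duration seconds after the dead holder's last renewal,
and that the new recmaster retries at a fixed interval. The function
names and the retry interval are hypothetical, for illustration only.

```python
import math


def remaining_lock_wait(lock_duration_s, since_last_renewal_s=0.0):
    """Seconds until a stale lock left by a dead holder expires.

    Assumption: the lock expires lock_duration_s after its last renewal.
    The worst case (holder died right after renewing) is lock_duration_s.
    """
    if not 0 <= since_last_renewal_s <= lock_duration_s:
        raise ValueError("since_last_renewal_s must be within the lock duration")
    return lock_duration_s - since_last_renewal_s


def failed_attempts(wait_s, retry_interval_s):
    """Count lock attempts that fail before the stale lock expires.

    Assumption: attempts happen at t = 0, retry_interval_s, 2 * retry_interval_s, ...
    and an attempt at exactly t == wait_s succeeds.
    """
    if retry_interval_s <= 0:
        raise ValueError("retry_interval_s must be positive")
    return math.ceil(wait_s / retry_interval_s)


if __name__ == "__main__":
    for duration in (30, 10, 5):
        wait = remaining_lock_wait(duration)
        print(duration, wait, failed_attempts(wait, retry_interval_s=5))
```

Under these assumptions, a shorter lock_duration directly reduces how
many lock attempts the new recmaster can fail after a master outage.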

> I've set that higher on the reasoning that it should be at least as
> long as the mutex helper timeout. Both are at 10s now and I haven't
> been able to trigger the bans yet today.

Sounds promising. Can you provide the logs in this working case too?
Feel free to attach them to a bugzilla.samba.org ticket.

Cheers, David


