[CTDB] strange fork bomb bug

Mathieu Parent math.parent at gmail.com
Mon Mar 24 05:12:45 MDT 2014


2014-03-23 21:23 GMT+01:00 David Disseldorp <ddiss at suse.de>:
> Hi Mathieu,

Hello all, Hello David,

> On Fri, 7 Mar 2014 12:23:42 +0100
> Mathieu Parent <math.parent at gmail.com> wrote:
>
>> We had a strange ctdb behavior recently on an 8-node cluster: each
>> node had 192 ctdbd processes (instead of the usual 2), each using
>> 1024 or so file descriptors (the default Linux limit)! Mostly
>> pipes and sockets. It was hard to connect via SSH, and even the
>> process table looked corrupted. The only way out was to stop and
>> then kill ctdbd.
>>
>> It seems that when a ctdbd child is blocked, the parent forks a new
>> one without cleaning up the old one, until it hits the resource
>> limits.
>
> Are the processes all waiting on record locks? I'd suggest looking at
> /proc/locks, and also checking the lockwait metrics displayed with
> "ctdb statistics". We've run into similar issues under record lock
> contention.

I don't know. We restarted the entire cluster after the problem, so I
can no longer check.
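
For context, my understanding is that each lockwait helper is just a
forked child taking a blocking POSIX lock on the TDB, along these
lines (a minimal sketch of the pattern, not the actual ctdb code;
names are mine):

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

static pid_t lockwait_fork(int tdb_fd, off_t chain_off)
{
    pid_t pid = fork();
    if (pid != 0) {
        return pid;            /* parent: child's pid, or -1 on error */
    }

    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = chain_off, /* offset of the contended hash chain */
        .l_len    = 1,
    };
    /* F_SETLKW blocks until the lock is granted; on a contended
     * record the child just sits here, holding its pipe/socket
     * back to the parent. */
    fcntl(tdb_fd, F_SETLKW, &fl);
    _exit(0);
}

That would explain the hundreds of pipe and socket descriptors we saw:
every stuck child keeps its channel to the parent open.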

>> This was on an old version (Debian 1.12+git20120201-4), I wonder if
>> that has been fixed since.
>
> If the processes are all lockwait forks, then consider merging
> "ctdb_lockwait: create overflow queue" and "LockWait congestion" if
> you don't have them already.

Those two patches are already included in 1.12+git20120201.
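
If I read those patches correctly, the idea is to cap the number of
concurrent lockwait children and park further requests on a queue
instead of forking for every one. Roughly (a hypothetical sketch of
the idea, not the patch itself; lockwait_fork() is the helper sketched
earlier):

#include <stddef.h>
#include <sys/types.h>

#define MAX_ACTIVE 100

extern pid_t lockwait_fork(int tdb_fd, off_t chain_off);

struct lw_request {
    struct lw_request *next;
    int tdb_fd;
    off_t chain_off;
};

static int active_children;
static struct lw_request *queue_head, *queue_tail;

static void lockwait_request(struct lw_request *req)
{
    if (active_children >= MAX_ACTIVE) {
        req->next = NULL;                 /* park on the overflow queue */
        if (queue_tail != NULL) {
            queue_tail->next = req;
        } else {
            queue_head = req;
        }
        queue_tail = req;
        return;
    }
    active_children++;
    lockwait_fork(req->tdb_fd, req->chain_off);
}

static void lockwait_child_exited(void)   /* called on SIGCHLD reaping */
{
    active_children--;
    if (queue_head != NULL) {             /* promote the oldest waiter */
        struct lw_request *req = queue_head;
        queue_head = req->next;
        if (queue_head == NULL) {
            queue_tail = NULL;
        }
        lockwait_request(req);
    }
}

So with those patches in place, I would not expect the child count to
grow without bound, which is why our 192 processes puzzle me.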

Also, the process limit is set to 200, ctdbd was stuck at 192
processes, and at least one of the ctdbd processes had all of its 1024
file descriptors in use.
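
For what it's worth, both ceilings can be read back from the daemon's
point of view with getrlimit(); a throwaway check like this shows
which limit would be hit first:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit nproc, nofile;

    getrlimit(RLIMIT_NPROC, &nproc);    /* max processes per user */
    getrlimit(RLIMIT_NOFILE, &nofile);  /* max fds per process    */

    printf("RLIMIT_NPROC  soft=%llu\n",
           (unsigned long long)nproc.rlim_cur);
    printf("RLIMIT_NOFILE soft=%llu\n",
           (unsigned long long)nofile.rlim_cur);
    return 0;
}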

I will provide the logs tomorrow (I'm currently not at work).

I have also backported ctdb 2.5.2 for wheezy and will upgrade the
cluster.


>
> Cheers, David

Cheers

-- 
Mathieu

