[CTDB] strange fork bomb bug

Mathieu Parent math.parent at gmail.com
Fri Mar 7 05:59:49 MST 2014


Hello Amitay,

See the attached complete log.
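
For a quick first read of a log like this, the interval between
recoveries can be estimated from the "Freeze priority 1" lines. The
following is only a minimal sketch: it assumes the line layout seen in
the excerpt quoted below (timestamp, pid in brackets, message), and the
names freeze_times and freeze_intervals are made up for illustration,
not part of ctdb.

```python
import re
from datetime import datetime

# Assumed layout, taken from the quoted excerpt:
# 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \[\s*(\d+)\]: (.*)$"
)


def freeze_times(lines):
    """Return the timestamp of every 'Freeze priority 1' line."""
    times = []
    for line in lines:
        # Drop any mail quote prefix ("> ", ">> ") before matching.
        m = LINE_RE.match(line.lstrip("> ").strip())
        if m and m.group(3) == "Freeze priority 1":
            times.append(datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f"))
    return times


def freeze_intervals(lines):
    """Return the seconds between consecutive 'Freeze priority 1' lines."""
    times = freeze_times(lines)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Lines copied from the quoted log excerpt.
SAMPLE = [
    "2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1",
    "2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2",
    "2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1",
    "2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1",
    "2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1",
]

if __name__ == "__main__":
    for seconds in freeze_intervals(SAMPLE):
        print(f"{seconds:.3f} s since previous freeze")
```

On the quoted excerpt the freezes are a little over ten seconds apart,
which matches the "continuous recoveries" reading.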

10.15.70.1 is in /etc/ctdb/nodes and not in /etc/ctdb/public_addresses.
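
The same check can be scripted. This is a rough sketch, assuming the
usual file layouts (one address per line in nodes; "IP/mask [interface]"
per line in public_addresses). The helper names and every sample address
other than 10.15.70.1 are hypothetical.

```python
def parse_nodes(text):
    """Addresses from a nodes file: one per line; blanks and '#' lines skipped."""
    ips = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ips.append(line.split()[0])
    return ips


def parse_public_addresses(text):
    """Addresses from a public_addresses file: 'IP/mask [iface]' per line."""
    ips = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Keep only the address part, dropping the netmask length.
            ips.append(line.split()[0].split("/")[0])
    return ips


def classify(ip, nodes_text, public_text):
    """Say which of the two files mention the given address."""
    in_nodes = ip in parse_nodes(nodes_text)
    in_public = ip in parse_public_addresses(public_text)
    if in_nodes and in_public:
        return "both"
    if in_nodes:
        return "nodes only"
    if in_public:
        return "public_addresses only"
    return "neither"


# Hypothetical file contents; only 10.15.70.1 comes from this thread.
NODES = "10.15.70.1\n10.15.70.2\n"
PUBLIC = "10.15.80.1/24 eth0\n10.15.80.2/24 eth0\n"

if __name__ == "__main__":
    print(classify("10.15.70.1", NODES, PUBLIC))
```

On a real node the two strings would be read from /etc/ctdb/nodes and
/etc/ctdb/public_addresses instead of the inline samples.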


2014-03-07 13:44 GMT+01:00 Amitay Isaacs <amitay at gmail.com>:
> On Fri, Mar 7, 2014 at 10:23 PM, Mathieu Parent <math.parent at gmail.com>
> wrote:
>>
>> Hello ctdb devs,
>>
>> We recently saw strange ctdb behavior on an 8-node cluster: each
>> node had 192 ctdbd processes (instead of the usual 2), each using
>> about 1024 file descriptors (the default Linux limit), mostly
>> :pipe and :socket. It was then hard to connect via SSH, and even the
>> process table looked corrupted. The solution was to stop, then kill,
>> ctdbd.
>>
>> It seems that when the ctdbd child is blocked, the parent creates a
>> new one without cleaning up the older one, until hitting resource
>> limits.
>>
>> This was on an old version (Debian 1.12+git20120201-4); I wonder
>> whether this has been fixed since.
>>
>> I'm trying to reproduce the issue (with a hard NFS mount).
>>
>> Regards
>> --
>> Mathieu Parent
>>
>>
>> NB: log.ctdb had:
>>
>> 2014/02/28 17:20:57.489370 [ 3486]: Freeze priority 1
>> 2014/02/28 17:20:57.489984 [ 3486]: Freeze priority 2
>> 2014/02/28 17:20:57.490491 [ 3486]: Freeze priority 3
>> 2014/02/28 17:20:57.772767 [ 3486]: Thawing priority 1
>> 2014/02/28 17:20:57.772808 [ 3486]: Release freeze handler for prio 1
>> 2014/02/28 17:20:57.772831 [ 3486]: Thawing priority 2
>> 2014/02/28 17:20:57.772844 [ 3486]: Release freeze handler for prio 2
>> 2014/02/28 17:20:57.772863 [ 3486]: Thawing priority 3
>> 2014/02/28 17:20:57.772874 [ 3486]: Release freeze handler for prio 3
>> 2014/02/28 17:21:07.955309 [ 3486]: Freeze priority 1
>> 2014/02/28 17:21:07.956174 [ 3486]: Freeze priority 2
>> 2014/02/28 17:21:07.956808 [ 3486]: Freeze priority 3
>> 2014/02/28 17:21:08.222463 [ 3486]: Thawing priority 1
>> 2014/02/28 17:21:08.222525 [ 3486]: Release freeze handler for prio 1
>> 2014/02/28 17:21:08.222561 [ 3486]: Thawing priority 2
>> 2014/02/28 17:21:08.222573 [ 3486]: Release freeze handler for prio 2
>> 2014/02/28 17:21:08.222591 [ 3486]: Thawing priority 3
>> 2014/02/28 17:21:08.222602 [ 3486]: Release freeze handler for prio 3
>> 2014/02/28 17:21:18.406226 [ 3486]: Freeze priority 1
>> 2014/02/28 17:21:18.407113 [ 3486]: Freeze priority 2
>> 2014/02/28 17:21:18.407711 [ 3486]: Freeze priority 3
>> 2014/02/28 17:21:18.675376 [ 3486]: Thawing priority 1
>> 2014/02/28 17:21:18.675427 [ 3486]: Release freeze handler for prio 1
>> 2014/02/28 17:21:18.675450 [ 3486]: Thawing priority 2
>> 2014/02/28 17:21:18.675462 [ 3486]: Release freeze handler for prio 2
>> 2014/02/28 17:21:18.675480 [ 3486]: Thawing priority 3
>> 2014/02/28 17:21:18.675490 [ 3486]: Release freeze handler for prio 3
>> 2014/02/28 17:21:28.858118 [ 3486]: Freeze priority 1
>> 2014/02/28 17:21:28.859022 [ 3486]: Freeze priority 2
>> 2014/02/28 17:21:28.859641 [ 3486]: Freeze priority 3
>> 2014/02/28 17:21:29.121186 [ 3486]: Thawing priority 1
>> 2014/02/28 17:21:29.121239 [ 3486]: Release freeze handler for prio 1
>> 2014/02/28 17:21:29.121262 [ 3486]: Thawing priority 2
>> 2014/02/28 17:21:29.121274 [ 3486]: Release freeze handler for prio 2
>> 2014/02/28 17:21:29.121292 [ 3486]: Thawing priority 3
>> [etc.]
>>
>> and:
>> 2014/02/28 17:33:51.254462 [ 3486]: Monitoring event was cancelled
>> 2014/02/28 17:33:51.254515 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:29991
>> 2014/02/28 17:46:24.428171 [ 3486]: Monitoring event was cancelled
>> 2014/02/28 17:46:24.428239 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:27755
>> 2014/02/28 18:05:13.859866 [ 3486]: Monitoring event was cancelled
>> 2014/02/28 18:05:13.859921 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:8818
>> 2014/02/28 18:23:43.043934 [ 3486]: Monitoring event was cancelled
>> 2014/02/28 18:23:43.043994 [ 3486]: server/eventscript.c:584 Sending SIGTERM to child pid:20176
>> [etc.]
>
>
> Do you have more complete logs?  It appears that something is causing
> continuous recoveries.  Without additional logs it will be difficult to
> figure out what's going on.
>
> Amitay.



-- 
Mathieu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ctdb.log.xz
Type: application/x-xz
Size: 353756 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/samba-technical/attachments/20140307/8865dd88/attachment-0001.bin>

