CTDB crash - trying to allocate > 256kB of memory to pulldb_data

Orlando Richards orlando.richards at ed.ac.uk
Wed Jan 18 03:23:48 MST 2012


Hi Michael,

On 18/01/12 09:53, Michael Adam wrote:
> Hi Orlando,
>
> it seems that you are suffering from broken vacuuming code.
> The git-snapshot of CTDB that you are using does not contain
> any of the important vacuuming fixes of the past year.
>
> I suggest that you either:
>
> 1. use latest bugfix version 1.0.114.4 from here:
>     http://ftp.sernet.de/pub/ctdb/
>
> 2. or build a 1.2 based version from one of the branches 1.2 or
>     1.2.39 (not sure which is better).
>
> 3. or if you really would like to build from master (why?), then do a
>     fresh build from git://git.samba.org/ctdb.git (master branch)
>
> I suggest sticking with 1.0.114.4 for now.

Ah hah - I think you've just answered another (unasked!) question I had 
- what's the best route to deploying a stable ctdb?

I'd been using the instructions at:
   http://ctdb.samba.org/download.html
(which describes how to build from the master) so I didn't know about 
the sernet builds. Thanks for the info (and thanks to Sernet for 
packaging and hosting!).

We're currently running a build version: 1.10.0.108.g221ecc2.devel-1 - 
how easy would it be to roll back to 1.0.114.4? Would it just be a case 
of shutting down ctdb on the cluster, uninstalling old and installing 
new, and starting up again?

>
> BTW, 256MB is in fact the limit of talloc for a single allocation.
>
> Another question: what kind of domain setup do you have
> to get an idmap2.tdb of 340 MB ? :-o

A bit messy - we have a *lot* of security groups, and a fairly hefty 
user count. Last count, we'd mapped just shy of 5000 groups on the samba 
servers (from our externally held records of group mapping). We don't 
actually know how many security groups exist in the active directory - 
the last time someone tried to count it they gave up after about 20,000. 
We don't have similar external records for user mapping, but from the 
tdb dumps below it looks like we have around 55,000 user accounts mapped.

>
> For both the big dbs locking and idmap2 it would be interesting
> to see how many keys there are in these dbs:
>
> $ tdbdump /path/to/local_tdb_copy.tdb.1 | grep ^key | wc -l
>

[root@host tmp]# tdbdump idmap2.tdb.1 | grep ^key | wc -l
120726
[root@host tmp]# tdbdump locking.tdb.1 | grep ^key | wc -l
2628820
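
As a rough sanity check on those counts, here is a minimal Python sketch that mirrors the "grep ^key | wc -l" step and divides each file size by its key count. It assumes the 340M and 315M sizes reported by ls for these two files are in MiB, and the helper names (count_keys, avg_bytes_per_key) are purely illustrative. The on-disk size also includes the hash table and any dead or free records, so the result is only an upper bound on the live record size.

```python
def count_keys(dump_text: str) -> int:
    """Count lines starting with 'key', like `grep ^key | wc -l`."""
    return sum(1 for line in dump_text.splitlines() if line.startswith("key"))


def avg_bytes_per_key(file_size_mib: float, key_count: int) -> float:
    """Average on-disk bytes per key, assuming the size is given in MiB."""
    return file_size_mib * 1024 * 1024 / key_count


# Figures quoted in this thread: (file size in MiB, key count).
databases = {
    "idmap2.tdb.1": (340, 120726),
    "locking.tdb.1": (315, 2628820),
}

for name, (size_mib, keys) in databases.items():
    print(f"{name}: ~{avg_bytes_per_key(size_mib, keys):.0f} bytes per key")
```

On these figures that works out to roughly 2.9 kB per key for idmap2.tdb.1 and roughly 126 bytes per key for locking.tdb.1, so the idmap2 file looks large relative to its key count, while locking.tdb.1 is large mainly because of the sheer number of keys.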

I assume that locking.tdb.1 is transient data, so could we purge it some 
time? Perhaps out of hours, do a "ctdb wipedb locking.tdb"?

Thanks very much for your help so far,

Orlando.



> Cheers - Michael
>
>
> Orlando Richards wrote:
>> On 17/01/12 17:59, Christian Ambach wrote:
>>> On 01/17/2012 10:24 AM, Orlando Richards wrote:
>>>> Uhh - just did my maths again. That should be > 256 MB!
>>>>
>>>>
>>> Looks like you have large databases and the host was under memory
>>> pressure so the database records could not be transferred during a
>>> recovery because there was not enough memory left for ctdb. Can you have
>>> a look at the sizes of the TDB files in the CTDB database directory
>>> (/var/ctdb/) if there is one that is really large?
>>>
>>> Cheers,
>>> Christian
>>>
>>
>>
>> Hi Christian,
>>
>> Yup - we have a couple of large databases:
>>
>> -rw-r--r-- 1 root root 315M Jan 18 09:10 /var/ctdb/locking.tdb.1
>> -rw-r--r-- 1 root root 340M Jan 18 08:50 /var/ctdb/persistent/idmap2.tdb.1
>>
>> After doing some more digging, I've found that we were getting timeouts
>> on the vacuuming processes for locking.tdb leading up to the failures:
>>
>>   ctdbd: Vacuuming child process timed out for db locking.tdb
>>
>> and also some of these (though less consistently):
>>   ctdbd: ./lib/tevent/tevent_util.c:110 Handling event took 6 seconds!
>>
>>
>> We were starting to get these again while I was investigating, so I did
>> a manual vacuum and got:
>>
>> # ctdb vacuum
>> Found 676 records for lmaster 1 in 'notify_onelevel.tdb'
>> Deleted 676 records out of 676 on this node from 'notify_onelevel.tdb'
>> Found 775033 records for lmaster 1 in 'locking.tdb'
>> Deleted 774884 records out of 774887 on this node from 'locking.tdb'
>> Found 4965 records for lmaster 1 in 'brlock.tdb'
>> Deleted 4964 records out of 4965 on this node from 'brlock.tdb'
>> Found 2358 records for lmaster 1 in 'connections.tdb'
>> Deleted 2358 records out of 2358 on this node from 'connections.tdb'
>> Found 801 records for lmaster 1 in 'sessionid.tdb'
>> Deleted 801 records out of 801 on this node from 'sessionid.tdb'
>> Found 209 records for lmaster 1 in 'account_policy.tdb'
>> Deleted 209 records out of 209 on this node from 'account_policy.tdb'
>> Found 1 records for lmaster 1 in 'passdb.tdb'
>> Deleted 1 records out of 1 on this node from 'passdb.tdb'
>>
>> That seems like a lot of records being vacuumed in locking.tdb,
>> especially when compared with the default limit of 5000 records max to
>> vacuum in a normal vacuum run. After running the vacuum manually, the
>> syslog messages about timeouts ceased.
>>
>> So I've increased the timeout on the vacuuming to 120 seconds (from
>> the default 30), and the record limit to 20,000 (from the default 5000).
>>
>> I've constructed a scenario in my head along the lines of:
>>   1. vacuuming of locking.tdb starts timing out for some reason and the
>> amount of info held in memory starts growing unchecked. Subsequent
>> vacuums now can't cope with the increased load, and continually fail.
>>   2. it breaches 256MB of in-memory locking.tdb records, and crashes CTDB
>>   3. the failover node tries to load all this in-memory locking.tdb info
>> into memory on failover, breaches the 256MB limit and crashes
>>
>> Does that sound plausible?
>>
>> Do the tweaks to the limits I've applied seem reasonable? Anything else
>> I should consider?
>>
>> Many thanks,
>> --
>> Orlando
>>
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.


-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.

