CTDB crash - trying to allocate > 256kB of memory to pulldb_data

Wed Jan 18 02:20:04 MST 2012

On 17/01/12 17:59, Christian Ambach wrote:
> On 01/17/2012 10:24 AM, Orlando Richards wrote:
>> Uhh - just did my maths again. That should be > 256 MB!
>>
>>
> Looks like you have large databases and the host was under memory
> pressure so the database records could not be transferred during a
> recovery because there was not enough memory left for ctdb. Can you have
> a look at the sizes of the TDB files in the CTDB database directory
> (/var/ctdb/) if there is one that is really large?
>
> Cheers,
> Christian
>

Hi Christian,

Yup - we have a couple of large databases:

-rw-r--r-- 1 root root 315M Jan 18 09:10 /var/ctdb/locking.tdb.1
-rw-r--r-- 1 root root 340M Jan 18 08:50 /var/ctdb/persistent/idmap2.tdb.1
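
For anyone wanting to repeat that size check on another node, here is a
minimal Python sketch that lists TDB files above a threshold. The
function name large_tdb_files and the 256 MiB default are my own choices
for illustration; /var/ctdb is the directory Christian mentioned.

```python
import os


def large_tdb_files(directory, min_bytes=256 * 1024 * 1024):
    """Return (path, size) pairs for TDB files of at least min_bytes, largest first."""
    hits = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # CTDB appends the node number to the file name, e.g. locking.tdb.1
            if ".tdb" not in name:
                continue
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if size >= min_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in large_tdb_files("/var/ctdb"):
        print(f"{size / 2**20:8.1f} MiB  {path}")
```

The walk is recursive so that files under persistent/ are included too.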

After doing some more digging, I've found that we were getting timeouts
on the vacuuming processes for locking.tdb leading up to the failures:

  ctdbd: Vacuuming child process timed out for db locking.tdb

and also some of these (though less consistently):
  ctdbd: ./lib/tevent/tevent_util.c:110 Handling event took 6 seconds!

We were starting to get these again while I was investigating, so I did 
a manual vacuum and got:

# ctdb vacuum
Found 676 records for lmaster 1 in 'notify_onelevel.tdb'
Deleted 676 records out of 676 on this node from 'notify_onelevel.tdb'
Found 775033 records for lmaster 1 in 'locking.tdb'
Deleted 774884 records out of 774887 on this node from 'locking.tdb'
Found 4965 records for lmaster 1 in 'brlock.tdb'
Deleted 4964 records out of 4965 on this node from 'brlock.tdb'
Found 2358 records for lmaster 1 in 'connections.tdb'
Deleted 2358 records out of 2358 on this node from 'connections.tdb'
Found 801 records for lmaster 1 in 'sessionid.tdb'
Deleted 801 records out of 801 on this node from 'sessionid.tdb'
Found 209 records for lmaster 1 in 'account_policy.tdb'
Deleted 209 records out of 209 on this node from 'account_policy.tdb'
Found 1 records for lmaster 1 in 'passdb.tdb'
Deleted 1 records out of 1 on this node from 'passdb.tdb'
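
To compare databases at a glance, the output above can be condensed into
per-database counts. This is only a sketch: the two regular expressions
are inferred from the lines shown here and may not match other CTDB
versions, and summarise_vacuum is a made-up helper name.

```python
import re

# Patterns inferred from the "ctdb vacuum" output quoted above (an assumption).
FOUND_RE = re.compile(r"Found (\d+) records for lmaster (\d+) in '([^']+)'")
DELETED_RE = re.compile(
    r"Deleted (\d+) records out of (\d+) on this node from '([^']+)'"
)


def summarise_vacuum(output):
    """Map database name -> dict with found, deleted and attempted counts."""
    summary = {}
    for line in output.splitlines():
        line = line.strip()
        m = FOUND_RE.match(line)
        if m:
            summary.setdefault(m.group(3), {})["found"] = int(m.group(1))
            continue
        m = DELETED_RE.match(line)
        if m:
            entry = summary.setdefault(m.group(3), {})
            entry["deleted"] = int(m.group(1))
            entry["attempted"] = int(m.group(2))
    return summary


SAMPLE = """\
Found 775033 records for lmaster 1 in 'locking.tdb'
Deleted 774884 records out of 774887 on this node from 'locking.tdb'
Found 4965 records for lmaster 1 in 'brlock.tdb'
Deleted 4964 records out of 4965 on this node from 'brlock.tdb'
"""

if __name__ == "__main__":
    for db, counts in summarise_vacuum(SAMPLE).items():
        print(db, counts)
```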

That seems like a lot of records being vacuumed in locking.tdb,
especially compared with the default limit of 5000 records per normal
vacuum run. After running the vacuum manually, the syslog messages about
timeouts ceased.

So I've increased the timeout on the vacuuming to 120 seconds (from
the default 30), and the record limit to 20,000 (from the default 5000).
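
As a rough sanity check on the record limit, the arithmetic below
estimates how many vacuum runs it would take to clear the backlog seen in
locking.tdb. It assumes every run processes exactly up to its limit and
that no new dead records pile up in the meantime, which is optimistic;
runs_to_clear is just an illustrative helper.

```python
def runs_to_clear(backlog, limit_per_run):
    """Vacuum runs needed if each run handles at most limit_per_run records."""
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-backlog // limit_per_run)


BACKLOG = 775033  # records found in locking.tdb by the manual vacuum

if __name__ == "__main__":
    for limit in (5000, 20000):
        print(f"limit {limit}: {runs_to_clear(BACKLOG, limit)} runs")
    # limit 5000: 156 runs
    # limit 20000: 39 runs
```

So even at the higher limit, a backlog of this size needs dozens of
successful runs to drain.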

I've constructed a scenario in my head along the lines of:
  1. Vacuuming of locking.tdb starts timing out for some reason, and the
amount of info held in memory starts growing unchecked. Subsequent
vacuums can't cope with the increased load and continually fail.
  2. The in-memory locking.tdb records breach 256MB, and CTDB crashes.
  3. On failover, the failover node tries to load all of this locking.tdb
info into memory, breaches the 256MB limit and crashes as well.

Does that sound plausible?

Do the tweaks to the limits I've applied seem reasonable? Anything else 
I should consider?

Many thanks,
-- 
Orlando

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.