CTDB crash - trying to allocate > 256kB of memory to pulldb_data

Michael Adam obnox at samba.org
Wed Jan 18 02:53:15 MST 2012


Hi Orlando,

it seems that you are suffering from broken vacuuming code.
The git-snapshot of CTDB that you are using does not contain
any of the important vacuuming fixes of the past year.

I suggest that you either:

1. use latest bugfix version 1.0.114.4 from here:
   http://ftp.sernet.de/pub/ctdb/

2. or build a 1.2-based version from one of the branches 1.2 or
   1.2.39 (not sure which is better).

3. or if you really would like to build from master (why?), then do a
   fresh build from git://git.samba.org/ctdb.git (master branch)

I suggest sticking with 1.0.114.4 for now.

BTW, 256MB is in fact the limit of talloc for a single allocation.
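
If you want to see that cap in isolation, a tiny test program against
libtalloc (just a sketch, nothing CTDB-specific; build with
"gcc test.c $(pkg-config --cflags --libs talloc)") shows it:

  #include <stdio.h>
  #include <talloc.h>

  /* Sketch: talloc rejects any single allocation of 256MB or more,
   * which is why a pulldb buffer that grows past that point fails. */
  int main(void)
  {
          TALLOC_CTX *ctx = talloc_new(NULL);
          void *small = talloc_size(ctx, 255 * 1024 * 1024); /* under the cap */
          void *big   = talloc_size(ctx, 256 * 1024 * 1024); /* at the cap */

          printf("255MB: %s\n", small ? "ok" : "NULL");
          printf("256MB: %s\n", big ? "ok" : "NULL");

          talloc_free(ctx);
          return 0;
  }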

Another question: what kind of domain setup do you have
to get an idmap2.tdb of 340 MB? :-o

For both of the big dbs, locking.tdb and idmap2.tdb, it would be
interesting to see how many keys they contain:

$ tdbdump /path/to/local_tdb_copy.tdb.1 | grep ^key | wc -l
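
If tdbdump takes too long on files that size, a NULL traverse via the
tdb API gives the same count (again just a sketch; link against libtdb,
e.g. "gcc count.c $(pkg-config --cflags --libs tdb)"):

  #include <stdio.h>
  #include <fcntl.h>
  #include <tdb.h>

  /* Sketch: tdb_traverse_read() with a NULL callback walks the database
   * read-only and just returns the number of records it visited. */
  int main(int argc, char **argv)
  {
          struct tdb_context *tdb;

          if (argc != 2) {
                  fprintf(stderr, "usage: %s <local tdb copy>\n", argv[0]);
                  return 1;
          }

          tdb = tdb_open(argv[1], 0, TDB_DEFAULT, O_RDONLY, 0);
          if (tdb == NULL) {
                  fprintf(stderr, "tdb_open %s failed\n", argv[1]);
                  return 1;
          }

          printf("%d records\n", tdb_traverse_read(tdb, NULL, NULL));
          tdb_close(tdb);
          return 0;
  }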

Cheers - Michael


Orlando Richards wrote:
> On 17/01/12 17:59, Christian Ambach wrote:
> >On 01/17/2012 10:24 AM, Orlando Richards wrote:
> >>Uhh - just did my maths again. That should be > 256 MB!
> >>
> >>
> >Looks like you have large databases and the host was under memory
> >pressure, so the database records could not be transferred during a
> >recovery because there was not enough memory left for ctdb. Can you have
> >a look at the sizes of the TDB files in the CTDB database directory
> >(/var/ctdb/) and see if there is one that is really large?
> >
> >Cheers,
> >Christian
> >
> 
> 
> Hi Christian,
> 
> Yup - we have a couple of large databases:
> 
> -rw-r--r-- 1 root root 315M Jan 18 09:10 /var/ctdb/locking.tdb.1
> -rw-r--r-- 1 root root 340M Jan 18 08:50 /var/ctdb/persistent/idmap2.tdb.1
> 
> After doing some more digging, I've found that we were getting timeouts 
> on the vacuuming processes for locking.tdb leading up to the failures:
> 
>  ctdbd: Vacuuming child process timed out for db locking.tdb
> 
> and also some of these (though less consistently):
>  ctdbd: ./lib/tevent/tevent_util.c:110 Handling event took 6 seconds!
> 
> 
> We were starting to get these again while I was investigating, so I did 
> a manual vacuum and got:
> 
> # ctdb vacuum
> Found 676 records for lmaster 1 in 'notify_onelevel.tdb'
> Deleted 676 records out of 676 on this node from 'notify_onelevel.tdb'
> Found 775033 records for lmaster 1 in 'locking.tdb'
> Deleted 774884 records out of 774887 on this node from 'locking.tdb'
> Found 4965 records for lmaster 1 in 'brlock.tdb'
> Deleted 4964 records out of 4965 on this node from 'brlock.tdb'
> Found 2358 records for lmaster 1 in 'connections.tdb'
> Deleted 2358 records out of 2358 on this node from 'connections.tdb'
> Found 801 records for lmaster 1 in 'sessionid.tdb'
> Deleted 801 records out of 801 on this node from 'sessionid.tdb'
> Found 209 records for lmaster 1 in 'account_policy.tdb'
> Deleted 209 records out of 209 on this node from 'account_policy.tdb'
> Found 1 records for lmaster 1 in 'passdb.tdb'
> Deleted 1 records out of 1 on this node from 'passdb.tdb'
> 
> That seems like a lot of records being vacuumed in locking.tdb, 
> especially when compared with the default limit of 5000 records per 
> normal vacuum run. After running the vacuum manually, the syslog 
> messages about timeouts ceased.
> 
> So I've increased the vacuuming timeout to 120 seconds (from the 
> default 30), and the record limit to 20,000 (from the default 5000).
> 
> I've constructed a scenario in my head along the lines of:
>  1. vacuuming of locking.tdb starts timing out for some reason and the 
> amount of info held in memory starts growing unchecked. Subsequent 
> vacuums now can't cope with the increased load, and continually fail.
>  2. it breaches 256MB of in-memory locking.tdb records, and crashes CTDB
>  3. the failover node tries to pull all of this locking.tdb data into 
> memory during recovery, breaches the 256MB limit and crashes
> 
> Does that sound plausible?
> 
> Do the tweaks to the limits I've applied seem reasonable? Anything else 
> I should consider?
> 
> Many thanks,
> -- 
> Orlando
> 