CTDB crash - trying to allocate > 256kB of memory to pulldb_data
Michael Adam
obnox at samba.org
Wed Jan 18 02:53:15 MST 2012
Hi Orlando,
it seems that you are suffering from broken vacuuming code.
The git-snapshot of CTDB that you are using does not contain
any of the important vacuuming fixes of the past year.
I suggest that you either:
1. use latest bugfix version 1.0.114.4 from here:
http://ftp.sernet.de/pub/ctdb/
2. or build a 1.2 based version from one of the branches 1.2 or
1.2.39 (not sure which is better).
3. or if you really would like to build from master (why?), then do a
fresh build from git://git.samba.org/ctdb.git (master branch)
I suggest sticking with 1.0.114.4 for now.
BTW, 256MB is in fact the limit of talloc for a single allocation.
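To put a number on that: a rough sketch of the comparison, assuming
(and this is only an assumption) that the on-disk size of a TDB is a
usable proxy for what a recovery would have to pull into one buffer.
The names TALLOC_LIMIT and exceeds_limit are made up for illustration.

```shell
#!/bin/sh
# 256 MB expressed in bytes (256 * 1024 * 1024 = 268435456).
TALLOC_LIMIT=$((256 * 1024 * 1024))

# exceeds_limit BYTES: succeeds if BYTES is strictly above the limit.
# Hypothetical helper; file size is only a rough proxy for the
# in-memory size of the records.
exceeds_limit() {
    [ "$1" -gt "$TALLOC_LIMIT" ]
}

# Example (GNU stat assumed):
#   exceeds_limit "$(stat -c %s /var/ctdb/locking.tdb.1)" && echo "over 256 MB"
```

By that measure both of your files (315M and 340M) are over the limit.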
Another question: what kind of domain setup do you have
to get an idmap2.tdb of 340 MB ? :-o
For both the big dbs locking and idmap2 it would be interesting
to see how many keys there are in these dbs:
$ tdbdump /path/to/local_tdb_copy.tdb.1 | grep ^key | wc -l
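If you want to run that over several copies, the counting step can be
wrapped in a small helper. This is just a sketch: count_keys is a name
I made up, and it relies on tdbdump printing each record's key on a
line that starts with "key".

```shell
#!/bin/sh
# count_keys: read tdbdump output on stdin and print the number of
# records, counting the lines that start with "key".
count_keys() {
    grep -c '^key'
}

# Example, using the same placeholder path as above:
#   tdbdump /path/to/local_tdb_copy.tdb.1 | count_keys
```

Run it on local copies rather than on the live files.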
Cheers - Michael
Orlando Richards wrote:
> On 17/01/12 17:59, Christian Ambach wrote:
> >On 01/17/2012 10:24 AM, Orlando Richards wrote:
> >>Uhh - just did my maths again. That should be > 256 MB!
> >>
> >>
> >Looks like you have large databases and the host was under memory
> >pressure so the database records could not be transferred during a
> >recovery because there was not enough memory left for ctdb. Can you have
> >a look at the sizes of the TDB files in the CTDB database directory
> >(/var/ctdb/) if there is one that is really large?
> >
> >Cheers,
> >Christian
> >
>
>
> Hi Christian,
>
> Yup - we have a couple of large databases:
>
> -rw-r--r-- 1 root root 315M Jan 18 09:10 /var/ctdb/locking.tdb.1
> -rw-r--r-- 1 root root 340M Jan 18 08:50 /var/ctdb/persistent/idmap2.tdb.1
>
> After doing some more digging, I've found that we were getting time outs
> on the vacuuming processes for locking.tdb leading up to the failures:
>
> ctdbd: Vacuuming child process timed out for db locking.tdb
>
> and also some of these (though less consistently):
> ctdbd: ./lib/tevent/tevent_util.c:110 Handling event took 6 seconds!
>
>
> We were starting to get these again while I was investigating, so I did
> a manual vacuum and got:
>
> # ctdb vacuum
> Found 676 records for lmaster 1 in 'notify_onelevel.tdb'
> Deleted 676 records out of 676 on this node from 'notify_onelevel.tdb'
> Found 775033 records for lmaster 1 in 'locking.tdb'
> Deleted 774884 records out of 774887 on this node from 'locking.tdb'
> Found 4965 records for lmaster 1 in 'brlock.tdb'
> Deleted 4964 records out of 4965 on this node from 'brlock.tdb'
> Found 2358 records for lmaster 1 in 'connections.tdb'
> Deleted 2358 records out of 2358 on this node from 'connections.tdb'
> Found 801 records for lmaster 1 in 'sessionid.tdb'
> Deleted 801 records out of 801 on this node from 'sessionid.tdb'
> Found 209 records for lmaster 1 in 'account_policy.tdb'
> Deleted 209 records out of 209 on this node from 'account_policy.tdb'
> Found 1 records for lmaster 1 in 'passdb.tdb'
> Deleted 1 records out of 1 on this node from 'passdb.tdb'
>
> That seems like a lot of records being vacuumed in locking.tdb,
> especially when compared with the default limit of 5000 records max to
> vacuum in a normal vacuum run. After running the vacuum manually, the
> syslog messages about time outs ceased.
>
> So I've increased the timeout on the vacuuming to 120 seconds (from
> the default 30), and the record limit to 20,000 (from the default 5000).
>
> I've constructed a scenario in my head along the lines of:
> 1. vacuuming of locking.tdb starts timing out for some reason and the
> amount of info held in memory starts growing unchecked. Subsequent
> vacuums now can't cope with the increased load, and continually fail.
> 2. it breaches 256MB of in-memory locking.tdb records, and crashes CTDB
> 3. the failover node tries to load all this in-memory locking.tdb info
> into memory on failover, breaches the 256MB limit and crashes
>
> Does that sound plausible?
>
> Do the tweaks to the limits I've applied seem reasonable? Anything else
> I should consider?
>
> Many thanks,
> --
> Orlando
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.