[RFC] tdb_traverse_read_lite()

Thu Mar 7 16:57:52 MST 2013

Volker Lendecke <Volker.Lendecke at SerNet.DE> writes:
> On Thu, Mar 07, 2013 at 04:50:04PM +1100, Rusty Russell wrote:
>> Volker Lendecke <Volker.Lendecke at SerNet.DE> writes:
>> > On Wed, Mar 06, 2013 at 05:58:41PM +1100, Rusty Russell wrote:
>> >> Volker Lendecke <Volker.Lendecke at SerNet.DE> writes:
>> >> > On Tue, Mar 05, 2013 at 01:33:18PM +0100, Stefan (metze) Metzmacher wrote:
>> >> >> I'd also prefer a chain traverse function, that could be used
>> >> >> in a lot of places to reduce the cleanup costs.
>> >> 
>> >> I'm not so sure... what would we want to wait for?  Our dbs these days
>> >> have huge hashsizes, so this kind of traversal doesn't block much
>> >> activity.  But benchmarks will show...
>> >
>> > For vacuuming it might be nice to spend only a few percent
>> > of the time on it under high stress. In certain situations
>> > we can afford to just leave it for a while to wait for the
>> > storm to end. It might be a lot easier to figure out that
>> > the "time slice" for vacuuming has ended while we are doing
>> > it instead of doing a precalculation. Sure, it is possible
>> > to prematurely stop the traverse_read_lite, but it is not
>> > possible to pick up where we left it. If you are concerned
>> > about exposing the hash chain number (we do already expose
>> > tdb_hash_size()), would it be okay to add an API to
>> > "traverse beginning with the hash chain for this key"?
>> 
>> Another option, which Amitay suggested, was to do non-blocking
>> chainlocks and skip over chains which we can't lock.  This risks never
>> vacuuming high-traffic chains, so we'd need to do blocking locks
>> sometimes.
>> 
>> So how about combining the approaches, like this.  Instead of
>> tdb_traverse_read_lite():
>> 
>> /* Returns -1 on error, or chain number it reached. */
>> int tdb_traverse_read_nonblock(struct tdb_context *tdb, int start, int end,
>>                                tdb_traverse_func fn, void *private_data);
>> 
>> int tdb_chainlock_read_bynum(struct tdb_context *tdb, int chain);
>> int tdb_chainunlock_read_bynum(struct tdb_context *tdb, int chain);
>> 
>> This leaves the heuristic up to ctdb: it can just skip over chains
>> returned by tdb_traverse_read_nonblock(), or it can use
>> tdb_chainlock_read_bynum() to block on them (maybe if the same chain
>> gets skipped multiple times?).
>
> Yep, that would certainly also work fine.

OK, I'll see if Amitay has any luck with the existing patch first, then
I'll spin something which does this.
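
Very roughly, the caller-side heuristic could look like the sketch below.
None of this is real tdb code: the two *_stub() functions are hypothetical
stand-ins for the proposed calls, the chain_busy[] array fakes lock
contention, and NUM_CHAINS and SKIP_LIMIT are made-up values.  It also
assumes that "chain number it reached" means the first chain the
non-blocking traverse could not lock, or end if it got all the way through.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CHAINS 8   /* toy hash size (hypothetical) */
#define SKIP_LIMIT 2   /* skips tolerated before we block (hypothetical) */

static bool chain_busy[NUM_CHAINS];  /* fake contention: true = lock would block */
static int skip_count[NUM_CHAINS];   /* how often each chain has been skipped */
static int vacuumed[NUM_CHAINS];     /* how often each chain has been processed */

/*
 * Stand-in for the proposed tdb_traverse_read_nonblock().
 * ASSUMPTION: processes chains in [start, end) and returns the first
 * chain it could not lock without blocking, or end if none was busy.
 */
static int traverse_read_nonblock_stub(int start, int end)
{
	assert(start >= 0 && end <= NUM_CHAINS);
	for (int chain = start; chain < end; chain++) {
		if (chain_busy[chain])
			return chain;
		vacuumed[chain]++;
	}
	return end;
}

/*
 * Stand-in for tdb_chainlock_read_bynum(), the per-chain work, and
 * tdb_chainunlock_read_bynum().  A real version would block here
 * until the chain lock is granted.
 */
static void vacuum_chain_blocking_stub(int chain)
{
	assert(chain >= 0 && chain < NUM_CHAINS);
	vacuumed[chain]++;
}

/*
 * One vacuuming pass over all chains.  Busy chains are skipped, but a
 * chain skipped more than SKIP_LIMIT times is taken with a blocking
 * lock so that high-traffic chains are not starved forever.
 * Returns the number of chains skipped in this pass, or -1 on error.
 */
static int vacuum_pass(void)
{
	int skipped = 0;
	int chain = 0;

	while (chain < NUM_CHAINS) {
		int reached = traverse_read_nonblock_stub(chain, NUM_CHAINS);

		if (reached < 0)
			return -1;
		if (reached >= NUM_CHAINS)
			break;	/* got through the rest without contention */

		if (++skip_count[reached] > SKIP_LIMIT) {
			vacuum_chain_blocking_stub(reached);
			skip_count[reached] = 0;
		} else {
			skipped++;
		}
		chain = reached + 1;	/* resume after the contended chain */
	}
	return skipped;
}
```

With chain 3 marked busy, calling vacuum_pass() three times skips chain 3
on the first two passes and takes it with the blocking path on the third.
Where the skip threshold lives and how it is tuned is exactly the part
that would be left to ctdb.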

Thanks!
Rusty.