Latest TDB2 design and code...

Thu Sep 9 06:21:37 MDT 2010

To repeat my old man complaints again,

I would like to see snapshots and I would like these two properties:
* snapshots that are cheap to create and delete
* computing a delta (to be replayed) from snapshot A to snapshot C
should be cheap.

Now, if snapshots are all read-only, new snapshots can only be taken
off "master" and snapshots can only be deleted oldest-first, or if
memory/space goes low, snapshots are automatiocally deleted. I can
live with that.

But I really really really would like to see
cheap read-only snapshots, and cheap compute delta between two snapshots.

And now, get off my lawn, kids.

regards
ronnie sahlberg

On Thu, Sep 9, 2010 at 9:26 PM, Rusty Russell <rusty at rustcorp.com.au> wrote:
> The TDB2 code is still a Work In Progress; the basics are there but
> no transactions yet and lots of FIXMEs and cleanups to go.
>
> BTW, you can always find the latest code and design document in CCAN:
>        http://ccan.ozlabs.org/info/tdb2.html
>
> Here's the design text diff form the last one I posted; sorry it's
> not all that readable.  Thanks to Metze and Volker for their feedback...
>
> PDF (from LyX with diff tracking) attached.
>
> Thanks,
> Rusty.
>
> --- design-1.3.txt      2010-05-10 21:27:35.000000000 +0930
> +++ design.txt  2010-09-09 16:53:53.000000000 +0930
> @@ -2,7 +2,7 @@
>
>  Rusty Russell, IBM Corporation
>
> -27-April-2010
> +9-September-2010
>
>  Abstract
>
> @@ -277,7 +277,9 @@
>
>  The aim is that building tdb with -DTDB_PTHREAD will result in a
>  pthread-safe version of the library, and otherwise no overhead
> -will exist.
> +will exist. Alternatively, a hooking mechanism similar to that
> +proposed for [Proposed-Solution-locking-hook] could be used to
> +enable pthread locking at runtime.
>
>  2.8 *_nonblock Functions And *_mark Functions Expose
>   Implementation
> @@ -473,6 +475,50 @@
>  Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but
>  see [TDB_CLEAR_IF_FIRST-Imposes-Performance].
>
> +2.15 Extending The Header Is Difficult
> +
> +We have reserved (zeroed) words in the TDB header, which can be
> +used for future features. If the future features are compulsory,
> +the version number must be updated to prevent old code from
> +accessing the database. But if the future feature is optional, we
> +have no way of telling if older code is accessing the database or
> +not.
> +
> +2.15.1 Proposed Solution
> +
> +The header should contain a “format variant” value (64-bit). This
> +is divided into two 32-bit parts:
> +
> +1. The lower part reflects the format variant understood by code
> +  accessing the database.
> +
> +2. The upper part reflects the format variant you must understand
> +  to write to the database (otherwise you can only open for
> +  reading).
> +
> +The latter field can only be written at creation time, the former
> +should be written under the OPEN_LOCK when opening the database
> +for writing, if the variant of the code is lower than the current
> +lowest variant.
> +
> +This should allow backwards-compatible features to be added, and
> +detection if older code (which doesn't understand the feature)
> +writes to the database.
> +
> +2.16 Record Headers Are Not Expandible
> +
> +If we later want to add (say) checksums on keys and data, it
> +would require another format change, which we'd like to avoid.
> +
> +2.16.1 Proposed Solution
> +
> +We often have extra padding at the tail of a record. If we ensure
> +that the first byte (if any) of this padding is zero, we will
> +have a way for future changes to detect code which doesn't
> +understand a new format: the new code would write (say) a 1 at
> +the tail, and thus if there is no tail or the first byte is 0, we
> +would know the extension is not present on that record.
> +
>  3 Performance And Scalability Issues
>
>  3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST
> @@ -551,7 +597,7 @@
>  impossible to know what the 'right' answer is at database
>  creation time.
>
> -3.4.1 Proposed Solution
> +3.4.1 <sub:Hash-Size-Solution>Proposed Solution
>
>  After comprehensive performance testing on various scalable hash
>  variants[footnote:
> @@ -559,31 +605,40 @@
>  This was annoying because I was previously convinced that an
>  expanding tree of hashes would be very close to optimal.
>  ], it became clear that it is hard to beat a straight linear hash
> -table which doubles in size when it reaches saturation. There are
> -three details which become important:
> +table which doubles in size when it reaches saturation.
>
> -1. On encountering a full bucket, we use the next bucket.
> +1.
>
> -2. Extra hash bits are stored with the offset, to reduce
> -  comparisons.
> +2.
>
> -3. A marker entry is used on deleting an entry.
> -
> -The doubling of the table must be done under a transaction; we
> -will not reduce it on deletion, so it will be an unusual case. It
> -will either be placed at the head (other entries will be moved
> -out the way so we can expand). We could have a pointer in the
> -header to the current hashtable location, but that pointer would
> -have to be read frequently to check for hashtable moves.
> -
> -The locking for this is slightly more complex than the chained
> -case; we currently have one lock per bucket, and that means we
> -would need to expand the lock if we overflow to the next bucket.
> -The frequency of such collisions will effect our locking
> -heuristics: we can always lock more buckets than we need.
> +3.
>
> -One possible optimization is to only re-check the hash size on an
> -insert or a lookup miss.
> +
> +
> +
> +
> + Unfortunately, altering the hash table introduces serious
> +locking complications: the entire hash table needs to be locked
> +to enlarge the hash table, and others might be holding locks.
> +Particularly insidious are insertions done under tdb_chainlock.
> +
> +Thus an expanding layered hash will be used: an array of hash
> +groups, with each hash group exploding into pointers to lower
> +hash groups once it fills, turning into a hash tree. This has
> +implications for locking: we must lock the entire group in case
> +we need to expand it, yet we don't know how deep the tree is at
> +that point.
> +
> +Note that bits from the hash table entries should be stolen to
> +hold more hash bits to reduce the penalty of collisions. We can
> +use the otherwise-unused lower 3 bits. If we limit the size of
> +the database to 64 exabytes, we can use the top 8 bits of the
> +hash entry as well. These 11 bits would reduce false positives
> +down to 1 in 2000 which is more than we need: we can use one of
> +the bits to indicate that the extra hash bits are valid. This
> +means we can choose not to re-hash all entries when we expand a
> +hash group; simply use the next bits we need and mark them
> +invalid.
>
>  3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended
>
> @@ -660,6 +715,13 @@
>  number of free list entries we can use far fewer, say 1/32 of the
>  number of hash buckets.
>
> +It seems tempting to try to reuse the hash implementation which
> +we use for records here, but we have two ways of searching for
> +free entries: for allocation we search by size (and possibly
> +zone) which produces too many clashes for our hash table to
> +handle well, and for coalescing we search by address. Thus an
> +array of doubly-linked free lists seems preferable.
> +
>  There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented]
>  ) but it's not clear this would reduce contention in the common
>  case where all processes are allocating/freeing the same size.
> @@ -702,24 +764,10 @@
>  4. If the top entry is -large enough, remove it from the list and
>   return it.
>
> -5. Otherwise, coalesce entries in the list.
> -
> -  (a)
> -
> -  (b)
> -
> -  (c)
> -
> -  (d)
> +5. Otherwise, coalesce entries in the list.If there was no entry
> +  large enough, unlock the list and try the next zone.
>
> -6. If there was no entry large enough, unlock the list and try
> -  the next zone.
> -
> -7.
> -
> -8.
> -
> -9. If no zone satisfies, expand the file.
> +6. If no zone satisfies, expand the file.
>
>  This optimizes rapid insert/delete of free list entries by not
>  coalescing them all the time.. First-fit address ordering
> @@ -728,12 +776,23 @@
>  does not need a tailer to coalesce, though if we needed one we
>  could have one cheaply: see [sub:Records-Incur-A].
>
> -
> -
>  I anticipate that the number of entries in each free zone would
>  be small, but it might be worth using one free entry to hold
>  pointers to the others for cache efficiency.
>
> +<freelist-in-zone>If we want to avoid locking complexity
> +(enlarging the free lists when we enlarge the file) we could
> +place the array of free lists at the beginning of each zone. This
> +means existing array lists never move, but means that a record
> +cannot be larger than a zone. That in turn implies that zones
> +should be variable sized (say, power of 2), which makes the
> +question “what zone is this record in?” much harder (and “pick a
> +random zone”, but that's less common). It could be done with as
> +few as 4 bits from the record header.[footnote:
> +Using 2^{16+N*3}means 0 gives a minimal 65536-byte zone, 15 gives
> +the maximal 2^{61} byte zone. Zones range in factor of 8 steps.
> +]
> +
>  3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented
>
>  Much of this is a result of allocation strategy[footnote:
> @@ -838,7 +897,9 @@
>   this is diminishing returns after a handful of bits (at 10
>   bits, it reduces 99.9% of false memcmp). As an aside, as the
>   lower bits are already incorporated in the hash table
> -  resolution, the upper bits should be used here.
> +  resolution, the upper bits should be used here. Note that it's
> +  not clear that these bits will be a win, given the extra bits
> +  in the hash table itself (see [sub:Hash-Size-Solution]).
>
>  5. 'magic' does not need to be enlarged: it currently reflects
>   one of 5 values (used, free, dead, recovery, and
> @@ -881,13 +942,17 @@
>
>         uint64_t total_length;
>
> +        uint64_t prev, next;
> +
>         ...
>
>         uint64_t tailer;
>
>  };
>
> -
> +We might want to take some bits from the used record's top_hash
> +(and the free record which has 32 bits of padding to spare
> +anyway) if we use variable sized zones. See [freelist-in-zone].
>
>  3.8 Transaction Commit Requires 4 fdatasync
>
> @@ -915,14 +980,6 @@
>
>  3.8.1 Proposed Solution
>
> -
> -
> -
> -
> -
> -
> -
> -
>  Neil Brown points out that this is overzealous, and only one sync
>  is needed:
>
> @@ -948,7 +1005,7 @@
>  header field can be used to indicate a transaction in progress;
>  we need only check for recovery if this is set.
>
> -3.9 TDB Does Not Have Snapshot Support
> +3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support
>
>  3.9.1 Proposed Solution
>
> @@ -1048,3 +1105,35 @@
>  must use them) anyway, so there's no need to do this at the same
>  time as everything else.
>
> +3.14 Some Transactions Don't Require Durability
> +
> +Volker points out that gencache uses a CLEAR_IF_FIRST tdb for
> +normal (fast) usage, and occasionally empties the results into a
> +transactional TDB. This kind of usage prioritizes performance
> +over durability: as long as we are consistent, data can be lost.
> +
> +This would be more neatly implemented inside tdb: a “soft”
> +transaction commit (ie. syncless) which meant that data may be
> +reverted on a crash.
> +
> +3.14.1 Proposed Solution
> +
> +None.
> +
> +Unfortunately any transaction scheme which overwrites old data
> +requires a sync before that overwrite to avoid the possibility of
> +corruption.
> +
> +It seems possible to use a scheme similar to that described in [sub:TDB-Does-Not]
> +,where transactions are committed without overwriting existing
> +data, and an array of top-level pointers were available in the
> +header. If the transaction is “soft” then we would not need a
> +sync at all: existing processes would pick up the new hash table
> +and free list and work with that.
> +
> +At some later point, a sync would allow recovery of the old data
> +into the free lists (perhaps when the array of top-level pointers
> +filled). On crash, tdb_open() would examine the array of top
> +levels, and apply the transactions until it encountered an
> +invalid checksum.
> +
>
>
>
>