Ring buffer, linked list, etc.

Christopher R. Hertel crh at ubiqx.mn.org
Wed Aug 10 17:57:21 GMT 2005

On Wed, Aug 10, 2005 at 09:38:41AM +0200, Stefan Metzmacher wrote:
> Am Sonntag, 7. August 2005 07:06 schrieb Christopher R. Hertel:
> Hi Chris,

Hi, Stefan!  :)

> could you give me a short introduction about what you're talking about,
> I mean what was the topic of the discussion, and to what code does this 
> belong...

Sure... but be careful what you ask for.  I like to write...  :)
I'm also including more information than you probably need, just in case 
it's useful for others on the list.


Samba4 maintains an event queue.  That queue is currently held in a doubly 
linked list, which (based on testing I did) is probably the optimal data 
structure to use as long as the list is relatively short--under about 100 
nodes.

Unfortunately, there are situations in which the linked list can grow to
very large sizes.  If that happens, some operations are still fast but
anything that involves searching the list can become very, very slow.  In
general, the average seek time on a linked list is in proportion to n/2
(where n is the number of nodes).  In other words, it's O(n).

There are three things that happen with respect to this linked list:

1) New entries are added to the list.
   The list has to be kept in sorted order (I'll explain why in a moment),
   so adding an entry to the list requires performing a linear search of
   the list.  That's fairly fast on short lists, but the cost goes up 
   quickly as the list grows (again, O(n)).

2) Remove entries as events occur.
   Each entry in the queue represents some event that the system is 
   waiting to have occur (eg., a response to an NBT query).  As I 
   understand it, there is an event record held somewhere else in the 
   system that has a pointer to the entry in the event queue.  If the 
   event occurs, the queue entry (no matter where in the queue it resides) 
   can be removed.  This is really quick with a doubly linked list.  It's 
   just a little pointer splicing.

3) Remove expired entries from the head of the queue.
   If the event doesn't occur within its timeout period, it needs to be 
   removed from the head of the event queue and the failure needs to be 
   handled.  This is why the list is in sorted order.  Lowest numbers
   (representing the Unix time at which the event times out) are at the
   head of the list.

So... if the list gets large (eg., if some bozo is flooding the server 
with requests of some sort) then adding new entries could take a very long 
time, slowing down the whole system.  That'd be bad.

The first, most obvious, and easiest optimization is that the search done
on insertion of a new entry should start from the *tail* of the list.  
Currently it starts from the front.  If we're getting a flood of events
that all have a 2-second timeout, each new 2-second entry will be entered
near the end of the list, so starting the search from the tail instead of
the head is likely to be a big win.

Tridge didn't think that it would be enough of a win, however, and he was 
interested in trying a different approach.  He suggested using a ring 
buffer to store the queue.  His idea included:

- Insertions could be done by memmoving entries.
- The size of the buffer could be increased as needed by reallocating it.
- There would be a large gain in search speed because you can use a binary 
  search on this structure.

I thought about this for a while.  If my understanding of point #2, above,
is correct then this type of ring buffer won't work because it will
invalidate the pointers held by the event objects.

When an event occurs, the event object (which is not the same as the event
queue entry) finds the correct event queue entry by following the pointer
it keeps.  That won't work if we're memmoving the event queue entries.

So...  The scheme I am proposing is this:

Start by allocating a block of, say, 128 event queue nodes.
  - These would be doubly-linked-list nodes, just like the ones being used
    now.  The difference is that they would be pre-allocated so that we're
    not allocating new memory each time we create a new node.

  - String the nodes together in a *singly*-linked free list.

  - Whenever a node is needed for a new event queue entry, just pull one
    from the free list.  If the free list is empty, then we would allocate
    a new block of 128 event queue nodes and add them to the free list.

  - Whenever a node is removed from the event queue, the following 
    happens:
    + The node is removed from the doubly-linked-list.
    + The node's prev pointer (pointer to the previous node) is set to
      NULL.
    + The node is placed back into the singly-linked freelist (linked via
      the 'next' pointer).

That's the basics of maintaining the list as a whole, but it's just a 
slightly more formalized version of the same thing we already have.
So far, I haven't done anything to improve node seek time.


My first thought was to add a skiplist to the above, but that adds
overhead and I couldn't find a good algorithm for maintaining a balanced
skiplist.

A skiplist is a shorter linked list that has pointers into the main linked 
list.  The skiplist is used to skip past several nodes at a time, thus 
shortening the search time.

skiplist:   A ------------> B ----> C ----> D...
            |               |       |       |
            V               V       V       V
  eventq:   A1->A2->A3->A6->B3->B8->C8->C9->D2->D3->D4->D5...

If you're searching for C9 you would search down the skiplist until you
got to D (too far) then either search the eventq backward from D2 or
backtrack in the skiplist to C and start searching forward from C8.

So...  here's my cheap way of doing a skiplist:

  - There will be a bit of churn in the event queue because entries will 
    be removed somewhat randomly.  As a result, the nodes in the 
    pre-allocated blocks will be in no particular order.  Some will be 
    free, some in use...

  - If there are multiple blocks, nodes will be scattered across them.

  - We can do a cheap form of skiplist by doing the following:
    + Set a pair of search pointers to point to the first and last entries 
      in the queue.
    + If the event queue has more than about 200 entries in it,
      check the first node of each block of nodes.  This amounts to a
      random sampling.
      * If the node's prev pointer is NULL, then it's either the first
        node in the event queue or it's an unused node (a node in the free
        list), so we can ignore it.
      * Otherwise, check the node's timeout value and update the search
        pointers accordingly.
    + Finally, search through the linked list starting with whichever 
      search pointer is closest to the value you're searching for and 
      heading towards the other search pointer.

  - Basically, all this does is narrow down the search by checking a few 
    random nodes in the list.  That's probably good enough for the kind of 
    processing we're doing.  Note that in the degenerate case of the Very 
    Big List we would have a larger random sampling so the algorithm would 
    be more effective.  Also note that the number of nodes in a block has 
    an impact.  If this method is adopted, we'd probably want to use 64 
    node blocks instead of 128 node blocks.  Worth testing...

One more variation that might improve things:

  - Instead of keeping a single freelist of all free nodes, we could keep 
    one freelist per block.  This means that each block would have to have 
    a freelist pointer (and a freelist counter).

  - Advantages:
    + We would know when a block was unused, so it could be freed.
    + We could ensure that the first block is filled before allocating
      nodes from the next block, etc.  That increases the chances that we
      will have empty blocks to free.

  - One problem is that we need to know the block to which a node belongs
    in order to assign it to the correct freelist when it's removed from
    the event queue.  The safest way to do that is to have each node
    contain a pointer back to the freelist header.  That eats up a bit 
    more memory but may be worth it.

  - One more tweak would be to ensure that the first node of each block,
    if free, is always at the head of the freelist (others would be added
    to the tail).  That increases the likelihood that the first node will
    be in use, which increases the likelihood that our cheap skiplist will
    be effective.

Hope that gives you a picture.  If I get time in the next few weeks I
might build a ubiqx module that does all this.  It would be a nice piece
to have added to my toolkit and, despite my lengthy description, it's
really very simple.  Once it's written, I can test it against the linked
list and splay tree code I've already got.

I should mention that Tridge suggested several tests to see if the splay
tree would work better than the current linked list.  I found that, in the
degenerate case of a very large list, the splay tree was a vast
improvement over the plain linked list.  Unfortunately, in the normal
situation (short lists) the splay tree was about five times slower.  
(Still quite fast, but those individual operations add up.)

The system I've described above has all of the good characteristics of the 
plain linked list, plus features that should reduce memory management 
overhead, and a few tweaks that will likely improve the situation in the 
degenerate case.  The costs include some additional memory usage and a 
little bit of additional housekeeping.

The housekeeping can be done out-of-band, however.  For instance, we could 
add an event to the queue with a ten minute timeout that would be used to 
run through the block list and free any empty blocks (keeping a minimum of 
one more than we currently need, or somesuch).

Okay... I've gone overboard.

> thanks!

Most welcome.

Chris -)-----

> metze
> > We also talked about using skip lists for the event queue, but the problem
> > we ran into was:  If we get an arbitrary delete (caused by an event
> > actually occuring) the skip list pointers would no longer be valid and
> > could mess things up.
> >
> > So:  If we pre-tallocate blocks of doubly-linked-list nodes (in, say,
> >      groups of 128 nodes each), then instead of freeing the memory when a
> >      node is deleted we would:
> >      = remove it from the event list
> >      = add it to a singly-linked free list
> >      = set the event timeout value to zero
> >
> >      In the event queue, only the first node in the linked list will have
> >      a NULL previous pointer, so we can ignore skip list pointers that
> >      point to nodes with a NULL previous pointer.  We would probably mark
> >      them as unused so that they can be re-used (another free list).
> >
> >      If the free list ever runs out, we would tallocate another 128
> >      (or whatever) nodes and rebuild the node free list.
> >
> >      I'm thinking that we would probably set aside one skip node per 128
> >      event nodes.  The skip nodes could also be arranged as a
> >      doubly-linked list.
> >
> > I'd like to code this up.  I have a very full week at work, however, and
> > my kids haven't seen me for a week so if anyone else wants to do it please
> > go ahead.  :)
> >
> > Chris -)-----
> >
> > On Sat, Aug 06, 2005 at 10:43:43PM -0500, Christopher R. Hertel wrote:
> > > Tridge, Volker,
> > >
> > > Once I got home and had my first decent cup of tea in a week, a thought
> > > occurred to me regarding the idea of using a ring buffer.
> > >
> > > As I understand it, if an event is received before the queue entry times
> > > out then the entry is removed from the event queue.  This is done by way
> > > of a pointer directly to the event queue entry.
> > >
> > > With a doubly-linked-list or a binary tree, that removal can occur
> > > without any traversal of the list.  The pointer to the event node is all
> > > you need.
> > >
> > > If we build a ring buffer, however, and use memmove() to open spaces when
> > > we need to insert new nodes, we'll mess up any external pointers.  You'll
> > > have to search for the correct node each time.
> > >
> > > Using a ring buffer, the way to get around this is to mark nodes as
> > > "Deleted" if the event arrives before timeout.
> > >
> > > That still leaves the problem of adding new nodes to the ring buffer.  I
> > > have some ideas on how to do that...
> > >
> > > Let me know first, though, if I've misunderstood the way this code
> > > operates.
> > >
> > > Chris -)-----
> > >
> > > --
> > > "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
> > > Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
> > > jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
> > > ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
> > > OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org
> -- 
> metze
> --
> Stefan (metze) Metzmacher <metze at samba dot org>

"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org

More information about the samba-technical mailing list