Ring buffer, linked list, etc.
Christopher R. Hertel
crh at ubiqx.mn.org
Wed Aug 10 17:57:21 GMT 2005
On Wed, Aug 10, 2005 at 09:38:41AM +0200, Stefan Metzmacher wrote:
> Am Sonntag, 7. August 2005 07:06 schrieb Christopher R. Hertel:
> Hi Chris,
Hi, Stefan! :)
> could you give me a should introducing about what you're talking,
> I mean what was the topic of the discusion, and to what code does this
Sure... but be careful what you ask for. I like to write... :)
I'm also including more information than you probably need, just in case
it's useful for others on the list.
Samba4 maintains an event queue. That queue is currently held in a doubly
linked list, which (based on testing I did) is probably the optimal data
structure to use as long as the list is relatively short--under about 100
Unfortunately, there are situations in which the linked list can grow to
very large sizes. If that happens, some operations are still fast but
anything that involves searching the list can become very, very slow. In
general, the average seek time on a linked list is in proportion to n/2
(where n is the number of nodes). In other words, it's O(n).
There are three things that happen with respect to this linked list:
1) New entries are added to the list.
The list has to be kept in sorted order (I'll explain why in a moment),
so adding an entry to the list requires performing a linear search of
the list. That's fairly fast on short lists, but the cost goes up
quickly as the list grows (again, O(n)).
2) Remove entries as events occur.
Each entry in the queue represents some event that the system is
waiting to have occur (eg., a response to an NBT query). As I
understand it, there is an event record held somewhere else in the
system that has a pointer to the entry in the event queue. If the
event occurs, the queue entry (no matter where in the queue it resides)
can be removed. This is really quick with a doubly linked list. It's
just a little pointer splicing.
3) Remove expired entries from the head of the queue.
If the event doesn't occur within its timeout period, it needs to be
removed from the head of the event queue and the failure needs to be
handled. This is why the list is in sorted order. Lowest numbers
(representing the Unix time at which the event times out) are at the
head of the list.
So... if the list gets large (eg., if some bozo is flooding the server
with requests of some sort) then adding new entries could take a very long
time, slowing down the whole system. That'd be bad.
The first, most obvious, and easiest optimization is that the search done
on insertion of a new entry should start from the *tail* of the list.
Currently it starts from the front. If we're getting a flood of events
that all have a 2-second timeout, each new 2-second entry will be entered
near the end of the list, so starting the search from the tail instead of
the head is likely to be a big win.
Tridge didn't think that it would be enough of a win, however, and he was
interested in trying a different approach. He suggested using a ring
buffer to store the queue. His idea included:
- Insertions could be done by memmoving entries.
- The size of the buffer could be increased as needed by reallocating
- There would be a large gain in search speed because you can use a binary
search on this structure.
I thought about this for a while. If my understanding of point #2, above,
is correct then this type of ring buffer won't work because it will
invalidate the pointers held by the event objects.
When an event occurs, the event object (which is not the same as the event
queue entry) finds the correct event queue entry by following the pointer
it keeps. That won't work if we're memmoving the event queue entries.
So... The scheme I am proposing is this:
Start by allocating a block of, say, 128 event queue nodes.
- These would be doubly-linked-list nodes, just like the ones being used
now. The difference is that they would be pre-allocated so that we're
not allocating new memory each time we create a new node.
- String the nodes together in a *singly*-linked free list.
- Whenever a node is needed for a new event queue entry, just pull one
from the free list. If the free list is empty, then we would allocate
a new block of 128 event queue nodes and add them to the free list.
- Whenever a node is removed from the event queue, the following
+ The node is removed from the doubly-linked-list.
+ The node's prev pointer (pointer to the previous node) is set to
+ The node is placed back into the singly-linked freelist (linked via
the 'next' pointer).
That's the basics of maintaining the list as a whole, but it's just a
slightly more formalized version of the same thing we already have.
So far, I haven't done anything to improve node seek time.
My first thought was to add a skiplist to the above, but that adds
overhead and I couldn't find a good algorithm for maintaining a balanced
A skiplist is a shorter linked list that has pointers into the main linked
list. The skiplist is used to skip past several nodes at a time, thus
shortening the search time.
skiplist: A ------------> B ----> C ----> D...
| | | |
V V V V
If you're searching for C9 you would search down the skiplist until you
got to D (too far) then either search the eventq backward from D2 or
backtrack in the skiplist to C and start searching forward from C8.
So... here's my cheap way of doing a skiplist:
- There will be a bit of churn in the event queue because entries will
be removed somewhat randomly. As a result, the nodes in the
pre-allocated blocks will be in no particular order. Some will be
free, some in use...
- If there are multiple blocks, nodes will be scattered across them.
- We can do a cheap form of skiplist by doing the following:
+ Set a pair of search pointers to point to the first and last entries
in the queue.
+ If the event queue has more than about 200 entries in it,
check the first node of each block of nodes. This amounts to a
* If the node's prev pointer is NULL, then it's either the first
node in the event queue or it's an unused node (a node in the free
list), so we can ignore it.
* Otherwise, check the value and update the search pointers
+ Finally, search through the linked list starting with whichever
search pointer is closest to the value you're searching for and
heading towards the other search pointer.
- Basically, all this does is narrow down the search by checking a few
random nodes in the list. That's probably good enough for the kind of
processing we're doing. Note that in the degenerate case of the Very
Big List we would have a larger random sampling so the algorithm would
be more effective. Also note that the number of nodes in a block has
an impact. If this method is adopted, we'd probably want to use 64
node blocks instead of 128 node blocks. Worth testing...
One more variation that might improve things:
- Instead of keeping a single freelist of all free nodes, we could keep
one freelist per block. This means that each block would have to have
a freelist pointer (and a freelist counter).
+ We would know when a block was unused, so it could be freed.
+ We could ensure that the first block is filled before allocating
nodes from the next block, etc. That increases the chances that we
will have empty blocks to free.
- One problem is that we need to know the block to which a node belongs
in order to assign it to the correct freelist when it's removed from
the event queue. The safest way to do that is to have each node
contain a pointer back to the freelist header. That eats up a bit
more memory but may be worth it.
- One more tweak would be to ensure that the first node of each block,
if free, is always at the head of the freelist (others would be added
to the tail). That increases the likelihood that the first node will
be in use, which increases the likelihood that our cheap skiplist will
Hope that gives you a picture. If I get time in the next few weeks I
might build a ubiqx module that does all this. It would be a nice piece
to have added to my toolkit and, despite my lengthy description, it's
really very simple. Once it's written, I can test it against the linked
list and splay tree code I've already got.
I should mention that Tridge suggested several tests to see if the splay
tree would work better than the current linked list. I found that, in the
degenerate case of a very large list, the splay tree was a vast
improvement over the plain linked list. Unfortunately, in the normal
situation (short lists) the splay tree was about five times slower.
(Still quite fast, but those individual operations add up.)
The system I've described above has all of the good characteristics of the
plain linked list, plus features that should reduce memory management
overhead, and a few tweaks that will likely improve the situation in the
degenerate case. The costs include some additional memory usage and a
little bit of additional housekeeping.
The housekeeping can be done out-of-band, however. For instance, we could
add an event to the queue with a ten minute timeout that would be used to
run through the block list and free any empty blocks (keeping a minimum of
one more than we currently need, or somesuch).
Okay... I've gone overboard.
> > We also talked about using skip lists for the event queue, but the problem
> > we ran into was: If we get an arbitrary delete (caused by an event
> > actually occuring) the skip list pointers would no longer be valid and
> > could mess things up.
> > So: If we pre-tallocate blocks of doubly-linked-list nodes (in, say,
> > groups of 128 nodes each), then instead of freeing the memory when a
> > node is deleted we would:
> > = remove it from the event list
> > = add it to a singly-linked free list
> > = set the event timeout value to zero
> > In the event queue, only the first node in the linked list will have
> > a NULL previous pointer, so we can ignore skip list pointers that
> > point to nodes with a NULL previous pointer. We would probably mark
> > them as unused so that they can be re-used (another free list).
> > If the free list ever runs out, we would tallocate another 128
> > (or whatever) nodes and rebuild the node free list.
> > I'm thinking that we would probably set aside one skip node per 128
> > event nodes. The skip nodes could also be arranged as a
> > doubly-linked list.
> > I'd like to code this up. I have a very full week at work, however, and
> > my kids haven't seen me for a week so if anyone else wants to do it please
> > go ahead. :)
> > Chris -)-----
> > On Sat, Aug 06, 2005 at 10:43:43PM -0500, Christopher R. Hertel wrote:
> > > Tridge, Volker,
> > >
> > > Once I got home and had my first decent cup of tea in a week, a thought
> > > occurred to me regarding the idea of using a ring buffer.
> > >
> > > As I understand it, f an event is received before the queue entry times
> > > out then the entry is removed from the event queue. This is done by way
> > > of a pointer directly to the event queue entry.
> > >
> > > With a doubly-linked-list or a binary tree, that removal can occur
> > > without any traversal of the list. The pointer to the event node is all
> > > you need.
> > >
> > > If we build a ring buffer, however, and use memmove() to open spaces when
> > > we need to insert new nodes, we'll mess up any external pointers. You'll
> > > have to search for the correct node each time.
> > >
> > > Using a ring buffer, the way to get around this is to mark nodes as
> > > "Deleted" if the event arrives before timeout.
> > >
> > > That still leaves the problem of adding new nodes to the ring buffer. I
> > > have some ideas on how to do that...
> > >
> > > Let me know first, though, if I've misunderstood the way this code
> > > operates.
> > >
> > > Chris -)-----
> > >
> > > --
> > > "Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
> > > Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel
> > > jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development,
> > > uninq. ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org
> > > OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org
> Stefan (metze) Metzmcher <metze at samba dot org>
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org
More information about the samba-technical