Thread performance (was Re: dynamic context transitions)

Christopher R. Hertel crh at ubiqx.mn.org
Sat Dec 4 19:26:15 GMT 2004


Thanks, Tridge.  Your explanation nicely fills in the gaps for me.

On Sat, Dec 04, 2004 at 01:17:51PM +1100, tridge at samba.org wrote:
> Chris,
> 
>  > I've been asking about this in different places.  I've heard theories, 
>  > mostly.  This is happening in Linux (dunno if it's been tested elsewhere) 
>  > and one theory is that the forked process speeds are good because Linux 
>  > basically does a really good job with those.  Meanwhile, thread speed is 
>  > bad because the multiple threads are all within a single process and the 
>  > single process gets only its own share of processor time.
> 
> Processes are faster than threads on all OSes that I have tested on
> (that includes Solaris, IRIX, AIX and Linux). The difference is most
> dramatic on the "traditional" unixes where threads _really_ suck
> badly, despite all the hype. On Linux with the latest 2.6 and glibc
> threads have almost caught up with processes, but still lag behind by
> a little.

From what you say further on, it seems that real threads are an
afterthought in the Unix world.  I used to program with threads on Amigas,
where even if the system had an MMU (the early ones didn't) the OS didn't
use it.  All of the system libraries were written to be thread-safe, which
meant that the threading was very fast since locking and contention were
minimized by design.

:
:
> On all modern unixes threads and processes are basically the same
> thing.

I have worked with threading libraries on BSD systems that ran multiple
threads entirely within a single process.  As with some of the (very) old
Java implementations, the side effect was that a single thread calling a
blocking function would cause the entire process (all of its threads) to
block.

What I've heard from you and from Dave CB is that threading has been
re-implemented on Unixy systems such that each thread is now scheduled in 
the same way as a process.  As you say...

> The principal difference is that in threads memory is shared by
> default, and you have to do extra work to set it up as non-shared,
> whereas with processes memory is not shared by default and you have to
> do extra work to make it shared. Both systems have the same
> fundamental capabilities, it's just the defaults that change.

That's a good basic definition.  It doesn't cover all possible threading 
models, of course, but it gives me a good sense of what is meant *today* 
when people talk about threads.
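
Just to pin down what the defaults mean in code, here is a minimal sketch
(my own, not anything from Samba, with error checking omitted): the global
is visible to the second thread automatically, while the forked child's
update is only visible because we asked for a MAP_SHARED mapping up front.

    /* sketch: shared-by-default (threads) vs. shared-on-request (processes) */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int counter;                 /* every thread sees this */

    static void *worker(void *arg)
    {
        (void)arg;
        counter = 42;                   /* the main thread sees the write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int *shared;

        /* Threads: memory is shared by default. */
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        printf("thread wrote %d\n", counter);

        /* Processes: memory is private by default; sharing is opt-in. */
        shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *shared = 0;
        if (fork() == 0) {
            *shared = 42;               /* visible only because of MAP_SHARED */
            _exit(0);
        }
        wait(NULL);
        printf("child wrote  %d\n", *shared);
        return 0;
    }

(Compile with -lpthread.  A private mapping, or a plain global, would
leave the parent seeing 0 in the fork() case.)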

> Now to the interesting bit. Because memory is shared by default, the C
> library has to assume that memory that it is working with is shared if
> you are using threads. That means it must add lock/unlock pairs around
> lots of internal code. If you don't use threads then the C library
> assumes that the programmer is smart enough to put locks on their own
> shared memory if they need them.

That is the interesting bit.  From your description (and remembering
writing multi-threaded applications on non-MMU systems whose system calls
were designed to be thread-safe), it seems to me that this kind of
threading was back-fitted onto Unix/POSIX, making it necessary to add
extra controls around a variety of calls that could have been made
reentrant, or at least thread-safe, from the get-go had threads been in
the original plans.
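
A visible example of those added lock/unlock pairs is stdio: when the C
library is built for threads, every getc()/putc() takes the stream's lock.
POSIX at least lets the caller grab the lock once and use the _unlocked
variants inside, roughly like this (just an illustration, not Samba code):

    #include <stdio.h>

    /* Copy one line; take each stream's lock once instead of per character. */
    void copy_line(FILE *in, FILE *out)
    {
        int c;

        flockfile(in);
        flockfile(out);
        while ((c = getc_unlocked(in)) != EOF && c != '\n')
            putc_unlocked(c, out);
        funlockfile(out);
        funlockfile(in);
    }

That per-call locking is the kind of overhead Tridge is describing.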

Programming in a shared-memory environment takes a different kind of
thinking: things like not returning pointers to static memory from a
function call.  Basically, it requires a lot more attention to the
management of state.
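
The textbook case is the libc calls that return a pointer into a static
buffer.  A quick sketch of the reentrant alternative (standard POSIX
calls, nothing Samba-specific):

    #include <stdio.h>
    #include <time.h>

    void show_time(time_t t)
    {
        /* localtime() hands back a pointer to one static struct tm, so two
         * threads calling it at once can overwrite each other's result.
         * localtime_r() writes into caller-supplied storage instead. */
        struct tm result;
        char buf[64];

        localtime_r(&t, &result);
        strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", &result);
        printf("%s\n", buf);
    }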

By the way, you may have noticed that all of the ubiqx binary tree stuff
is written so that it can be called reentrantly.  The trees themselves
require some form of locking mechanism to protect the structure of the
tree.  My first implementation used Amiga semaphores.  :)
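
In practice the caller-supplied locking ends up looking something like the
sketch below.  (The tree_insert() wrapper and the struct names are made up
for illustration; only the "caller holds the lock" shape is the point.)

    #include <pthread.h>

    struct my_tree;
    struct my_node;
    void tree_insert(struct my_tree *tree, struct my_node *node);

    static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

    /* The tree code is reentrant but does no locking of its own, so the
     * caller serializes access to the structure. */
    void locked_insert(struct my_tree *tree, struct my_node *node)
    {
        pthread_mutex_lock(&tree_lock);
        tree_insert(tree, node);
        pthread_mutex_unlock(&tree_lock);
    }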

> Put another way, with processes you are using the hardware memory
> protection tables to do all the hard work, and that is essentially
> free. With threads the C library has to do all that work itself, and
> that is _slow_. 

Yep, that's clear.  Thanks again for filling in the gaps.

> With the latest glibc and kernel this problem has been reduced on
> Linux by some really smart locking techniques. It is an impressive
> piece of work, and means that for Linux threads now suck less than
> they do on other platforms, but they are still not faster than
> processes.

It's supposed to take less time to switch between threads than it does to
switch between process contexts.  That's one of the things that is
supposed to make threads 'faster'.  Given the speed of MMUs, I'm not sure
how much is gained by avoiding the memory context switch.

Even if you do gain something, if the scheduler views threads and
processes as equivalent then it may interleave other processes between the
threads of a single thread group, thus causing the context switch to
happen anyway.

Threads are also supposed to be faster at IPC, but the added overhead of 
the extra locking would easily offset that advantage.

> So why do some people bother with threads? It is for convenience. It
> makes some types of programming easier, but it does _not_ make it
> faster.

It *could*, but only in an OS that was designed for it.

> The "threads are fast" meme is a complete fallacy, much like
> the common meme of CPUs running faster for in-kernel code.

I disagree here, but it's a theory vs. practice argument.  Threads *could*
be faster, but you'd have to build an OS that was designed with that in
mind.  That means the scheduler would have to be thread-aware (e.g.
scheduling the threads of a group together to reduce context switching).
System libraries and kernel calls would also need to be thread-aware
(e.g. reentrant code and an overall reduction in the number of places in
which locking and state management were required).

You wouldn't want to sacrifice process speed just to make threads seem
fast.  Instead, threads would cut corners (sacrificing built-in safety,
most likely) to be faster than already-fast processes.

> What is true is that on almost all platforms _creating_ a thread is
> cheaper than creating a process. That can matter for some applications
> where the work to be done takes only a few cycles (like spawn-thread,
> add two numbers, then kill thread). Thread benchmarks tend to be in
> this category. File servers are not.
>
> For a file server you generally want your unit of processing to last
> for seconds to hours or days. In that case the few nanoseconds saved
> in the thread creation is not relevant.

Yep.  That all makes good sense.
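
The creation-cost difference is easy to see with something like the rough
micro-benchmark below (my own throwaway sketch, not a serious benchmark;
compile with -lpthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *noop(void *arg) { return arg; }

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const int n = 1000;
        double t0;
        int i;

        t0 = now();
        for (i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, NULL, noop, NULL);
            pthread_join(t, NULL);
        }
        printf("%d thread create/join: %.3fs\n", n, now() - t0);

        t0 = now();
        for (i = 0; i < n; i++) {
            if (fork() == 0)
                _exit(0);
            wait(NULL);
        }
        printf("%d fork/wait:          %.3fs\n", n, now() - t0);
        return 0;
    }

...and, as you say, whichever way that comes out, it's noise next to a
connection that lives for hours.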

> The other big thing that is bad about threads is that the designers of
> the thread APIs (like pthreads) did not consider file servers to be
> important, so they completely screwed up on several aspects of the
> API, so that the convenience of using threads is totally lost. A good
> example is the way threads interact with byte range locks. It is
> impossible for one thread to "lock" a byte range such that another
> thread can see the lock. 

!?  Wow.  I would have expected otherwise.  Since the threads are in the 
same context, I would have expected that they would be able to "see" 
everything that other threads in the same group are doing.
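
As I understand it, the reason is that fcntl() record locks are owned by
the process, not the thread: from the kernel's point of view both threads
are the same owner, so a second request on the same range just succeeds
(or merges with the first) instead of conflicting.  Roughly, as a sketch:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Returns 0 on "success", and it will succeed even if another thread
     * of this same process already holds a conflicting lock, because POSIX
     * record locks belong to the process as a whole. */
    int lock_range(int fd, off_t start, off_t len)
    {
        struct flock fl;

        memset(&fl, 0, sizeof fl);
        fl.l_type   = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start  = start;
        fl.l_len    = len;
        return fcntl(fd, F_SETLK, &fl);
    }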

> Most of these API deficiencies could be fixed by making
> pthread_create() have an option on Linux to not pass CLONE_FILES or
> CLONE_FS to the clone() system call. If that was done then threads
> would start being a lot more palatable for file servers.

Well... if anyone can get the attention of the developers it'd be you.  :)
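
For anyone following along, the idea maps onto clone() roughly like the
sketch below: share the address space the way a thread does, but leave
CLONE_FILES and CLONE_FS out so the child gets its own descriptor table
and cwd.  (Illustration only; today's pthread_create() offers no such
knob, and this assumes a downward-growing stack.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int worker(void *arg)
    {
        (void)arg;
        /* ...could open files, chdir(), take byte-range locks, etc.
         * without disturbing the other "threads"... */
        return 0;
    }

    int main(void)
    {
        size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);

        /* CLONE_VM shares memory like a thread; omitting CLONE_FILES and
         * CLONE_FS gives the child its own fd table and cwd. */
        pid_t pid = clone(worker, stack + stack_size,
                          CLONE_VM | SIGCHLD, NULL);

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }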

Thanks.  This is really helpful.

Chris -)-----

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org

