Thread performance (was Re: dynamic context transitions)

tridge at samba.org
Sat Dec 4 22:19:55 GMT 2004


Chris,

 > From what you say further on it seems that real threads are an
 > afterthought in the Unix world.  I used to program with threads on Amigas,
 > where even if the system had an MMU (the early ones didn't) the OS didn't
 > use it.  All of the system libraries were written to be thread-safe, which
 > meant that the threading was very fast since locking and contention were
 > minimized by design.

no, you're still clinging to the notion that threads are somehow
inherently faster than processes. They aren't. They are inherently
slower, no matter what OS you are talking about.

Some OSes might implement processes so badly that threads come out
ahead. But it is fundamental computer science that doing operations
in a threaded environment will be slower than doing the same
operations in an equivalent process-based environment, because the
threaded versions have to do more work.

Using processes allows you to take advantage of a hardware memory
protection system. Using threads doesn't.

 > I have worked with threading libraries on BSD systems which created
 > multiple threads within a single process.  As with some of the (very) old
 > Java implementations, the side effect was that a single thread calling a 
 > blocking function would cause the entire process (all threads) to block.

yes, that's the old Java green-threads system. I'm painfully familiar
with that, having ported it to SPARC Linux many years ago. It's an
ugly hack that implements a user-space scheduler on top of a
coroutines package. It is not worth considering as a serious threading
system.

 > What I've heard from you and from Dave CB is that threading has been
 > re-implemented on Unixy systems such that each thread is now scheduled in 
 > the same way as a process.  As you say...

yes, and fundamentally you _have_ to do this on any sane system.

 > That is the interesting bit.  From your description, it seems to me
 > (remembering writing multi-threaded applications on non-MMU systems that
 > had system calls that were designed to be thread-safe) that this kind of
 > threading has been back-fitted into Unix/posix, making it necessary to add
 > extra controls around a variety of calls that could have been made
 > reentrant or at least thread-safe from the get-go had that been in the
 > original plans.

no, you are still missing the point.

You can use good data structures to lower the cost of the locks, but
you cannot get it to zero. With processes the hardware takes care of
it, so the cost is zero. It has nothing to do with retrofitting of
APIs; it's just a hardware-accelerated versus software problem.

 > By the way, you may have noticed that all of the ubiqx binary tree stuff
 > is written so that it can be called reentrantly.  The trees themselves
 > require some form of locking mechanism to protect the structure of the
 > tree.  My first implementation used Amiga semaphores.  :)

and that locking mechanism costs you. It might not cost you very much
if done well, but it does cost you. With processes that code
disappears completely. You can't go faster than zero code.
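
To make that concrete, here is a hypothetical sketch (mine, not Samba
code) of the same counter update done both ways:

 /* hypothetical illustration: a counter shared between threads needs
    a lock; the per-process copy needs none, because the MMU
    guarantees nobody else can touch it */
 #include <pthread.h>

 static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
 static long request_count;

 /* threads: every update pays for a lock/unlock pair */
 void count_request_threaded(void)
 {
	pthread_mutex_lock(&count_lock);
	request_count++;
	pthread_mutex_unlock(&count_lock);
 }

 /* processes: private copy, so the locking code just doesn't exist */
 void count_request_process(void)
 {
	request_count++;
 }

The threaded version is about as cheap as locking gets, but it is
still strictly more code than the process version.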

 > It's supposed to take less time to switch between threads than it does to
 > switch between process contexts.  That's one of the things that is
 > supposed to make threads 'faster'.  Given the speed of MMUs I'm not sure
 > how much is gained by avoiding the memory context switch.

This is another part of the myth. Have you thought about the orders of
magnitude here? With process switching on a modern CPU you basically
have to swap one more register. That's one extra instruction. Modern
CPUs have nanosecond cycle times.

Now, some CPUs also need to do an extra TLB flush or equivalent, but
even that is cheap on all but the worst CPUs.

Compare this to the work that a file server has to do in responding to
a packet. Say it's an SMBread of 4k. That is a 4k copy of memory, and
memory is slow. Typically that SMBread will take tens of thousands of
times longer than the context switch.
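
To put rough, illustrative numbers on it (my own back-of-the-envelope,
not a measurement): even at a memory bandwidth of 1 GB/s the 4k copy
alone takes around 4 microseconds, i.e. thousands of nanosecond-scale
register swaps, and that is before any of the protocol parsing,
filesystem or network work is counted.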

But by saving that nanosecond you will make the read() system call
slower! Why? Because in the kernel the file descriptor number needs to
be mapped from an integer to a pointer to a structure. That means
looking up a table, and that table needs to be locked. If you have 100
threads doing this then they all lock the _same_ structure, so you get
contention, and suddenly your 16-CPU system is not scaling any
more. With processes they lock _different_ structures, so no
contention, and better scaling.
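
If you want to see the effect rather than take my word for it, a rough
micro-benchmark sketch (my own construction, nothing from Samba) is to
run the same fd-table-hammering loop once from threads sharing one
table and once from forked processes, and time the two phases on a
multi-CPU box:

 /* hypothetical micro-benchmark: NTASKS tasks hammering the fd table,
    first as threads (one shared table), then as processes (one each) */
 #include <fcntl.h>
 #include <pthread.h>
 #include <sys/wait.h>
 #include <unistd.h>

 #define NTASKS 4
 #define NLOOPS 1000000

 static void *hammer(void *arg)
 {
	int i;
	(void)arg;
	for (i = 0; i < NLOOPS; i++) {
		int fd = open("/dev/null", O_RDONLY);	/* locks the fd table */
		close(fd);				/* ... and again */
	}
	return NULL;
 }

 int main(void)
 {
	pthread_t tids[NTASKS];
	int i;

	/* phase 1 - threads: all tasks contend on one shared fd table */
	for (i = 0; i < NTASKS; i++)
		pthread_create(&tids[i], NULL, hammer, NULL);
	for (i = 0; i < NTASKS; i++)
		pthread_join(tids[i], NULL);

	/* phase 2 - processes: each child has its own fd table */
	for (i = 0; i < NTASKS; i++)
		if (fork() == 0) {
			hammer(NULL);
			_exit(0);
		}
	for (i = 0; i < NTASKS; i++)
		wait(NULL);
	return 0;
 }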

This lock contention can be fixed with some really smart programming
(like using RCU), and recently that has been done in Linux. That's one
reason why Linux sucks less than other systems for threads.

 > Threads are also supposed to be faster at IPC, but the added overhead of 
 > the extra locking would easily offset that advantage.

another myth. Try thinking about this: how do threads do their IPC?
They use the same system calls and mechanisms that are available to
processes. The difference is that in a threads library these
mechanisms are wrapped into a nice API that makes IPC really easy.
You can do exactly the same types of IPC with processes, you just
need to write a bit more code.
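
As a minimal sketch of what "a bit more code" can look like (an
illustration I made up, not how any particular server does it), here
is a value shared between forked processes via an anonymous shared
mapping:

 /* hypothetical sketch: shared-memory IPC between processes */
 #define _GNU_SOURCE		/* for MAP_ANONYMOUS */
 #include <stdio.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <unistd.h>

 int main(void)
 {
	/* one word visible to the parent and every child it forks */
	long *shared = mmap(NULL, sizeof(*shared),
			    PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	*shared = 0;
	if (fork() == 0) {
		*shared = 42;		/* child writes ... */
		_exit(0);
	}
	wait(NULL);
	printf("parent sees %ld\n", *shared);	/* ... parent reads 42 */
	return 0;
 }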

For many things, perhaps even for some file server applications, that
extra convenience is worthwhile. Convenience means simpler code which
means fewer bugs. So I'm not saying to never use threads, I'm just
trying to kill this persistent meme that says threads are somehow
faster. That's like believing in the tooth fairy.

 > It *could*, but only in an OS that was designed for it.

no, it couldn't be.

 > I disagree here, but it's a theory vs. practice argument.  Threads *could*
 > be faster, but you'd have to build an OS that was designed with that in
 > mind.

nope

 > You wouldn't want to sacrifice process speed just to make threads seem
 > fast.  Instead, threads would cut corners (sacrificing built-in safety,
 > most likely) to be faster than already-fast processes.

oh, if you're willing to sacrifice data integrity then sure, all bets
are off. In that case just do this:

 /* the "fast" version: claim success without touching the data */
 ssize_t read(int fd, void *buf, size_t size) {
	(void)fd; (void)buf;	/* never actually read anything */
	return size;
 }

hey, it will be _really_ fast. The sad thing is that many benchmarks
will even run correctly.

 > !?  Wow.  I would have expected otherwise.  Since the threads are in the 
 > same context, I would have expected that they would be able to "see" 
 > everything that other threads in the same group are doing.

byte range locks are designed to keep out "other people". The problem
is that threads define "other people" as other processes, so they
don't keep out other threads in the same process.
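
A hypothetical two-thread demo makes the failure obvious (my
illustration of the semantics, not code from Samba):

 /* hypothetical demo: POSIX byte-range locks are owned by the
    process, so a second thread in the same process walks straight
    through an "exclusive" lock held by the first */
 #include <fcntl.h>
 #include <pthread.h>
 #include <stdio.h>
 #include <unistd.h>

 static int fd;

 static void *locker(void *arg)
 {
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 1 };
	(void)arg;
	/* succeeds even though the main thread already holds the lock */
	if (fcntl(fd, F_SETLK, &fl) == 0)
		printf("second thread got the exclusive lock too\n");
	return NULL;
 }

 int main(void)
 {
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 1 };
	pthread_t t;

	fd = open("lockfile", O_RDWR | O_CREAT, 0600);
	fcntl(fd, F_SETLK, &fl);	/* main thread takes the lock */

	pthread_create(&t, NULL, locker, NULL);
	pthread_join(t, NULL);
	return 0;
 }

A second *process* trying the same F_SETLK would fail with EAGAIN, as
intended.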


ok, now to a caveat. There are some circumstances where threads are
faster than processes for file sharing. If threads allow you to pack
the representation of your VM more tightly into the hardware than
processes do, and if you are using sufficiently large amounts of RAM
per task that running out of space (TLB size etc) is a problem, then
the cost of taking hardware traps to shuffle your VM mappings around
can make processes slower than threads.

So if Samba used hundreds of MB per task, or was running on a really
poorly designed CPU, then it's possible that the increased hardware
sharing of VM tables that threads allow would gain you enough to
offset the other costs.

But please don't latch onto this and say "there, see, threads are
faster". For reasonable sizes on common CPUs they aren't. I'm merely
pointing out the exception to this rule.

Cheers, Tridge
