Thread performance (was Re: dynamic context transitions)

Christopher R. Hertel crh at ubiqx.mn.org
Sun Dec 5 01:03:25 GMT 2004


Tridge,

We basically agree, but I think we're coming at this from different 
perspectives.

A process, in very general terms, is really a type of thread, so it
doesn't make much sense to me to talk about which is "inherently"
faster.  It makes more sense to talk about which aspects of
thread/process management are faster or slower, and why: things like
separate contexts, locking, reentrant libraries and such.  We've
essentially been doing that, but have unfortunately been using semantics
that imply that one view is right and another wrong.  I don't think
that's the case, since I don't think I'm disagreeing with you.

I would certainly argue that I'm not "clinging to the notion that threads
are somehow inherently faster than processes".  I don't believe I am since
I'm not really looking at that aspect of the question.  I'm looking at the
_why_ aspect, which you did a very good job of explaining.  I am trying to
identify, for myself, the ways in which the models are different and how
that impacts each.  That's really the same thing you're doing.

The point you seem to be focusing on is that Unix threads, because they
don't have a coarse-grained separate context for protection, must rely
on fine-grained protections such as locking of structures.  I agree, at
the base level, but I also believe that there are ways to reduce the
amount of locking required and so reduce the impact of those
fine-grained mechanisms.  That's an OS design issue, though, not a
practical issue where Samba is concerned.

I think I've said in all of my posts that I wasn't sure about the impact
of context switching.  The textbooks I learned from all emphasize that
aspect, but they are also out of date, and I figure that things like
large CPU cache sizes would reduce context switch times to a bare
minimum.  You've confirmed that.  That makes the process context switch
a wash in comparison with switching threads.

I don't want to argue with you here, since I do think we're on the same
track, but there are a couple of things I'd like to point out.  Here's one
exchange:

>  > By the way, you may have noticed that all of the ubiqx binary tree stuff
>  > is written so that it can be called reentrantly.  The trees themselves
>  > require some form of locking mechanism to protect the structure of the
>  > tree.  My first implementation used Amiga semaphores.  :)
> 
> and that locking mechanism costs you. It might not cost you very much
> if done well, but it does cost you. With processes that code
> disappears completely. You can't go faster than zero code.

...but you missed something.

The semaphores are only used if the threads are sharing the data
structure.  You would also need to do some sort of locking if multiple
processes were sharing the data structure.  Separate contexts are not an
inherent win here, since we're still talking about shared memory.
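
To make this concrete, here's a quick sketch (mine, hypothetical, not
anything from Samba or the ubiqx code) of two processes updating a
shared counter through a pthread mutex placed in a mmap()'d region.
The PTHREAD_PROCESS_SHARED attribute is standard POSIX; error handling
is omitted to keep it short:

    /* Sketch: processes sharing memory still need locks. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct shared {
        pthread_mutex_t lock;
        long counter;
    };

    int main(void)
    {
        /* Anonymous shared mapping; still visible to the child
         * after fork(). */
        struct shared *s = mmap(NULL, sizeof(*s),
                                PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&s->lock, &attr);

        if (fork() == 0) {                    /* child process */
            for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&s->lock); /* same cost a thread
                                               * would pay */
                s->counter++;
                pthread_mutex_unlock(&s->lock);
            }
            _exit(0);
        }
        for (int i = 0; i < 100000; i++) {    /* parent process */
            pthread_mutex_lock(&s->lock);
            s->counter++;
            pthread_mutex_unlock(&s->lock);
        }
        wait(NULL);
        printf("counter = %ld\n", s->counter);   /* 200000 */
        return 0;
    }

Take out the mutex calls and the counter comes up short--separate
address spaces don't save you once the memory itself is shared.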

My point is twofold:

First, I am not disagreeing with you, as you seem to imply.  Given the
thread and process models we're working with, and the underlying
assumptions of the OSes we're considering, you're absolutely right that
processes will be faster (particularly when running a system such as
Samba, which has its own requirements as you have outlined).  Also, your
reasoning as to why this is the case seems, to me, to be correct.

Second, while I don't believe that I am holding on to too many
assumptions, I am interested in the roots of those assumptions.  I'm
also interested to know how the two models differ and how that impacts
different kinds of performance.  I believe that it is possible to build
a system in which you would have very fast processes, and faster
threads.  The assumption underlying this belief is that it would be
possible to drastically reduce the amount of locking that the system
itself would have to perform for you (without sacrificing integrity),
and to write threaded code that would avoid the need to lock anything
(each thread managing its own data).  I am also assuming that the
definitions of "process" and "thread" are somewhat variable, with the
only constraint being that processes are proper supersets of threads in
some way.  (A valid assumption, given all of the different threading and
process models I've seen over the years--get this, in AmigaOS a
"process" was just a thread that had access to the file I/O subsystem.)
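
As a quick illustration of the "each thread managing its own data" idea,
here's a sketch (mine, hypothetical) using POSIX thread-specific data.
No mutex is ever taken, because no thread can reach another thread's
copy:

    /* Sketch: lock-free by design--each worker owns its data. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_key_t my_data;   /* one slot, one value per thread */

    static void *worker(void *arg)
    {
        long count = 0;                     /* private to this thread */
        pthread_setspecific(my_data, &count);

        for (int i = 0; i < 1000000; i++) {
            long *c = pthread_getspecific(my_data);
            (*c)++;                         /* no lock, no contention */
        }
        printf("worker done: %ld\n", count);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        pthread_key_create(&my_data, NULL);
        for (int i = 0; i < 4; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        pthread_key_delete(my_data);
        return 0;
    }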

Remember that I am, at heart, a theory guy.  :)

I'll hit one more...

>  > Threads are also supposed to be faster at IPC, but the added overhead of 
>  > the extra locking would easily offset that advantage.
> 
> another myth. Try thinking about this. How do threads do their IPC?
> They use the same system calls and mechanisms that are available to
> processes. 

There are many ways for threads to do their IPC, and I have worked on
systems in the past that use very different mechanisms for threads.  
Remind me to tell you about TaOS and Intent some day.  :)

You might, for example, have a call within a threaded architecture that
prevents other threads within the same group from being scheduled.  That
way you could, for example, be sure that your attempt to remove a node
from a linked list would complete before any other thread within the
same group had any run-time.  That would be the case even if another
process took control of the CPU.  In a process environment, however, you
might prefer to use semaphores, which would have to be checked manually
by other processes wanting to access the data structure.
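
The semaphore version might look something like this sketch (the names
and structure are mine, purely for illustration).  The unlink is only
safe because every task agrees to take list_lock first; nothing stops
code that skips it:

    /* Sketch: cooperative locking around a shared linked list. */
    #include <pthread.h>
    #include <stddef.h>

    struct node {
        struct node *next;
        int value;
    };

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Unlink and return the first node matching value, or NULL. */
    struct node *list_remove(struct node **head, int value)
    {
        pthread_mutex_lock(&list_lock);
        struct node **pp = head, *found = NULL;
        while (*pp != NULL) {
            if ((*pp)->value == value) {
                found = *pp;
                *pp = found->next;    /* unlinked atomically with
                                       * respect to other lockers */
                found->next = NULL;
                break;
            }
            pp = &(*pp)->next;
        }
        pthread_mutex_unlock(&list_lock);
        return found;
    }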

These are all design decisions, theoretical at best, and go way beyond
considerations of speed in Samba.  From a practical standpoint there isn't
really anything to discuss because processes are *empirically* faster than
threads on the systems that we care about.  It doesn't even matter why.

I just like to grok things...  :)

Hope that makes it all a little clearer.

Chris -)-----

On Sun, Dec 05, 2004 at 09:19:55AM +1100, tridge at samba.org wrote:
> Chris,
> 
>  > From what you say further on it seems that real threads are an
>  > afterthought in the Unix world.  I used to program with threads on Amigas,
>  > where even if the system had an MMU (the early ones didn't) the OS didn't
>  > use it.  All of the system libraries were written to be thread-safe, which
>  > meant that the threading was very fast since locking and contention were
>  > minimized by design.
> 
> no, you're still clinging to the notion that threads are somehow
> inherently faster than processes. They aren't. They are inherently
> slower, no matter what OS you are talking about.
> 
> Some OSes might implement processes so badly that threads come out
> ahead. It is fundamental computer science that doing operations in a
> threaded environment will be slower than doing operations in an
> equivalent process based environment, because they have to do more
> work.
> 
> Using processes allows you to take advantage of a hardware memory
> protection system. Using threads doesn't.
> 
>  > I have worked with threading libraries on BSD systems which created
>  > multiple threads within a single process.  As with some of the (very) old
>  > Java implementations, the side effect was that a single thread calling a 
>  > blocking function would cause the entire process (all threads) to block.
> 
> yes, that's the old Java greenthreads system. I'm painfully familiar
> with that, having ported it to Sparc Linux many years ago. It's an
> ugly hack that implements a user space scheduler on top of a
> coroutines package. It is not worth considering as a serious threading
> system.
> 
>  > What I've heard from you and from Dave CB is that threading has been
>  > re-implemented on Unixy systems such that each thread is now scheduled in 
>  > the same way as a process.  As you say...
> 
> yes, and fundamentally you _have_ to do this on any sane system.
> 
>  > That is the interesting bit.  From your description, it seems to me
>  > (remembering writing multi-threaded applications on non-MMU systems that
>  > had system calls that were designed to be thread-safe) that this kind of
>  > threading has been back-fitted into Unix/posix, making it necessary to add
>  > extra controls around a variety of calls that could have been made
>  > reentrant or at least thread-safe from the get-go had that been in the
>  > original plans.
> 
> no, you are still missing the point.
> 
> You can use good data structures to lower the cost of the locks, but
> you cannot get it to zero. With processes the hardware takes care of
> it, so the cost is zero. It has nothing to do with retrofitting of
> APIs; it's just a hardware-accelerated versus software problem.
> 
>  > By the way, you may have noticed that all of the ubiqx binary tree stuff
>  > is written so that it can be called reentrantly.  The trees themselves
>  > require some form of locking mechanism to protect the structure of the
>  > tree.  My first implementation used Amiga semaphores.  :)
> 
> and that locking mechanism costs you. It might not cost you very much
> if done well, but it does cost you. With processes that code
> disappears completely. You can't go faster than zero code.
> 
>  > It's supposed to take less time to switch between threads than it does to
>  > switch between process contexts.  That's one of the things that is
>  > supposed to make threads 'faster'.  Given the speed of MMUs I'm not sure
>  > how much is gained by avoiding the memory context switch.
> 
> This is another part of the myth. Have you thought about the orders of
> magnitude here? With process switching on a modern CPU you basically
> have to swap one more register. That's one extra instruction. Modern
> CPUs have nanosecond cycle times.
> 
> Now, some CPUs also need to do an extra tlb flush or equivalent, but
> even that is cheap on all but the worst CPUs.
> 
> Compare this to the work that a file server has to do in responding to
> a packet. Say it's an SMBread of 4k. That is a 4k copy of memory. Memory
> is slow. Typically that SMBread will take tens of thousands of times
> longer than the context switch time.
> 
> But by saving that nanosecond you will make the read() system call
> slower! Why? Because in the kernel the file descriptor number needs to
> be mapped from an integer to a pointer to a structure. That means
> looking up a table. That table needs to be locked. If you have 100
> threads doing this then they all lock the _same_ structure, so you get
> contention, and suddenly your 16 cpu system is not scaling any
> more. With processes they lock _different_ structures, so no
> contention, so better scaling.
> 
> This lock contention can be fixed with some really smart programming
> (like using RCU), and recently that has been done in Linux. That's one
> reason why Linux sucks less than other systems for threads.
> 
>  > Threads are also supposed to be faster at IPC, but the added overhead of 
>  > the extra locking would easily offset that advantage.
> 
> another myth. Try thinking about this. How do threads do their IPC?
> They use the same system calls and mechanisms that are available to
> processes. The difference is that in a threads library these
> mechanisms are wrapped into a nice API that makes it convenient to do
> IPC really easily. You can do exactly the same types of IPC with
> processes, you just need to write a bit more code.
> 
> For many things, perhaps even for some file server applications, that
> extra convenience is worthwhile. Convenience means simpler code which
> means fewer bugs. So I'm not saying to never use threads, I'm just
> trying to kill this persistent meme that says threads are somehow
> faster. That's like believing in the tooth fairy.
> 
>  > It *could*, but only in an OS that was designed for it.
> 
> no, it couldn't be.
> 
>  > I disagree here, but it's a theory vs. practice argument.  Threads *could*
>  > be faster, but you'd have to build an OS that was designed with that in
>  > mind.
> 
> nope
> 
>  > You wouldn't want to sacrifice process speed just to make threads seem
>  > fast.  Instead, threads would cut corners (sacrificing built-in safety,
>  > most likely) to be faster than already-fast processes.
> 
> oh, if you're willing to sacrifice data integrity then sure, all bets
> are off. In that case just do this:
> 
>  /* "reads" nothing but claims full success -- fast, and wrong */
>  ssize_t read(int fd, void *buf, size_t size) {
> 	return size;
>  }
> 
> hey, it will be _really_ fast. The sad thing is that many benchmarks
> will even run correctly.
> 
>  > !?  Wow.  I would have expected otherwise.  Since the threads are in the 
>  > same context, I would have expected that they would be able to "see" 
>  > everything that other threads in the same group are doing.
> 
> byte range locks are designed to keep out "other people". The problem
> is that threads define "other people" as other processes, so they
> don't keep out other threads in the same process.
> 
> 
> ok, now to a caveat. There are some circumstances where threads are
> faster than processes for file sharing. If threads allow you to pack
> the representation of your vm into the hardware more tightly than
> processes do, and if you are using sufficiently large amounts of ram
> per task that running out of space (tlb size etc) is a problem, then
> the cost of taking hardware traps to shuffle around your vm mappings
> with processes can slow down processes more than threads.
> 
> So if Samba used hundreds of MB per task, or was running on a really
> poorly designed CPU, then it's possible that the increased sharing of
> vm tables in hardware that threads allow will gain you enough to
> offset the other costs.
> 
> But please don't latch onto this and say "there, see, threads are
> faster". For reasonable sizes on common CPUs they aren't. I'm merely
> pointing out the exception to this rule.
> 
> Cheers, Tridge

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org

