[clug] Dual core or dual processor?
Andrew Over
andrew.over at cs.anu.edu.au
Mon Nov 27 03:36:05 GMT 2006
On Mon, Nov 27, 2006 at 11:24:13AM +1100, Hugh Fisher wrote:
[snip]
> The plan is to have two main threads. One will be the
> 3D scene graph renderer, which even with modern 3D
> graphics cards means a lot of floating point crunching
> for high level culling, head tracking, morphing, and
> whatever. The other is a Python thread which creates
> and controls the 3D scene, running high level logic
> and 'game AI' type code such as flocking.
>
> Each thread has a few megabytes of code and can have
> tens of megabytes of data. They are threads rather
> than processes because the scene graph and the Python
> objects have lots of references into each other, so
> share address space. Every time a frame is rendered the
> scene graph code gets rendering parameters from Python
> code, and Python code changes the scene graph.
So even though the two threads are (computationally) very different,
there's a great deal of data sharing?
What's the nature of the typical shared data (ie is the majority of
access to shared data read-only, or are there large chunks which are
modified on one core and read on the other)?
To be totally honest, if you have access to both a dual processor and a
similarly configured dual core system, I'd suggest benchmarking things,
as it's not possible to do much beyond hand-waving.
Off the top of my head, I can think of several different architectural
scenarios:
- Dual core system, shared L2, single memory controller (ie Core2)
This arrangement (typically) has the advantage of a large L2.
Furthermore, relatively inexpensive sharing of data is possible (the
foreign L1 need only be invalidated/updated on a shared write, meaning
an L2 access is needed rather than a bus access). Depending on the
access patterns of each core, the cache space will be partitioned
between them according to demand. On the down side, there is only a
single memory controller for two cores. The memory controller may be
either on or off die (it makes no difference to this analysis).
This arrangement is probably superior for workloads with a large
number of shared writes, but may not stack up so well if memory
bandwidth is a concern.
- Dual core system, split L2, single memory controller (Athlonx2?)
I believe this arrangement corresponds to the Athlon/Opteron X2 (ie
slam two chips plus an interconnect on a die).
Shared writes are slightly more expensive than the previous case as
crossbar traffic is now required to invalidate the foreign core
(rather than a simple L2 invalidate or L1 update). The split L2
guarantees each core its own L2, which may be better or worse than a
shared L2 depending on the workload. On the other hand, private
caches simplify the cache access logic, and probably result in a
slightly lower L2 hit time.
I suspect this arrangement is slightly inferior to the previous case
with all other factors equal (NB: I am not claiming AthlonX2 beats
Core2... all other factors are NOT equal in that comparison).
- Dual processor, single off-chip memory controller (dual Intel)
This corresponds to dual Intel chips with the memory controller in the
northbridge, and is similar to the scenario immediately above. Shared
writes are very expensive as an interconnect transaction is involved
rather than simply an on-chip crossbar transaction. This arrangement
is unlikely to be faster than a dual core with an off-chip memory
controller in any scenario I can think of. The only potential win is
that the use of two dies allows more total on-chip L2.
If your application has a working set which exceeds the cache size
available on a dual core, and you can find a pair of processors with a
higher aggregate cache size, this may work. A straight comparison
between a dual processor and a dual core system with equivalent
architectures will almost certainly show the dual core to be superior.
- Dual processor, dual memory controller (dual AMD)
This corresponds to two athlon chips, each with an integrated memory
controller. While this configuration suffers from the need to use an
interconnect (and thus relatively high-latency shared writes), it does
offer double the memory bandwidth.
If either the architecture or the operating system interleaves the
data in some manner, it is possible to take advantage of the dual
memory controllers by operating them simultaneously. This arrangement
may also introduce a small degree of NUMA behaviour (although IIRC
it's a pretty minor effect on Athlon/Opteron systems).
In short, this configuration is a win if you're starved for memory
bandwidth. If not, it's not necessarily any better than the single
memory controller configurations.
I guess what this long ramble boils down to is a few key questions:
- To what extent do writes to shared areas factor into your application?
- Does your application perform better with a shared L2 or with split
L2s? (given the use of threads, I'd suspect the former)
- Are you starved for memory bandwidth?
Sadly I'm unable to do more than just hand-waving (although my suspicion
is that you either want a dual core shared L2 or a dual processor dual
memory controller arrangement). The answer to computer architecture
questions is almost always "it depends".
Hope (some of this) helps,
--Andrew