[clug] Dual core or dual processor?

Andrew Over andrew.over at cs.anu.edu.au
Mon Nov 27 03:36:05 GMT 2006


On Mon, Nov 27, 2006 at 11:24:13AM +1100, Hugh Fisher wrote:

[snip]

> The plan is to have two main threads. One will be the
> 3D scene graph renderer, which even with modern 3D
> graphics cards means a lot of floating point crunching
> for high level culling, head tracking, morphing, and
> whatever. The other is a Python thread which creates
> and controls the 3D scene, running high level logic
> and 'game AI' type code such as flocking.
> 
> Each thread has a few megabytes of code and can have
> tens of megabytes of data. They are threads rather
> than processes because the scene graph and the Python
> objects have lots of references into each other, so
> share address space. Every time a frame is rendered the
> scene graph code gets rendering parameters from Python
> code, and Python code changes the scene graph.

So even though the two threads are (computationally) very different,
there's a great deal of data sharing?

What's the nature of the typical shared data?  Is the majority of access
to it read-only, or are there large chunks which are modified on one core
and then read on the other?

To be totally honest, if you have access to both a dual processor and a
similarly configured dual core system, I'd suggest benchmarking your
actual workload, as anything short of that is little more than
hand-waving.
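
If you want a quick synthetic feel for the two boxes before committing
to that, something like the following will show the cost of write-sharing
a cache line (a rough, untested sketch assuming gcc's __sync builtins and
pthreads; build with "gcc -O2 -pthread", run once with shared counters
and once with private ones):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define ITERS 50000000

    /* Two counters: with shared=1 both threads hammer the same cache
     * line; with shared=0 each thread gets a line of its own. */
    static volatile long counters[2][16];
    static int shared = 1;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        volatile long *c = shared ? &counters[0][0] : &counters[id][0];
        long i;

        for (i = 0; i < ITERS; i++)
            __sync_fetch_and_add(c, 1); /* atomic RMW forces line ownership */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t t[2];
        struct timeval t0, t1;
        double secs;
        long id;

        if (argc > 1)
            shared = atoi(argv[1]);     /* "./a.out 0" -> private counters */

        gettimeofday(&t0, NULL);
        for (id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%s counters: %.2f s\n", shared ? "shared" : "private", secs);
        return 0;
    }

The gap between the shared and private runs gives you a rough feel for
how expensive coherence traffic is on a particular box, which is exactly
where the architectures below differ.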

Off the top of my head, I can think of several different architectural
scenarios:

- Dual core system, shared L2, single memory controller (ie Core2)

  This arrangement (typically) has the advantage of a large L2.
  Furthermore, relatively inexpensive sharing of data is possible (the
  foreign L1 need only be invalidated/updated on a shared write, meaning
  an L2 access is needed rather than a bus access).  Depending on the
  access patterns of each core, the cache space will be partitioned
  between them according to demand.  On the down side, there is only a
  single memory controller for two cores.  The memory controller may be
  either on or off die (it makes no difference to this analysis).

  This arrangement is probably superior for workloads with a large
  number of shared writes, but may not stack up so well if memory
  bandwidth is a concern.
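
  To make "shared write" concrete for your setup: every field the Python
  thread scribbles into and the renderer then reads each frame is a cache
  line that has to migrate between the two cores.  One common way to keep
  that traffic down is to double-buffer the per-frame parameters so that
  only a pointer swap is write-shared; purely an illustrative sketch (the
  struct and function names are invented, nothing to do with your actual
  code):

    #include <pthread.h>
    #include <string.h>

    /* Hypothetical per-frame parameters, written by the control (Python)
     * thread and read by the renderer. */
    struct frame_params {
        float view[16];
        float morph_weight;
    };

    static struct frame_params bufs[2];
    static struct frame_params *current = &bufs[0]; /* renderer reads this */
    static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Control thread: fill the idle buffer privately, then publish it.
     * Only the pointer (one line) bounces between cores per frame. */
    void publish_params(const struct frame_params *p)
    {
        struct frame_params *idle =
            (current == &bufs[0]) ? &bufs[1] : &bufs[0];

        memcpy(idle, p, sizeof(*idle));
        pthread_mutex_lock(&swap_lock);
        current = idle;
        pthread_mutex_unlock(&swap_lock);
    }

    /* Renderer: take a private snapshot once per frame. */
    void snapshot_params(struct frame_params *out)
    {
        pthread_mutex_lock(&swap_lock);
        memcpy(out, current, sizeof(*out));
        pthread_mutex_unlock(&swap_lock);
    }

  On a shared L2 even the naive "write it in place" approach isn't too
  painful, since the invalidate is satisfied from L2; on the split cache
  and dual processor arrangements below it costs progressively more.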


- Dual core system, split L2, single memory controller (Athlon X2?)

  I believe this arrangement corresponds to the Athlon/Opteron X2 (ie
  slam two chips plus an interconnect on a die).

  Shared writes are slightly more expensive than the previous case as
  crossbar traffic is now required to invalidate the foreign core
  (rather than a simple L2 invalidate or L1 update).  The split L2
  guarantees each core its own L2, which may be better or worse than a
  shared L2 depending on the workload.  On the other hand, private
  caches simplify the cache access logic, and probably result in a
  slightly lower L2 hit time.

  I suspect this arrangement is slightly inferior to the previous case
  with all other factors equal (NB: I am not claiming Athlon X2 beats
  Core2...  all other factors are NOT equal in that comparison).
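
  A related gotcha while we're on invalidation traffic: data doesn't have
  to be logically shared to generate it.  If two threads' "private"
  counters happen to land on the same cache line you get exactly the same
  ping-pong (false sharing).  The usual fix is to pad per-thread state out
  to a full line; a sketch (gcc-specific attribute, and the 64 byte line
  size is an assumption, so check your CPU):

    #define CACHE_LINE 64   /* assumed line size */

    /* Bad: stats[0] and stats[1] share a cache line, so two threads
     * updating "their own" slot still invalidate each other constantly.
     * (The struct and field names are made up for illustration.) */
    struct stats_bad {
        long frames_rendered;
        long objects_culled;
    };
    struct stats_bad stats[2];

    /* Better: pad each thread's slot out to its own line. */
    struct stats_good {
        long frames_rendered;
        long objects_culled;
        char pad[CACHE_LINE - 2 * sizeof(long)];
    };
    struct stats_good padded_stats[2] __attribute__((aligned(CACHE_LINE)));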


- Dual processor, single off-chip memory controller (dual Intel)

  This corresponds to dual Intel chips with the memory controller in the
  northbridge.  Similar to the scenario immediately above.  Shared
  writes are very expensive as an interconnect transaction is involved
  rather than simply an on-chip crossbar transaction.  This arrangement
  is unlikely to be faster than a dual core with an off-chip memory
  controller in any scenario I can think of.  The only potential win is
  that using two dies allows more aggregate on-chip L2.

  If your application has a working set which exceeds the cache size
  available on a dual core, and you can find a pair of processors with a
  higher aggregate cache size, this may work.  A straight comparison
  between a dual processor and a dual core system with equivalent
  architectures will almost certainly show the dual core to be superior.
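
  If you're not sure how big the working set really is, a crude way to
  find the cache cliffs on a given machine is to walk progressively
  larger arrays and watch the cost per access jump (single threaded
  sketch, assumes 8 byte longs and 64 byte lines, build with -O2;
  hardware prefetch will flatter a sequential walk like this, so the real
  cliffs are sharper than it suggests):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t size, i;

        for (size = 16 * 1024; size <= 32 * 1024 * 1024; size *= 2) {
            size_t n = size / sizeof(long);
            long *a = malloc(size);
            volatile long sum = 0;
            struct timeval t0, t1;
            long passes, p;
            double ns;

            if (!a)
                return 1;
            for (i = 0; i < n; i++)
                a[i] = i;

            passes = (256L * 1024 * 1024) / size; /* ~256MB touched per size */

            gettimeofday(&t0, NULL);
            for (p = 0; p < passes; p++)
                for (i = 0; i < n; i += 8)        /* one access per 64B line */
                    sum += a[i];
            gettimeofday(&t1, NULL);

            ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                  (t1.tv_usec - t0.tv_usec) * 1e3) / (passes * (n / 8));
            printf("%8lu KB: %6.1f ns per line\n",
                   (unsigned long)(size / 1024), ns);
            free(a);
        }
        return 0;
    }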


- Dual processor, dual memory controller (dual AMD)

  This corresponds to two Athlon chips, each with an integrated memory
  controller.  While this configuration suffers from the need to use an
  interconnect (and thus relatively high latency shared writes), it does
  offer double the memory bandwidth.

  If either the hardware (node interleaving) or the operating system
  interleaves the data in some manner, it is possible to take advantage
  of the dual memory controllers by operating them simultaneously.  This
  arrangement may also introduce a small degree of NUMA behaviour
  (although IIRC it's a pretty minor effect on Athlon/Opteron systems).

  In short, this configuration is a win if you're starved for memory
  bandwidth.  If not, it's not necessarily any better than the single
  memory controller configurations.
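
  The easiest way to tell whether you are is to see whether streaming
  bandwidth actually scales when you add a second thread.  A very crude
  sketch (each thread sums its own large private array, so there's no
  sharing at all; the initialisation pass is counted in the time but not
  in the MB figure, so treat the absolute number with some suspicion and
  look at the ratio between the one and two thread runs):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define MB          (1024 * 1024)
    #define ARRAY_BYTES (256 * MB)    /* well beyond any cache */
    #define PASSES      8

    static void *streamer(void *arg)
    {
        long n = ARRAY_BYTES / sizeof(long);
        long *a = malloc(ARRAY_BYTES);
        volatile long sum = 0;
        long i, p;

        (void)arg;
        if (!a)
            return NULL;
        for (i = 0; i < n; i++)
            a[i] = i;
        for (p = 0; p < PASSES; p++)
            for (i = 0; i < n; i++)
                sum += a[i];
        free(a);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t t[2];
        struct timeval t0, t1;
        double secs, mb;
        int i, nthreads = 2;

        if (argc > 1)
            nthreads = atoi(argv[1]);  /* run with 1, then with 2 */
        if (nthreads < 1 || nthreads > 2)
            nthreads = 2;

        gettimeofday(&t0, NULL);
        for (i = 0; i < nthreads; i++)
            pthread_create(&t[i], NULL, streamer, NULL);
        for (i = 0; i < nthreads; i++)
            pthread_join(t[i], NULL);
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        mb = (double)nthreads * PASSES * ARRAY_BYTES / MB;
        printf("%d thread(s): %.0f MB read in %.2f s (%.0f MB/s)\n",
               nthreads, mb, secs, mb / secs);
        return 0;
    }

  If the MB/s figure roughly doubles going from one thread to two on the
  dual Opteron but not on the single memory controller boxes, that's your
  answer.  You can also experiment with where the memory actually lands
  (numactl will let you bind or interleave allocations, if your distro
  ships it), but the scaling test tells you most of what you need to know.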


I guess what this long ramble boils down to is a few key questions:

- To what extent do writes to shared areas factor into your application?
- Does your application perform better with a shared L2 or with split
  L2s? (given the use of threads, I'd suspect the former)
- Are you starved for memory bandwidth?

Sadly I'm unable to do much more than hand-wave (although my suspicion is
that you want either a dual core with a shared L2, or a dual processor,
dual memory controller arrangement).  The answer to computer architecture
questions is almost always "it depends".

Hope (some of this) helps,
--Andrew


