diglloyd Mac Performance Guide


Mac Pro Nehalem: Dual vs Single CPU

Last updated June 01, 2009

Strange, but true:

With Photoshop CS4, a dual-CPU Mac Pro Nehalem can be slower than a single-CPU model for large files.

A workaround is explained below. Please see Scalability for related material. Behavior can change over time as Mac OS X changes and/or Photoshop is upgraded, so test your own system to be sure.

Note well that future software improvements (e.g., CS5) might change these results, but the finding shows that testing real-world behavior is the right approach.

Very few programs have this downside; it is mostly peculiar to Photoshop.

Graph: dual CPUs are slower than a single CPU (Mac OS X 10.5.6); this changes in 10.5.7.
"1/2 cores" means that half the CPU cores were disabled.

Background notes

This finding applies to Photoshop CS4 11.0.1.

First, a few background notes that will help in the following discussion:

  • Photoshop CS4/Mac is a 32-bit program, which limits it to 3.5GB of memory allocation. Of that, ~3GB can be used by Photoshop; the rest is overhead for code, plugins, etc.
  • CPU cores are the hardware workers involved in computing; these correspond in CS4 to “threads”. CS4 creates 3 threads per CPU core when executing tasks.
  • Each thread requires some memory of its own which reduces the memory available for storing image data and other necessary items. The amount of overhead depends on the program, and for CS4, the overhead is apparently substantial.
  • The more threads, the higher the overhead of coordinating them, but this is likely a minor factor compared to memory usage.

Dual-CPU is slower: why?

This discussion applies when working with file(s) that require the scratch disk. The more the scratch volume is needed, the more distinct the advantage of the single-CPU system.

The graph below shows the single and dual-CPU 2.93GHz Mac Pro Nehalem with different amounts of memory. Observe the following:

  • A minimum of 16GB is required for best performance. Even so, the single-CPU MP09 with only 12GB beats the dual-CPU MP09 with 24GB!
  • A single-CPU is faster than a dual-CPU with either memory configuration.
  • The dual-CPU time drops from 61 seconds to 47 seconds when half its CPU cores are disabled.

How can this possibly be?

The answer is most likely usable memory, but that needs some explanation (see below).

diglloydMedium: single vs dual CPU, Mac OS X 10.5.6

Something has changed in Mac OS X 10.5.7. Disabling half the CPU cores is still slightly faster, but using all 16 cores is now much closer in speed than before. This is presumably a bug fix of some kind in OS X.

diglloydMedium: single vs dual CPU, Mac OS X 10.5.7

Available memory

Photoshop CS4 blindly allocates 3 “threads” per CPU core, counting virtual (hyperthreaded) cores. For a 16-core machine (dual CPU), this means that it’s allocating 48 threads, vs 24 threads for an 8-core machine (single CPU). Each of these threads requires memory of its own. That is our working theory at least.

The memory used by the threads comes out of the limited amount available to Photoshop CS4 (a 32-bit application is limited to 3.5GB absolute max).

The net result is that the memory available for image data is reduced substantially.
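The working theory above can be put into numbers with a minimal sketch. The 3 threads per core and the ~3GB of usable memory come from this article; the per-thread overhead is not known, so the 20MB figure below is purely a hypothetical placeholder chosen for illustration, not a measured value.

```python
# Sketch of the "threads eat usable memory" theory.
# THREADS_PER_CORE and USABLE_MEMORY_MB are taken from the article;
# the per-thread overhead is a HYPOTHETICAL value, not a measurement.

THREADS_PER_CORE = 3          # CS4 creates 3 threads per CPU core
USABLE_MEMORY_MB = 3 * 1024   # ~3GB usable by the 32-bit Photoshop CS4


def thread_count(cores: int) -> int:
    """Number of threads CS4 would create for the given core count."""
    return THREADS_PER_CORE * cores


def memory_for_image_data_mb(cores: int, per_thread_overhead_mb: float) -> float:
    """Memory left for image data after subtracting thread overhead.

    per_thread_overhead_mb is an assumption supplied by the caller.
    """
    return USABLE_MEMORY_MB - thread_count(cores) * per_thread_overhead_mb


if __name__ == "__main__":
    for cores in (8, 16):  # single-CPU vs dual-CPU (virtual cores)
        left = memory_for_image_data_mb(cores, per_thread_overhead_mb=20.0)
        print(f"{cores} cores: {thread_count(cores)} threads, "
              f"{left:.0f} MB left for image data")
```

Whatever the real overhead is, doubling the core count doubles the thread count, so the dual-CPU machine always gives up twice as much of the fixed ~3GB as the single-CPU machine under this model.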

The reduced memory for image data forces Photoshop to use its scratch volume more, which increases processing time substantially. Remember that these times are using an exceptionally fast striped RAID scratch volume.

Available memory is critical when working with large files. The diglloydMedium benchmark ends its run with a 15.7GB scratch file, which far exceeds the ~3GB of usable memory in the 32-bit Photoshop CS4.

The same performance implications lie in wait for anyone working with file(s) that begin to use the scratch disk, so beware!


Exploring the cores

Let’s see what happens when CHUD Tools are used to disable real cores and virtual cores (hyperthreading).

The M/N notation means M real cores and N virtual cores, e.g., 4/8 means 4 real cores and 8 virtual cores.

The graph shows the time to execute diglloydMedium, rounded to the nearest second. Observe that CS4 offers only marginal gains when going beyond 2 real cores / 4 virtual cores; it doesn’t scale.

The perverse result is that with all CPU cores in use, we see the second-worst result, better only than that of a single virtual core. That is a rather poor showing from Photoshop CS4. Let’s hope Adobe does something about this.

diglloydMedium: effect of the number of real/virtual CPU cores on run-time

A kludge workaround

CHUD Tools Processor Palette

This workaround is worth the trouble only if you spend a lot of time in Photoshop working with large files, e.g., those that regularly use the scratch volume.

As an Apple developer, you can download Apple’s CHUD Tools, which are part of the Apple developer toolkit. CHUD Tools allow disabling CPU cores, real and/or virtual.

When working with big files, you can use the Processor Palette to disable half or more of the CPU cores, spread equally across the two physical CPU chips. This drops execution time on dual-CPU MP09 systems by 23%, as shown in the graph.
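The 23% figure follows from the times quoted earlier for the dual-CPU machine: 61 seconds with all cores enabled and 47 seconds with half disabled. A short check of that arithmetic (the function name is just an illustrative choice):

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage by which run time drops going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


if __name__ == "__main__":
    # Times from the article: all cores enabled vs half the cores disabled.
    reduction = percent_reduction(61, 47)
    print(f"{reduction:.0f}% less time")  # prints "23% less time"
```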


What Adobe can do

Adobe can address this issue by not blindly allocating threads for every CPU core. In fact, CS4 does not scale beyond two cores, so one solution is for the CS4 engineers to simply hard-code a limit and ignore the available CPU cores beyond a fixed number.

Another and better solution would be to offer a “max threads” preference.
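A rough sketch of what such a preference could look like follows. It is not Adobe's code, and the names `worker_thread_count` and `max_threads` are hypothetical. The only figure taken from this article is the 3 threads per core; everything else is an assumption for illustration.

```python
import os
from typing import Optional

THREADS_PER_CORE = 3  # from the article: CS4 creates 3 threads per core


def worker_thread_count(cores: int, max_threads: Optional[int] = None) -> int:
    """Threads to create, optionally capped by a user preference.

    max_threads is a HYPOTHETICAL preference; None means "no cap",
    which reproduces the blind 3-per-core allocation.
    """
    uncapped = THREADS_PER_CORE * cores
    if max_threads is None:
        return uncapped
    # Never exceed the cap, and never drop below one worker thread.
    return max(1, min(uncapped, max_threads))


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    print("uncapped:", worker_thread_count(cores))
    print("capped at 12:", worker_thread_count(cores, max_threads=12))
```

With a cap of 12, a 16-core machine would create 12 threads instead of 48, while a 2-core machine would be unaffected (6 threads either way). The hard-coded limit mentioned above is the same idea with the cap fixed by the developer rather than chosen by the user.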

But of course the best solution is to rewrite the aging code base to use 16 cores efficiently. The chances of that happening in CS4 seem slim. The real fix is likely to come only with a 64-bit Photoshop CS5, which will be able to access as much memory as is installed in the machine. Adobe will also have to fix the internal bottlenecks which currently keep it from scaling more than a pittance beyond two cores.

Not all bad news

In spite of the poor CPU utilization on common operations, some operations do utilize multiple cores, though scalability remains well below optimal.

Good scalability would ideally yield about 1/16 the time for 16 virtual cores vs 1 core, but there is some overhead even for well-written programs, so a ratio of 12:1 or better is a more realistic expectation.

For the Surface Blur filter, tests show near-perfect scalability from 2 cores to 4 cores; the time is almost exactly halved. Beyond that, the additional cores help considerably, but we don’t see 10 seconds with 16 cores (vs 80 seconds for 2 cores). Instead, we see 17 seconds for 16 cores. That is not bad, but it is about 70% longer than perfect scalability. A figure in the 12-14 second range would be quite respectable.
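The Surface Blur numbers can be checked with a few lines. The times (80 seconds for 2 cores, 17 seconds for 16 cores) are from this article. The function names are illustrative, and "perfect scalability" is taken to mean that run time is inversely proportional to the number of cores.

```python
def ideal_time(base_time_s: float, base_cores: int, cores: int) -> float:
    """Run time under perfect scaling, given a measured baseline."""
    return base_time_s * base_cores / cores


def percent_over_ideal(ideal_s: float, observed_s: float) -> float:
    """How much longer the observed time is than the ideal, in percent."""
    return (observed_s / ideal_s - 1.0) * 100.0


if __name__ == "__main__":
    ideal_16 = ideal_time(80, base_cores=2, cores=16)  # 80 * 2 / 16 = 10 s
    over = percent_over_ideal(ideal_16, observed_s=17)
    efficiency = ideal_16 / 17  # fraction of perfect scaling achieved
    print(f"ideal: {ideal_16:.0f} s, observed: 17 s, "
          f"{over:.0f}% longer, efficiency {efficiency:.0%}")
```

By the same measure, the "quite respectable" 12-14 second range would correspond to roughly 71% to 83% of perfect scaling from the 2-core baseline.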

Surface Blur: more cores help a lot
Gray bars denote best-case scalability.


Photoshop CS4 needs work. Its threading behavior is self-defeating, making a single-CPU system notably faster than a dual-CPU system for large files.


Copyright © 2008-2016 diglloyd Inc, all rights reserved.