Mac Pro Nehalem: Dual vs Single CPU
Strange, but true—
With Photoshop CS4, a dual-CPU Mac Pro Nehalem can be slower than a single-CPU model for large files.
Future software improvements (e.g. CS5) might change these results, but the finding shows that testing real-world behavior is the right approach.
Very few programs have this downside; it is mostly peculiar to Photoshop.
This finding applies to Photoshop CS4 11.0.1.
First, a few background notes that will be helpful in the following discussion:
- Photoshop CS4/Mac is a 32-bit program, which limits it to 3.5GB of memory allocation. Of that, ~3GB can be used by Photoshop; the rest is overhead for code, plugins, etc.
- CPU cores are the hardware workers involved in computing; these correspond in CS4 to “threads”. CS4 creates 3 threads per CPU core when executing tasks.
- Each thread requires some memory of its own which reduces the memory available for storing image data and other necessary items. The amount of overhead depends on the program, and for CS4, the overhead is apparently substantial.
- The more threads, the higher the overhead of coordinating them, but this is likely a minor factor compared to memory usage.
Dual-CPU is slower: why?
This discussion applies when working with files that require the scratch disk; the more the scratch volume is needed, the more distinct the advantage of the single-CPU system.
The graph below shows the single and dual-CPU 2.93GHz Mac Pro Nehalem with different amounts of memory. Observe the following:
- A minimum of 16GB is required for best performance. Even so, the single-CPU MP09 with only 12GB beats the dual-CPU MP09 with 24GB!
- A single-CPU is faster than a dual-CPU with either memory configuration.
- The dual-CPU time drops from 61 seconds to 47 seconds when half its CPU cores are disabled.
How can this possibly be?
The answer is most likely usable memory, but that needs some explanation; see below.
Something changed in Mac OS X 10.5.7. Disabling half the CPU cores is still slightly faster, but using all 16 cores is now much closer in speed than before, presumably because of a bug fix of some kind in OS X.
Photoshop CS4 blindly allocates 3 “threads” per CPU core. For a 16-core machine (dual CPU), this means that it’s allocating 48 threads, vs 24 threads for an 8-core machine (single CPU). Each of these threads requires memory of its own. That is our working theory at least.
The memory used by the threads comes out of the limited amount available to Photoshop CS4 (a 32-bit application is limited to 3.5GB absolute max).
The net result is that the memory available for image data is reduced substantially.
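The working theory above can be made concrete with a little arithmetic. This is only a sketch: the 3 threads per core and the ~3GB usable figure come from the discussion above, but the per-thread overhead (`PER_THREAD_MB`) is a hypothetical placeholder, since the real CS4 figure is not known.

```python
# Illustrative arithmetic only: the per-thread overhead below is a made-up
# placeholder, not a measured Photoshop CS4 value.
THREADS_PER_CORE = 3        # per the article: CS4 creates 3 threads per core
USABLE_MB = 3 * 1024        # ~3GB usable by Photoshop out of the 3.5GB limit
PER_THREAD_MB = 16          # HYPOTHETICAL overhead per thread


def thread_count(virtual_cores: int) -> int:
    """Threads CS4 would create for a given number of (virtual) cores."""
    return THREADS_PER_CORE * virtual_cores


def image_memory_mb(virtual_cores: int, per_thread_mb: int = PER_THREAD_MB) -> int:
    """Memory left for image data after subtracting per-thread overhead."""
    return USABLE_MB - thread_count(virtual_cores) * per_thread_mb


for label, cores in (("single CPU", 8), ("dual CPU", 16)):
    print(f"{label}: {thread_count(cores)} threads, "
          f"{image_memory_mb(cores)} MB left for image data")
```

Whatever the true overhead is, the dual-CPU machine pays it for twice as many threads out of the same fixed 32-bit budget; with the placeholder value, that is 384 MB less for image data.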
The reduced memory for image data forces Photoshop to use its scratch volume more, which increases processing time substantially. Remember that these times were measured with an exceptionally fast striped RAID scratch volume.
Available memory is critical when working with large files. The diglloydMedium benchmark ends its run with a 15.7GB scratch file, which far exceeds the ~3GB or so of usable memory in the 32-bit Photoshop CS4.
The same performance implications lie in wait for anyone working with file(s) that begin to use the scratch disk, so beware!
Exploring the cores
Let’s see what happens when Apple’s CHUD tools are used to disable real cores and virtual cores (hyperthreading).
The M/N notation means M real cores and N virtual cores; e.g. 4/8 means 4 real cores and 8 virtual cores.
The perverse result is that with all CPU cores in use, we see the 2nd worst result — better only than that of a single virtual core, a rather poor showing from Photoshop CS4. Let’s hope Adobe does something about this.
This workaround is worth the trouble only if you spend a lot of time in Photoshop working with large files, e.g. those that use the scratch volume regularly.
As an Apple developer, you can download Apple’s CHUD tools, which are part of the Apple developer toolkit. The CHUD tools allow disabling CPU cores, whether real, virtual, or both.
When working with big files, you can use the CPU palette to disable half or more of the CPU cores equally across the two physical CPU chips. This drops execution time on dual-CPU MP09 systems by 23%, as shown in the graph.
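The 23% figure follows directly from the dual-CPU timings quoted earlier (61 seconds with all cores, 47 seconds with half disabled). A minimal check, with a helper name chosen here purely for illustration:

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage by which execution time dropped from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


# Dual-CPU timings from the article: all cores vs. half the cores disabled.
drop = percent_reduction(61, 47)
print(f"Execution time drops by about {drop:.0f}%")
```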
What Adobe can do
Adobe can address this issue by not blindly allocating threads for every CPU core. In fact, CS4 scales very little beyond two cores on common operations, so one solution is for the CS4 engineers to simply hard-code a limit and ignore the available CPU cores beyond a fixed number.
Another and better solution would be to offer a “max threads” preference.
But of course the best solution is to rewrite the aging code base to use 16 cores efficiently. The chances of that happening in CS4 seem slim. The real fix is likely to come only with a 64-bit Photoshop CS5, which will be able to access as much memory as there is installed in the machine. Adobe will also have to fix the internal bottlenecks which currently keep it from scaling more than a pittance beyond two cores.
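The thread cap and “max threads” preference suggested above could look something like the following. This is a hypothetical sketch in Python, not Adobe’s code: `worker_count`, `run_tiles`, and the `max_threads` parameter are invented here for illustration, and only the 3-threads-per-core rule comes from the discussion above.

```python
import os
from concurrent.futures import ThreadPoolExecutor

THREADS_PER_CORE = 3  # the per-core allocation described in the article


def worker_count(cores, max_threads=None):
    """Threads to create: 3 per core, optionally capped by a user preference."""
    wanted = THREADS_PER_CORE * cores
    if max_threads is None:
        return wanted                        # uncapped, "blind" allocation
    return max(1, min(wanted, max_threads))  # honor the cap, never below 1


def run_tiles(tiles, process, max_threads=None):
    """Process items in a thread pool sized by worker_count (hypothetical)."""
    cores = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=worker_count(cores, max_threads)) as pool:
        return list(pool.map(process, tiles))


# A 16-core machine gets 48 threads uncapped, but only 6 with a cap of 6.
print(worker_count(16), worker_count(16, max_threads=6))
```

A preference like this leaves the choice to the user, who can lower the cap when working with large files and raise it for operations that do use many cores.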
Not all bad news
In spite of the poor CPU utilization on common operations, some operations do utilize multiple cores, though scalability remains well below optimal.
Ideal scalability would yield about 1/16 the time for 16 virtual cores vs 1 core, but there is some overhead even for well-written programs, so something around 12:1 is more realistic.
For the filter, tests show near-perfect scalability from 2 cores to 4 cores; the time is almost exactly halved. Beyond that, the additional cores help considerably, but we don’t see 10 seconds with 16 cores (vs 80 seconds for 2 cores). Instead, we see 17 seconds for 16 cores: not bad, but about 70% longer than perfect scalability. A figure in the 12-14 second range would be quite respectable.
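The scaling numbers above can be checked with a few lines of arithmetic. The sketch below uses only the timings quoted (80 seconds on 2 cores, 17 seconds on 16 cores); the function and variable names are illustrative.

```python
def ideal_time(base_time_s: float, base_cores: int, cores: int) -> float:
    """Time expected on `cores` if scaling from `base_cores` were perfect."""
    return base_time_s * base_cores / cores


BASE_CORES, BASE_TIME_S = 2, 80.0   # measured: 80 seconds on 2 cores
CORES, MEASURED_S = 16, 17.0        # measured: 17 seconds on 16 cores

ideal_s = ideal_time(BASE_TIME_S, BASE_CORES, CORES)
excess_pct = (MEASURED_S / ideal_s - 1) * 100   # how much longer than ideal
speedup = BASE_TIME_S / MEASURED_S              # actual speedup vs 2 cores
efficiency = speedup / (CORES / BASE_CORES)     # fraction of the ideal 8x

print(f"ideal: {ideal_s:.0f}s, measured: {MEASURED_S:.0f}s "
      f"({excess_pct:.0f}% longer than ideal)")
print(f"speedup vs 2 cores: {speedup:.1f}x of a possible "
      f"{CORES // BASE_CORES}x ({efficiency:.0%} efficiency)")
```

Going from 2 to 16 cores offers at most an 8x speedup; the measured result is roughly 4.7x.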
Photoshop CS4 needs work. Its threading behavior is self-defeating, making a single-CPU system notably faster than a dual-CPU system for large files.