Thank you for purchasing through links and ads on this site.
OWC / MacSales.com...
diglloyd Deal Finder...
Buy other stuff at Amazon.com...

2017 iMac Pro: Scalability with CPU-Intensive, Memory-Intensive and Mixed Workload

The data on this page and others was painstakingly obtained, made possible by the support of OWC / MacSales.com and B&H Photo. Please order through our links via these companies and/or subscribe—thank you.

See the MPG recommendations for iMac Pro, backup, peripherals. Not sure which Mac to get or how to configure it? Consult with MPG.

 

This page uses the compute command of diglloydTools MemoryTester to evaluate potential performance for the iMac Pro.

Prelude: Why More CPU Cores Often Run no Faster, and Sometimes Slower than Fewer CPU Cores

The 2017 iMac Pro has CPU options of 8/10/14/18 CPU cores, with 14 and 18 CPU cores being unprecedented for Apple Macs (12 was the previous max).

The threshold of noticing any difference in performance is around 10%, so in theory 18 CPU cores sounds terrific: 18 versus 8 sounds like no contest. Yet more CPU cores can be slower or offer only minor improvement—it all depends on the workload. Three test examples follow, which illustrate that idea.

Only with very specific multithreading-friendly workloads do 10/14/18 CPU cores get more done significantly faster than 8 or 10 cores. That assumes excellent software implementation with a scalable algorithm. With most workloads, the differences are modest even between 4 and 8-core CPUs.

The foregoing is why a 2017 iMac 5K can often be a superior value for many computing tasks: it can run all 4 of its cores at 4.2 GHz whereas the 8/10/14/18 CPU core iMac Pro Intel Xeon W CPUs must drop the clock speed as more CPU cores are used, reducing the benefits of each additional CPU core. At the same time, more cores all contend with each other for scarce resources and must in essence “take turns”.

On the flip side, the 2017 iMac 5K has only 2-channel memory vs 4-channel memory on the iMac Pro, so with workloads that are memory intensive, the 2017 iMac 5K cannot use its 4 CPU cores effectively, being throttled by having CPU cores wait for relatively slow main memory.

Many factors contribute to scalability. Good scalability means that twice the CPU cores would get the job done in about half the time. That can be true for 2/4/6 core machines for well written software, but going beyond 6 CPU cores, many factors conspire to make further improvements incremental.

  • As more CPU cores are used, Intel Turbo Boost drops the clock speed down to the base clock speed—each CPU core slows down as more are used. See the chart of turbo boost clock speeds for Intel Xeon E series (as this was written, data was not available for the Intel Xeon W processors).
  • Contention for on-chip cache memory and contention for main memory—CPU cores are forced to wait on molasses-slow main memory (far slower than CPU registers and on-chip memory caches).
  • Contention for shared application data structures (“thread safety”)—CPU cores are forced to queue up to access a necessary resource. The must “take turns” and thus go idle while waiting.
  • Software: inappropriate or inefficient algorithm selection for many CPU cores and/or poor multi-threading implementation. This is a software engineering problem. Some workloads can only be done in a sequential series of steps; this is called serialization and it precludes using many CPU cores (or even two). The workload can be a series of single-threaded and multi-threaded tasks. How much more CPUs helps depends heavily on whether the task can be split up for each CPU core to handle simultaneously. This is Hard To Do in software development, even if the task is amenable to it, and attempts to parallelize often lead to difficult to track down sporadic bugs. Few software engineers have the skills to write correct multi-threaded code.
Rigorously lab tested and OWC certified.

About Scalability

Perfect scalability means half the time for twice the CPU cores. There is always overhead of some kind as discussed above—perfect scalability is an asymptotic goal always to be hoped for but never quite reached.

Measurement variabilities due to background system activity and daemons cannot be avoided when testing. However, repeated iterations minimize these variations to less than 1% from run to run of the tests, so the data shown here is precise.

The examples that follow explore the potential performance envelope for pure computation, memory-intensive tasks, and a mixed workload.

Test: scalability of pure CPU workload

10 CPU cores have 20 virtual cores (hyper threading). The test uses up to 22 threads.

Pure computation (no memory access) using 20 threads yields 12.22X the speed versus a single thread, showing that hyperthreading has considerable value, since the relative time (1.39) is 39% longer than for 10 threads (relative time 1.00).

Also apparent is that there is no gain using additional threads beyond the number of virtual CPU cores (20 on a 10-core CPU). This makes sense, since more threads than virtual cores cannot possibly do anything but increase contention.

Scalability for pure computation is outstanding up to 10 threads:

 2 threads (6.07) takes 49.7% of the time of 1 thread (12.22)
 4 threads (3.15) takes 51.9% of the time of 2 threads (6.07)
 8 threads (1.61) takes 51.1% of the time of 4 threads (3.15)
10 threads (1.39) takes 54.3% of the time of 5 threads (2.56)
20 threads (1.00) takes 71.9% of the time of 10 threads (1.39)

For real-world tasks, results this scalable are highly unlikely. And perhaps most interesting of all for the purposes of paying for a 10-core CPU versus an 8-core CPU:

8 threads (1.61) takes only 15.8% longer than 10 threads (1.39) vs 25% theoretical.

The reduced scalability from 8 to 10 cores is surely the reduction in clock speed as more CPU cores are used, since below that the scalability is nearly perfect.

Since 10 threads in theory provides 25% more computing power than 8 threads and this is a best-possible case (no memory access and disk I/O), it shows that 10 CPU cores is likely to be as marginal value over 8 CPU cores, particularly for real-world tasks where memory access and disk I/O are factors. Real-world test results bear this out, indeed showing on many tasks that even a 4-core CPU is competitive with an 8 or 10 core CPU.

Time are normalized; a relative time of 1.0 is the minimum and fastest.

Relative performance for pure CPU workload on 2017 iMac Pro
Relative performance for pure CPU workload on 2017 iMac Pro

Test : scalability of SHA1 hash

10 CPU cores have 20 virtual cores (hyper threading). The test uses up to 22 threads.

The example below uses data from the compute command of diglloydTools MemoryTester. The test runs the SHA1 cryptographic hash, used by diglloydTools IntegrityChecker for file integrity checking.

The SHA1 algorithm has very light memory access relative to its integer computations. Thus we can expect it to scale similarly to pure CPU computation, especially with large on-chip caches—and this is what we see here.

For hashing, different CPU threads can run independently (simultaneously) while hashing different chunks of data; the main contention points are taking a task off a queue and putting a result onto a results queue. That is a relatively easy coding challenge to tackle, so that memory access becomes the limiting factor—but since there is a lot of computation relative to memory access, even 10 CPU cores scales well. That is, unless disk I/O is involved as with using diglloydTools IntegrityChecker in which case performance is limited to how fast the data can be read off the disk as well as operating system inefficiencies. Here in this example, there is no disk I/O involved , so it is as good as it gets.

These test results were so close to the pure CPU results that I had to do a double-take—but they are correct. The very large on-chip CPU caches are apparently highly effective. As well, macOS task scheduling is excellent.

Scalability for SHA1 is outstanding up to 10 threads:

 2 threads (5.90) takes 49.7% of the time of 1 thread (11.87)
 4 threads (3.08) takes 52.2% of the time of 2 threads (5.90)
 8 threads (1.61) takes 52.3% of the time of 4 threads (3.08)
10 threads (1.39) takes 54.3% of the time of 5 threads (2.56)
20 threads (1.00) takes 71.9% of the time of 10 threads (1.39)

See the discussion of 8 vs 10 cores in the pure CPU test above—8 cores takes only 15% longer than 10 cores versus showing that 10 cores is a modest improvement even in an ideal case.

Time are normalized; a relative time of 1.0 is the minimum and fastest.

Relative performance for CPU-intensive moderate memory access workload on 2017 iMac Pro
Relative performance for CPU-intensive moderate memory access workload on 2017 iMac Pro

Example : scalability of memory intensive workload

10 CPU cores have 20 virtual cores (hyper threading). The test uses up to 22 threads.

In the graph below, the workload consists of constant memory access in every thread—a worst case scenario for scalability.

Scalability with constant memory access is poor, with the 4-channel memory of the 2017 iMac Pro almost maxed out with 4 threads and and with almost nothing left to give at 5 threads. The very small improvements seen out to 9 threads is due to overlapping instruction execution of hyperthreading, but it has little meaningful benefit beyond 4 threads—4% at best.

The foregoing has real implications for CPU choices with 10/14/18 cores: the benefits are likely to be minimal to zero for many applications.

A 10-core CPU (20 virtual cores) thus has severe scalability limits for memory intensive tasks. Indeed, a 10-core CPU is in effect little different than a 4-core CPU under such workloads, and that also assumes a system that is otherwise idle and thus full availability of cache memory for the threads involved—this does not happen in the real world when other processes are also demanding memory and CPU cycles.

Observe also that once memory bandwidth becomes a constraint, the use of any number of virtual CPU cores steadily degrades performance. Indeed, using 10 threads is already slower than 9 threads and it degrades from there.

This chart demonstrates why 6 channel memory in the Intel Gold series would be a true “pro” CPU and a major win for memory-intensive computing tasks (think Photoshop!). But the Apple 2017 iMac Pro uses CPUs that support only 4-channel memory—a dubious proposition for 14 and 18 core CPUs, let alone a 10-core CPU.

Time are normalized; a relative time of 1.0 is the minimum and fastest.

Relative performance for memory-intensive workload on 2017 iMac Pro
Relative performance for memory-intensive workload on 2017 iMac Pro

Recommended Items for iMac Pro and iMac 5K

See the recommendations page for details on why these items are recommended.

 

Rigorously lab tested and OWC certified.

diglloyd.com | Terms of Use | PRIVACY POLICY
Contact | About Lloyd Chambers | Consulting | Photo Tours
Mailing Lists | RSS Feeds | Twitter
Copyright © 2008-2017 diglloyd Inc, all rights reserved.
Display info: __RETINA_INFO_STATUS__