Scalability — multi-core performance
This page discusses what it means to “scale” in the context of multiple CPU cores.
In a perfect world, 8 cores would complete a task in exactly half the time that 4 cores require.
In the real world that almost never happens: every useful task accesses memory and/or a hard drive or the network. There is also the overhead of coordinating multiple “threads” (workers), generally one per CPU core. Yet there are well-written programs that can approach perfect scaling.
For examples of scalability or non-scalability, see:
What full CPU usage looks like
We will use Genuine Fractals 6.0 as the “poster child” for very good scalability from a commercial software application.
On Mac OS X, Activity Monitor can be used to view CPU usage.
This graph shows almost total usage of the CPU cores (800% at 100% per core), but understand that full use doesn’t mean full efficiency.
That’s right: even though the CPU cores appear busy, they could actually be mostly stalled, twiddling their thumbs (so to speak) while they compete for access to the same memory.
Genuine Fractals scalability
Fortunately, Genuine Fractals 6.0 does mostly computing (calculation), with minimal disk access and moderate memory access requirements, so the CPU cores are actually being used with high efficiency. The black inverse spikes indicate times when the CPU cores are idle; monitoring shows that these are times when the disk is being used.
To determine the scalability of Genuine Fractals, I timed the same task with 1, 2, 4 and 8 cores, which is easy to do with Apple’s developer tools (CHUD) via the menu or the Processor Palette. This is not a perfect test; the program still thinks there are 8 cores, even if some of them are disabled, so it’s going to create 8 “threads” rather than one thread per active core available to do work.
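The thread-count caveat above can be sketched in code. This is a minimal illustration, not how Genuine Fractals works: the helper names `usable_core_count` and `worker_count` are hypothetical, and whether a given operating system reports disabled cores through these calls is an assumption that should be checked on the machine in question.

```python
import os

def usable_core_count():
    """Return the number of cores this process may actually run on."""
    # os.sched_getaffinity exists only on some platforms (Linux, for example);
    # it reports the cores the process is allowed to use, which may be fewer
    # than the number installed.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: the logical core count reported by the OS (None if unknown).
    return os.cpu_count() or 1

def worker_count(requested=None):
    """Pick a worker count: at least 1, never more than the usable cores."""
    usable = usable_core_count()
    if requested is None:
        return usable
    return max(1, min(requested, usable))

if __name__ == "__main__":
    print("usable cores:", usable_core_count())
    print("workers for a request of 8:", worker_count(8))
```

A program that sizes its thread pool this way would create 4 workers on a machine with 4 active cores, instead of 8 workers fighting over 4 cores.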
The job was to scale an image to 40X30" at 360 dpi. Times were recorded on a 2.8GHz 8-core Mac Pro (2008) with 32GB memory on Mac OS X 10.5.6. Scalability is very good, but not perfect; disk I/O causes brief pauses on a regular basis (the black areas in the CPU usage graph above).
With some engineering effort, the Genuine Fractals engineers might be able to overlap the disk I/O with computation so as to eliminate the regular (though short) periods of idle CPU usage; visually, that disk I/O appears to be responsible for a significant part of the less-than-perfect scalability.
Bottom line: with Genuine Fractals 6.0, an 8-core Mac Pro is effectively a 7.4-core machine in actual results. That’s not perfect, but it’s very very good.
|Cores|Time|Speedup|Notes|
|---|---|---|---|
|1|1474|1.0X|With a single core, inefficiencies do creep in; disk I/O potentially stops all computing activity until done. Also, any background activity (Mac OS X itself, other programs, etc.) takes away from the single active core’s ability to get its job done.|
|2|729|2.02X|With two cores, one core might continue computing while the other is idle doing disk I/O, and contention for memory access is still relatively low.|
|4|385|3.82X|Best possible is 4.0X. With 4 cores, contention for memory access rises.|
|8|200|7.37X|Best possible is 8.0X. With 8 cores, contention for memory access rises further.|
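The speedup figures in the table above are simply the single-core time divided by the time with more cores. The short sketch below redoes that arithmetic and adds a per-core efficiency figure (speedup divided by core count). The function names are mine, and the timings are copied from the table in whatever units it uses.

```python
# Timings copied from the table above: cores -> elapsed time (table units).
timings = {1: 1474, 2: 729, 4: 385, 8: 200}

def speedup(timings, cores):
    """Speedup relative to the single-core run."""
    return timings[1] / timings[cores]

def efficiency(timings, cores):
    """Fraction of perfect scaling achieved (1.0 means perfect)."""
    return speedup(timings, cores) / cores

if __name__ == "__main__":
    for cores in sorted(timings):
        s = speedup(timings, cores)
        e = efficiency(timings, cores)
        print(f"{cores} cores: {s:.2f}X speedup, {e:.0%} efficiency")
```

At 8 cores the efficiency comes out at about 92%, which is where the “effectively a 7.4-core machine” figure comes from. The 4-core value rounds to 3.83X here; the table’s 3.82X looks like the same number truncated rather than rounded.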
The MemoryTester test-compute-speed command computes a SHA1 hash, using all active CPU cores, with a mix of pure computation and moderate memory access (most of which can be cached by the CPU on-chip cache). MemoryTester is smart enough to recognize how many active CPU cores are present.
Let’s see how it scales, keeping in mind that background system activity scarfs up a wee bit of processing power, and that this is a larger share of the total with a single core than with several.
We can see here perfect scalability, the ideal situation. Almost certainly test-compute-speed would scale very well to 16 cores. Most applications have no hope of scaling this well, either from poor engineering, or simply the nature of the work to be done.
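For readers who want to try a similar experiment, here is a rough sketch of a SHA-1 compute test in Python. It is not MemoryTester’s implementation; the function names, job count and buffer sizes are all assumptions chosen for illustration. It relies on the documented behavior that Python’s `hashlib` releases the interpreter lock while hashing large buffers, so plain threads can keep several cores busy.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def sha1_rounds(seed, rounds, size):
    """Hash a buffer repeatedly, feeding each digest into the next round."""
    block = bytes([seed % 256]) * size
    digest = b""
    for _ in range(rounds):
        h = hashlib.sha1(digest)
        h.update(block)  # large updates release the GIL, so threads overlap
        digest = h.digest()
    return digest.hex()

def run(workers, jobs=8, rounds=10, size=1 << 20):
    """Run `jobs` independent hashing jobs on `workers` threads.

    Returns (elapsed_seconds, list_of_hex_digests).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(lambda s: sha1_rounds(s, rounds, size), range(jobs)))
    return time.perf_counter() - start, digests

if __name__ == "__main__":
    base, _ = run(1)
    for workers in (1, 2, 4, 8):
        elapsed, _ = run(workers)
        print(f"{workers} workers: {elapsed:.2f}s, {base / elapsed:.2f}X")
```

The jobs are independent and each works on its own buffer, which is the kind of workload that has a chance of scaling well. Actual numbers will vary with the machine and with whatever else is running.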
Limitations on scalability
The limiting factor for scalability is usually software. It requires top-notch engineers to design and build correct and efficient applications; it’s hard stuff.
Specialty programs like Genuine Fractals 6.0 have engineered efficient use of 8 cores. Other programs use multiple CPU cores to some degree, but might not pick even the “low-hanging fruit”. For example, Adobe Photoshop CS4 is single-threaded while opening and saving files, and won’t even allow the user to work on something else while an open or save is in progress; 7 of 8 cores on a Mac Pro sit idle during the process.
When disk I/O is involved, there are techniques to overlap disk I/O with simultaneous computation, so even programs that use the disk heavily can exploit multiple cores efficiently, at least so that the disk alone becomes the limiting factor (CPU speed having little effect).
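One common way to overlap disk I/O with computation is a read-ahead thread feeding a bounded queue. The sketch below is a generic illustration under my own assumptions (hashing stands in for the “computation”, and the names `read_chunks` and `hash_file_overlapped` are hypothetical); it is not taken from any of the programs discussed here.

```python
import hashlib
import queue
import threading

def read_chunks(path, out_queue, chunk_size):
    """Reader thread: pull chunks off the disk and hand them to the compute side."""
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                out_queue.put(chunk)  # blocks when the queue is full
    finally:
        out_queue.put(None)  # sentinel: no more data, even if reading failed

def hash_file_overlapped(path, chunk_size=1 << 20, depth=4):
    """Hash a file while a background thread reads ahead up to `depth` chunks."""
    chunks = queue.Queue(maxsize=depth)
    reader = threading.Thread(target=read_chunks, args=(path, chunks, chunk_size))
    reader.start()
    h = hashlib.sha1()
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        h.update(chunk)  # compute on this chunk while the reader fetches the next
    reader.join()
    return h.hexdigest()
```

The bounded queue keeps memory use in check: the reader can get at most `depth` chunks ahead. If the disk is the slower side, the compute loop waits on the queue and the disk becomes the limiting factor, as described above. A production version would also report read errors to the caller instead of just ending the stream.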
For compute-bound applications, the main bottleneck is memory access. This is why we keep seeing faster and faster memory (and larger on-chip caches) in each generation of computer; fast CPUs aren’t so fast when they’re forced to compete for access to memory. The 2008 Mac Pro improved memory speed to 800MHz from 667MHz (20% faster), and a 2009 Mac Pro will almost certainly move to 1066MHz or faster memory.