2022 Mac Studio M1 Ultra: CPU Core Scalability (Focus Stacking, Zerene Stacker)
MPG tested the $7999 Apple Mac Studio M1 Ultra 20-core CPU / 64-core GPU / 128GB / 8TB SSD, provided on loan for review by B&H Photo, an authorized Apple Mac dealer. Please buy your gear at B&H Photo and OWC/MacSales.com using any link from this site.
See the discussion of Zerene Stacker for focus stacking on the comparison page.
See for example Nikon D850 'Focus Shift shooting' feature for Easy Focus Stacking.
Zerene Stacker version 1.0.4 T2022-04-21-0715-beta. DMAP: focus stack 20 x 150 megapixel.
This graph is an outstanding demonstration of the diminishing returns of more CPU cores vs real-world computing problems. Amdahl’s law definitely applies to Zerene Stacker.
Often there are choke points ("serialization") that prevent more than one or a few CPU cores from being used simultaneously, along with contention for resources (memory bandwidth, I/O, shared data structures, etc.). However, the massive memory bandwidth of the M1 Ultra means that memory bandwidth is a minimal issue, and there is no I/O with Zerene Stacker. Still, Zerene Stacker does have serialization points, and these limit the number of cores that are useful.
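To make the diminishing returns concrete, here is a minimal Python sketch of Amdahl's law. The 95% parallel fraction is an assumed value chosen only for illustration; it is not a measurement of Zerene Stacker, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Best-case speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never speeds up; only the parallel part divides by cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# ASSUMPTION: hypothetical parallel fraction, NOT a measured Zerene Stacker value.
ASSUMED_PARALLEL_FRACTION = 0.95

for cores in (1, 2, 4, 8, 16, 28, 56):
    speedup = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, cores)
    print(f"{cores:2d} cores: {speedup:5.2f}x speedup")
```

With a 5% serial portion, the speedup can never exceed 20X no matter how many cores are added, which is the general shape of the flattening curve.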
About CPU clock speed and core usage
This graph is misleading in a certain sense: both the M1 Ultra and 28-core Xeon CPUs run at a nominal 3.6 GHz. But the Xeon upclocks itself as high as 4.6 GHz and downclocks itself all the way to 2.5 GHz as more cores are used.
Does the M1 Ultra maintain its 3.6 GHz clock speed even when all 16 performance cores are in use? The graph suggests that it does not, since it runs almost identically in speed to the 28-core Xeon, whose clock speed is steadily dropping as more cores come into use. But perhaps there is another explanation.
With only 1 or 2 cores in use, the Xeon can hit higher clock rates (4.6 GHz?), and then as more and more cores are used, the clock speed drops all the way down to a lazy 2.5 GHz.
In a nutshell then, the number of cores is not directly comparable given the variable clock speed of the Xeon and the efficiency cores of the M1 Ultra. But there is no obvious way to correct for those differences, and so the graph is shown cores-vs-cores.
Scalability should be evaluated proportional to the number of CPU cores, e.g., twice as many cores ideally would run 2X faster. It never works quite that well barring very specialized computing tasks and specialized hardware, and even that has its limits.
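One way to put a number on "how close to ideal" is to compute speedup and per-core efficiency from run times. The sketch below is in Python, and the run times are made-up illustration values, not readings from the graph.

```python
def speedup(time_one_core: float, time_n_cores: float) -> float:
    """How many times faster the multi-core run is than the 1-core run."""
    return time_one_core / time_n_cores


def efficiency(time_one_core: float, time_n_cores: float, cores: int) -> float:
    """Fraction of ideal scaling achieved; 1.0 means perfectly proportional."""
    return speedup(time_one_core, time_n_cores) / cores


# ASSUMPTION: hypothetical run times in seconds, for illustration only.
example_times = {1: 4000.0, 2: 2050.0, 8: 560.0, 16: 330.0}
baseline = example_times[1]

for cores, seconds in example_times.items():
    s = speedup(baseline, seconds)
    e = efficiency(baseline, seconds, cores)
    print(f"{cores:2d} cores: {s:5.2f}x speedup, {e:6.1%} of ideal")
```

An efficiency that falls steadily as cores are added is the signature of serialization or contention.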
Most striking is that the Xeon and the M1 Ultra hardly differ up to 16 CPU cores! This implies that the M1 Ultra seemingly also downclocks its CPU cores as more cores are used (Apple takes pains to not mention clock speed at all in marketing the chip). Otherwise, we should see a steady divergence from the Mac Pro as its CPU downclocks, but we do not see that. And yet, other compelling tests show that the M1 Ultra does NOT downclock.
The graph shows us that the M1 Ultra efficiency cores degrade performance (core counts 17/18/19/20 vs 16), probably because, as laggards, they add overhead and create choke points. The difference is not large (about 5%), but it is very real.
From 17 cores onward, the 28-core Xeon takes the lead. The CPU cores are real CPU cores out to 28; beyond that, the additional cores are virtual CPU cores (one more per real core), for a total of 56 CPU cores. Remarkably, and unlike what I’ve seen in almost all software, these additional virtual CPU cores drop run time 34%, from 602 seconds to 397 seconds. Or put another way, it takes 52% longer with 28 CPU cores than with 56 (virtual) CPU cores.
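The two percentages describe the same pair of run times from opposite directions, which is easy to mix up. A short Python check using the 602-second and 397-second figures quoted above (the function names are just for illustration):

```python
def percent_reduction(before: float, after: float) -> float:
    """Percent drop in run time, relative to the slower (before) time."""
    return (before - after) / before * 100.0


def percent_longer(slower: float, faster: float) -> float:
    """Percent extra run time, relative to the faster time."""
    return (slower / faster - 1.0) * 100.0


time_28_cores = 602.0  # seconds, 28 real cores (from the text)
time_56_cores = 397.0  # seconds, 56 virtual cores (from the text)

print(f"Run time drops {percent_reduction(time_28_cores, time_56_cores):.0f}%")
print(f"28 cores take {percent_longer(time_28_cores, time_56_cores):.0f}% longer")
```

The figures differ only because the denominator changes: 205 seconds is 34% of 602 but 52% of 397.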
To show the gains with the 2019 Mac Pro cores, the M1 Ultra test times beyond 20 cores use the same time as for 20 cores; in reality the time would increase, as more threads would slow things down. Zerene Stacker disallows using more threads than CPU cores.
Adjusting for the downclocking of the 28-core Intel Xeon W as more cores are used (approximate; exact clock speeds are not readily available), it is apparent that through 10 cores the scalability is excellent. Beyond 10 cores there is increasing divergence in actual vs theoretical results: at 28 CPU cores we see 468 seconds actual time vs a theoretical best of 322 seconds, which is a 45% longer runtime. That’s actually very good for general computing!
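A rough Python sketch of how such a clock-adjusted ideal could be estimated follows. It assumes run time scales inversely with both core count and clock speed, which is a simplification and not necessarily the exact method used for the graph. The single-core time and clock values in the usage lines are hypothetical; only the 468 and 322 second figures come from the text.

```python
def clock_adjusted_ideal_time(
    time_one_core: float,
    clock_one_core_ghz: float,
    cores: int,
    clock_at_cores_ghz: float,
) -> float:
    """Best-case run time on `cores` cores when each core runs at a lower clock."""
    clock_ratio = clock_at_cores_ghz / clock_one_core_ghz
    # Perfect scaling divides by core count; the clock ratio scales it back.
    return time_one_core / (cores * clock_ratio)


def overhead_percent(actual: float, ideal: float) -> float:
    """Percent by which the actual run time exceeds the ideal run time."""
    return (actual / ideal - 1.0) * 100.0


# ASSUMPTION: hypothetical 1-core time and clock speeds, for illustration only.
ideal = clock_adjusted_ideal_time(
    time_one_core=1000.0, clock_one_core_ghz=4.0, cores=10, clock_at_cores_ghz=2.0
)
print(f"Hypothetical ideal on 10 cores: {ideal:.0f} s")

# Figures quoted in the text: 468 s actual vs 322 s theoretical at 28 cores.
print(f"Overhead at 28 cores: {overhead_percent(468.0, 322.0):.0f}%")
```

Without the clock adjustment, the Xeon's scaling would look worse than it really is, because part of the shortfall comes from slower clocks rather than from the software.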
CPU utilization is shown below. Gaps/choppiness are obvious, with a recurring pattern of CPU utilization dropping to lower levels, then full usage, repeat. Overall, there is a lot of choppiness and thus a failure to use a large part of the CPU processing power.
On Intel Macs, CPU utilization looks less choppy, but why is unclear.
Below, CPU utilization on an 8-core (16 virtual core) 2019 iMac 5K. Note the relative absence of the gaps and choppiness seen above. OTOH, there are only 8 real cores in its Intel Core i9 CPU.
Below, CPU utilization on the 2019 Mac Pro 28-core. Half the cores are virtual, and that confuses matters. Choppiness seems less than in the Mac Studio graph, with denser CPU utilization overall.