Testing Adobe Photoshop Lightroom (or Photoshop) often shows modest speed gaps between computers that ought to diverge more strongly: for example, an 8-core system runs only 20% faster than a 4-core system, or even more slowly.
This thread at Adobe asks for performance suggestions. I could take days and speak to any number of issues in Lightroom and Photoshop, but this post will be my contribution—being an unpaid consultant pays no bills.
Some of the unharvested speed potential seen in applications is due to clock speed differences (e.g., 3.5 vs 4.2 GHz), but that is not enough to explain why 6 or 8 or 10 or 12 cores run only marginally faster than 4 cores.
Speaking in general, here are just a few reasons that speed is not faster on machines with more CPU cores (these comments are not about Adobe software specifically, but many do apply in at least some cases):
- Serializing I/O with computation.
- Serializing computational steps that could run in parallel.
- Failure to use more than one CPU core at all, or failure to do so at certain steps of the computation, thus creating a choke point that limits peak speed.
- Busy waiting (Topaz InFocus runs no faster with 12 cores than with 2 cores due to this severe bug).
- Failing to use large enough memory buffers for disk I/O or other purposes.
- Allocating and disposing of memory at high frequency, instead of maintaining a cache.
- Failing to cache computational results that can be re-used.
- Inappropriate queuing mechanisms, or simply not using job or I/O queues at all.
- Using less than optimal algorithms, e.g., an O(n^2) algorithm instead of an O(n log n) algorithm, even for a short while, creating choke points that throttle performance.
- Failure to do obvious things, like process 8 files in parallel on an 8-core CPU, together with appropriate I/O queues.
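As an illustration of the last item, here is a minimal Python sketch of processing several files at once, one task per file, with as many workers as there are CPU cores. The names `process_file` and `process_all` are hypothetical, and hashing stands in for real per-file work (it is used here only because Python's hashlib releases the interpreter lock on large buffers, so threads can actually run in parallel). This is not how Lightroom or any Adobe product is implemented.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def process_file(path):
    """Hypothetical per-file job: read the file in one call, then hash it."""
    with open(path, "rb") as f:
        data = f.read()
    return path, hashlib.sha256(data).hexdigest()


def process_all(paths, workers=None):
    """Process many files concurrently, up to `workers` at a time."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands one file to each free worker and returns
        # the results in the same order as the input paths.
        return list(pool.map(process_file, paths))
```

With 8 files and 8 workers, all 8 jobs are in flight at once instead of running one after another.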
The reason performance fails to scale on machines where it ought to is algorithmic inefficiency: a failure to optimize the computational process, and even a failure to take trivially easy steps, like using I/O sizes large enough to exploit the full speed of today’s flash drives (SSDs). If Adobe did the job better, we’d all get a free speed boost.
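To make the algorithmic point concrete, here is a small Python sketch of one task (detecting duplicates in a list) done two ways. The task and function names are chosen purely for illustration and have nothing to do with any particular Adobe code path.

```python
def has_duplicates_quadratic(items):
    """O(n^2): compare every pair of items."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_sorted(items):
    """O(n log n): sort once, then compare each item with its neighbor."""
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```

Both give the same answer, but doubling the input roughly quadruples the work for the first version and only slightly more than doubles it for the second. On large inputs that difference swamps anything extra CPU cores can buy back.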
I was a professional software engineer for 35 years, and I’ve optimized code for more years than some of today’s programmers have been out of diapers. I’ve learned many things about single-threaded and highly multi-threaded performance during that time in a variety of languages, including how few programmers have any concept of big-O notation or algorithmic efficiency. That said, I have to believe that Adobe has numerous highly talented engineers who could greatly improve performance.
Shown below is CPU usage during an Export job (see the test results). Notice how there is a regular spiky dip or valley in the CPU usage, indicating that the CPU cores go idle on a periodic basis.
Judging by Activity Monitor as well as a Watts Up electricity meter, these dips appear to correspond to I/O activity. Even though the iMac 5K SSD is very fast, there are at least two reasons the CPU cores are forced to idle:
- Lightroom appears to serialize I/O with computation (akin to stop lights smack dab in the interstate highway) instead of using separate work queues and an I/O read-ahead queue and an I/O write queue. The result is that CPU cores are forced to idle while waiting for I/O to complete. That’s a beginner mistake in my book.
- While the iMac 5K SSD is very fast, at I/O sizes up to 2MB it delivers only half of its peak speed; I/O sizes of 64MB or so are required for maximum speed. And yet Lightroom apparently never uses I/O sizes larger than 1MB, thus cutting the I/O speed to less than half of what is possible. See the iostat figures further below. So the I/O takes more than double the time necessary, and it is apparently serialized with computation.
The I/O speed thing is a trivial fix: at the least, read the raw file in a single read, which is a no-brainer. Then use I/O queues for both input and output (harder to do, but not very hard).
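A minimal sketch of what separate input and output queues might look like, written in Python for brevity. Everything here is hypothetical: `run_pipeline` is an invented name, and `load`, `compute`, and `store` are caller-supplied functions standing in for reading a raw file, processing it, and writing the result. Error handling in the writer is omitted to keep the sketch short.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def run_pipeline(paths, load, compute, store, depth=4):
    """Overlap I/O with computation using two bounded queues."""
    read_q = queue.Queue(maxsize=depth)   # read-ahead queue
    write_q = queue.Queue(maxsize=depth)  # write queue

    def reader():
        try:
            for p in paths:
                # Blocks once `depth` items are waiting, so read-ahead
                # cannot run arbitrarily far ahead of the computation.
                read_q.put((p, load(p)))
        finally:
            read_q.put(_DONE)

    def writer():
        while True:
            item = write_q.get()
            if item is _DONE:
                break
            store(*item)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    # The calling thread computes while the other two threads do the I/O.
    while True:
        item = read_q.get()
        if item is _DONE:
            break
        path, data = item
        write_q.put((path, compute(data)))
    write_q.put(_DONE)
    for t in threads:
        t.join()
```

The point of the design is that the next file is already in memory by the time the computation finishes the current one, and finished results are written out in the background, so the CPU never has to sit idle waiting on the disk.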
What iostat shows about Lightroom I/O
The iostat results show that Lightroom never performs I/O in sizes larger than 1MB, which means using less than half the speed potential of the SSD.
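For anyone who wants to try the effect of I/O size on their own drive, here is a rough Python sketch that reads a file either in one request or in fixed-size requests and reports throughput. The names `read_file` and `time_read` are hypothetical. Two caveats: the operating system may split or merge requests before they reach the drive, and a file that is already in the OS file cache will read far faster than the drive itself, so use a large file that has not been read recently.

```python
import time


def read_file(path, request_size=None):
    """Read a whole file, in one request or in `request_size`-byte requests."""
    # buffering=0 disables Python's own buffer, so each read() below
    # goes to the operating system as a request of the size given.
    with open(path, "rb", buffering=0) as f:
        if request_size is None:
            # Read to end of file in as few requests as Python can manage.
            return f.read()
        chunks = []
        while True:
            chunk = f.read(request_size)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)


def time_read(path, request_size=None):
    """Return throughput in MB/s for one pass over the file."""
    start = time.perf_counter()
    data = read_file(path, request_size)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / elapsed / 1e6
```

Comparing `time_read(path, 1 << 20)` (1MB requests) against `time_read(path)` (single read) on a large raw file gives a crude picture of how much speed small requests leave on the table.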
Half-speed I/O creates a choke point that manifests in the failure of an 8-core 3.3 GHz Mac Pro to run more than modestly faster than a 4-core 4.2 GHz iMac 5K; indeed, it runs slower for an Import operation. How much of the problem is due to I/O cannot easily be told, but it is surely a factor, and those running hard drives or Fusion drives will be impacted much more severely.
As a crude measure of CPU cores and clock speed:
Mac Pro: 8 * 3.3 = 26.4 GHz of CPU core = 1.57X as many cycles (but slower SSD)
iMac 5K: 4 * 4.2 = 16.8 GHz of CPU core
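The same crude arithmetic as a few lines of Python, in case you want to plug in other machines. The helper name `aggregate_ghz` is made up for this example, and cores times clock speed is only a rough yardstick, not a real performance model.

```python
def aggregate_ghz(cores, clock_ghz):
    """Crude measure: number of cores times clock speed."""
    return cores * clock_ghz


mac_pro = aggregate_ghz(8, 3.3)
imac = aggregate_ghz(4, 4.2)
ratio = mac_pro / imac
print(f"{mac_pro:.1f} vs {imac:.1f} GHz, ratio {ratio:.2f}X")
# prints: 26.4 vs 16.8 GHz, ratio 1.57X
```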
iMac:DIGLLOYD lloyd$ iostat -dK -w 1
         disk0
   KB/t  tps    MB/s
 298.59   13    3.79
  88.00    3    0.26
   4.00    2    0.01
   0.00    0    0.00
  18.00    4    0.07
  32.00    1    0.03
  72.00    2    0.14
   8.00    1    0.01
  29.00    4    0.11
 932.42  122  111.54
 197.87  395   76.31
   8.00    2    0.02
   0.00    0    0.00
  28.00    6    0.16
   8.00    3    0.02
   4.00    2    0.01
  30.05   39    1.14
  64.62   13    0.82
  36.00    1    0.03
  12.71   56    0.69
         disk0
   KB/t  tps    MB/s
 686.95  198  132.81
 880.80   30   25.67
   6.67    3    0.02
   4.00    2    0.01
   5.78    9    0.05
   4.00    1    0.00
   0.00    0    0.00
  28.00    3    0.08
  17.33    3    0.05
   0.00    0    0.00
1024.00   32   31.96
 907.16  151  134.06
  30.00    2    0.06
   4.80    5    0.02
   8.00    3    0.02
   4.00    1    0.00
   4.00    1    0.00
   8.00    1    0.01
 256.00    1    0.25
   0.00    0    0.00
         disk0
   KB/t  tps    MB/s
  14.23   95    1.32
 782.66  165  125.97
 917.48   62   55.53
   5.00    4    0.02
   0.00    0    0.00
   5.60    5    0.03
   4.00    1    0.00
   4.00    1    0.00
   6.00    2    0.01
   4.00    1    0.00
1024.00   24   23.97
 628.74  129   79.11
 904.27   89   78.50
 988.57    7    6.75
   6.00    2    0.01
   4.00    1    0.00
   0.00    0    0.00
   4.00    1    0.00
 256.00    2    0.50
   0.00    0    0.00
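The 1MB ceiling can be checked mechanically rather than by eye. Below is a small Python sketch that pulls the three columns out of single-disk iostat output and reports the largest KB/t value. The function name `parse_iostat` is hypothetical, and the sample is just a handful of rows copied from the output above. Keep in mind that KB/t is the average transfer size over each one-second sample, not the size of any single request.

```python
HEADER_TOKENS = {"KB/t", "tps", "MB/s"}


def parse_iostat(text):
    """Return (kb_per_transfer, tps, mb_per_s) tuples from single-disk iostat output."""
    values = []
    for tok in text.split():
        # Skip the repeating header lines ("disk0" and the column names).
        if tok in HEADER_TOKENS or tok.startswith("disk"):
            continue
        values.append(float(tok))
    usable = len(values) - len(values) % 3  # ignore a trailing partial row
    return [tuple(values[i:i + 3]) for i in range(0, usable, 3)]


SAMPLE = """
disk0
KB/t tps MB/s
932.42 122 111.54
197.87 395 76.31
1024.00 32 31.96
907.16 151 134.06
4.00 1 0.00
"""

rows = parse_iostat(SAMPLE)
largest = max(kb for kb, _, _ in rows)
print(f"largest average transfer: {largest:.2f} KB")
# prints: largest average transfer: 1024.00 KB
```

Because the parser works on whitespace-separated tokens, it does not care whether the output is pasted with its original line breaks or run together on one line.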
Below is what one ought to see with a well-optimized program that is CPU intensive. Note the lack of any dropouts in CPU usage. The graph is from an IntegrityChecker verify invocation (IntegrityChecker is part of diglloydTools).