Reasons that ECC Memory Matters

ECC (Error Correcting Code) memory is used in servers and advanced workstations like the 2013 Mac Pro in order to forestall data corruption. ECC memory will also be used in the iMac Pro which is due out late this year.

ECC memory can detect and correct single-bit flips. Two bit flips in the same 64-bit word of memory cannot be corrected and result in a kernel panic, so I know that did not happen.
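To make the "correct one flip, detect two" behavior concrete, here is a minimal sketch of a SECDED (single error correct, double error detect) code. It is an illustration only: it protects 4 data bits with an extended Hamming(8,4) code, whereas ECC modules protect each 64-bit word with 8 extra check bits. The function names `encode` and `decode` are made up for this example and do not reflect how any memory controller is actually implemented.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indices 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = d
    # The check bit at position p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity makes the whole word even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0

    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd overall parity: assume a single flip; the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)
word[5] ^= 1          # one bit flip
print(decode(word))   # (11, 'corrected')
word[2] ^= 1          # a second flip in the same word
print(decode(word))   # (None, 'uncorrectable')
```

The "uncorrectable" case is the analogue of the kernel panic: the hardware knows the word is bad but cannot tell which bits flipped, so halting is safer than continuing with corrupt data.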

Heat and ECC memory errors

Recently we saw record-breaking heat in my area with even San Francisco hitting 105°F. I have no air conditioning and with outside temperatures around 110°F, my office was around 95°F and maybe warmer—oppressive but I’ll take it over 4 feet of rain any day.

A lot of cold drinks over ice bring down core body temperature, but a computer cannot do that: all Macs have limits on the ambient temperature at which they can cool themselves, and things can start breaking down beyond that limit. That is why my 2015 MacBook Pro temporarily ceased to operate properly a few weeks ago. The 2013 Mac Pro is robust, but its operating range tops out at 95°F. If the machine is dusty inside, heat can build up, which can dramatically lower the maximum ambient temperature the machine tolerates and thus increase the chances of ECC memory errors, so clean the dust off the innards of the machine. My Mac Pro might just be due for another cleaning.

See also The Thermal Conductivity of Moist Air.

Near the operating limit, that is what I saw happening: ECC errors began cropping up as shown below. Rebooting clears ECC errors and all is well again. Since the ECC memory corrected the bit flips, no data in memory was corrupted, hence files on disk written from memory are not corrupted due to bit flips. This is the core benefit of ECC memory (besides not crashing). No ECC errors showed up from morning till mid-day (we close up our house and that keeps internal temperatures down until early afternoon). That pattern repeated each day.

At first I thought that the ECC errors indicated a bad module, since it was/is always the same module. Indeed, that module must be “weak” somehow if none of the others show errors; maybe it is cooled a little less well or has a little more dust, or just variances that are always there.

What I discovered is that this ECC error only occurs on extremely hot days. So if ECC errors are observed under conditions at 90°F or warmer, don’t panic—reboot and keep working unless the errors keep cropping up—in that case shut the machine down and defer using it. And clean out the dust (which is not viable on an iMac, but is easy on a Mac Pro).

Memory status on 2013 Mac Pro: ECC memory error with one module
Memory status on 2013 Mac Pro — all good

Altitude and ECC memory

The chance of bit flips is far higher at high altitude than near sea level (and higher still in an airplane), due to cosmic rays and high-energy neutrons (yes, this is a real issue!). That is one reason laptops and air travel are a bad combination, and Apple has never offered ECC memory in a laptop.

Today it's widely recognized that neutron radiation is a major factor limiting the reliability of advanced electronics. Chipmakers and users have been learning the hard way that they need to measure neutron-induced effects in advance to avoid dangerous, costly failures.

Those studies are presumably not at over two miles in altitude where I do work.

I don’t know if my concern is realistic or not. It is probably a legitimate concern for a laptop, since I tend to sleep it rather than shut it down (rebooting clears ECC errors).

Since I now expect to do a fair amount of work in the field at extreme altitude (up to 11,500' elevation), I am now pondering whether I should be doing that work on the iMac Pro when it comes out. Thing is, how would I know that there is an issue unless I use a computer with ECC memory? The 2013 Mac Pro would be fine, but I need dual displays to work effectively and one of those has to be Retina for screen shots.

