Detecting Data Corruption Caused by Bit Rot or Bad Drives or Software Bugs with diglloydTools IntegrityChecker
Error rates in modern hard drives sound implausibly low, but they are within the range of real-world concern.
For example, the Toshiba 14TB enterprise hard drives specify “Nonrecoverable Read Errors” as 1 in 10^16 bits read, which equates to 1 bit error every 1250 terabytes [calculated as 10^16/8/(1000^4)]. That’s a tiny chance: in theory you’d need to read about 89 drives of size 14TB in full to encounter one error. Big data centers, of course, do have to concern themselves with that!
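The arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it treats the spec as an average of one unreadable bit per 10^16 bits read, and the variable names are mine, not taken from any drive datasheet.

```python
# Back-of-the-envelope check of the "1 error per 10^16 bits read" figure.
BITS_PER_ERROR = 10**16   # spec: one nonrecoverable read error per 10^16 bits read
BYTES_PER_TB = 1000**4    # decimal terabyte, as drive vendors count capacity
DRIVE_TB = 14             # capacity of one drive in the example

tb_per_error = BITS_PER_ERROR / 8 / BYTES_PER_TB
full_drive_reads = tb_per_error / DRIVE_TB

print(f"{tb_per_error:.0f} TB read per expected error")
print(f"{full_drive_reads:.1f} full reads of a {DRIVE_TB} TB drive")
```

Put another way, reading a single 14TB drive end to end about 89 times moves as many bits as reading 89 such drives once.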
But there is no telling what happens to error rates 4+ years out in a drive’s life. Or whether data transfer is always reliable.
Counting on your backups without proving they are good is a bad way to operate as a professional. For myself, I don’t want my photos or my spreadsheets or anything else going bad today or next year or five years from now.
The critical step is to validate your data, both originals and backups, using diglloydTools IntegrityChecker, which can validate data on any media on any platform that supports Java (Mac, Windows, Linux, etc).
Flaky hard drives? Or maybe not.
Which leads me to today’s findings. After 5 days of Thunderbolt 3 hell, I finally had most of my backups made and proceeded to validate one large RAID-4 backup volume with about 13.6TB of data on it. This backup was created using Carbon Copy Cloner, which I have found to be highly reliable. The data in question was a clone of a backup clone—two generations away from the original.
The elegant thing about IntegrityChecker is that validating the Nth-generation copy proves that all copies were valid, at least when they were copied. If a corrupt file pops up, one can validate intermediate copies to see which devices introduced the error, then replace those unreliable devices. More important, the warning lets you know that your backup data is at risk.
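The general idea can be sketched in Python: record a hash for every file at the source, then recompute the hashes on any later-generation copy and compare. This is not how IntegrityChecker is implemented (I am not showing its internals here). The function names `build_manifest` and `verify_copy` are hypothetical, and SHA-256 is simply my choice for the sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content hash."""
    return {
        str(p.relative_to(root)): file_hash(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict[str, str], copy_root: Path) -> list[str]:
    """Return relative paths that are missing or differ in copy_root."""
    bad = []
    for rel, expected in manifest.items():
        candidate = copy_root / rel
        if not candidate.is_file() or file_hash(candidate) != expected:
            bad.append(rel)
    return bad


# Demo: original -> clone1 -> clone2, with one byte altered in clone2.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original, clone1, clone2 = tmp / "orig", tmp / "clone1", tmp / "clone2"
    for d in (original, clone1, clone2):
        d.mkdir()
        (d / "a.txt").write_bytes(b"hello")
        (d / "b.txt").write_bytes(b"world")
    (clone2 / "b.txt").write_bytes(b"worle")  # simulated corruption

    manifest = build_manifest(original)
    bad_in_clone1 = verify_copy(manifest, clone1)
    bad_in_clone2 = verify_copy(manifest, clone2)
    print("clone1:", bad_in_clone1)
    print("clone2:", bad_in_clone2)
```

If the last copy in the chain verifies cleanly, every copy before it must have delivered good data at the time it was copied. If it does not, running the same check on the intermediate copies narrows down where the damage was introduced.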
After about 24 hours (it takes a long time to validate 13.6TB), I was disturbed to find that a few files were flagged as having changed.
One file in particular was interesting. When I checked it against the original and the first clone, I found that the 2nd copy (the clone of the clone) had indeed changed. Even more curious, when I inspected it (hexdump -C), I saw the file contents change at least twice while reading the file, indicating that the file could not be read reliably! A group of 10 bytes or so was different each time it was read! Then things stabilized and the same results were seen each time, but always incorrect (corrupt).
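A quick way to test for this kind of unstable read is to hash the same file several times and see whether the digests agree. The sketch below assumes nothing beyond the Python standard library, and the helper names are hypothetical. One caveat: the operating system may serve repeated reads from its file cache, so a flaky device can look stable unless the volume is remounted or the cache is otherwise cleared between passes.

```python
import hashlib
import os
import tempfile


def read_digests(path: str, passes: int = 3) -> list[str]:
    """Hash the whole file `passes` times and return each pass's SHA-1 digest."""
    digests = []
    for _ in range(passes):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    return digests


def reads_are_stable(path: str, passes: int = 3) -> bool:
    """True if every pass produced the same digest."""
    return len(set(read_digests(path, passes))) == 1


# Demo on a temporary file, which a healthy system should read consistently.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 20480)
    demo_path = tmp.name
stable = reads_are_stable(demo_path)
os.remove(demo_path)
print("stable reads:", stable)
```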
Hard drives are supposed to detect bit errors. But were there actually errors? This seems vanishingly unlikely.
Extended attributes can cause a file to be modified?
Investigating further, I found that the file in question ("TrainingAndRaces") had one curious thing going on: 3 extended attributes, as follows:
diglloyd-iMac:Training lloyd$ xattr -l TrainingAndRaces
com.apple.FinderInfo:
00000000 58 4C 53 38 58 43 45 4C 01 00 00 00 00 80 00 00 |XLS8XCEL........|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020
com.apple.lastuseddate#PS:
00000000 69 80 07 5E 00 00 00 00 87 67 AE 1D 00 00 00 00 |i..^.....g......|
00000010
com.apple.quarantine: 0082;5e07806d;Microsoft Excel;
So now I am thinking that one of the extended attributes can actually cause a file’s contents to be modified, either during cloning by rsync or afterwards for unknown reasons.
This is exactly the sort of software risk that makes IntegrityChecker valuable—who knows what weird stuff can corrupt your files in addition to bit rot or other hardware issues?
Brief excerpt, showing the file flagged as corrupt: its file dates are the same but the hash of its contents is different, so the file has been changed (corrupted). That is confirmed by running shasum on the original and the cloned copies.
diglloyd-iMac:DIGLLOYD lloyd$ icj verify /Volumes/AtticClone_R4/_MasterClone/MyData/
# icj version 1.1 b2 @ 2019-12-27 19:00
....
3%: 78 files 76 MiB @ 37 MiB/sec, 00:02.045
TrainingAndRaces 20480 HASH_CHANGED_DATE_UNCHANGED
8%: 258 files 191 MiB @ 47 MiB/sec, 00:04.045
...
CONTENT-CHANGED FILES for /Volumes/AtticClone_R4/_MasterClone/MyData/Training
TrainingAndRaces
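The reason an unchanged date matters is that a file whose contents differ while its modification time is the same was probably not edited on purpose. Here is a rough sketch of that distinction in Python. It is my own illustration, not IntegrityChecker's code: only the status name HASH_CHANGED_DATE_UNCHANGED comes from the output above, and the other two labels are made up for the example.

```python
import hashlib
import os
import tempfile


def snapshot(path: str) -> tuple[str, int]:
    """Return (SHA-256 hex digest, modification time in ns) for a file."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest, os.stat(path).st_mtime_ns


def classify(recorded: tuple[str, int], current: tuple[str, int]) -> str:
    """Compare a recorded snapshot with the file's current state."""
    old_hash, old_mtime = recorded
    new_hash, new_mtime = current
    if new_hash == old_hash:
        return "OK"                           # label is mine
    if new_mtime == old_mtime:
        return "HASH_CHANGED_DATE_UNCHANGED"  # the status shown in the output above
    return "HASH_CHANGED_DATE_CHANGED"        # label is mine; likely an ordinary edit


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "TrainingAndRaces")
    with open(path, "wb") as f:
        f.write(b"A" * 20480)
    recorded = snapshot(path)

    # Simulate silent corruption: change one byte, then restore the old timestamp.
    with open(path, "r+b") as f:
        f.seek(100)
        f.write(b"B")
    os.utime(path, ns=(recorded[1], recorded[1]))

    status = classify(recorded, snapshot(path))
    print(status)
```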
And the shasum results. It turns out that ALL backups of this file are corrupt. So the behavior is clearly some rsync or similar bug related to extended attributes, a behavior that actually changes the file contents.
diglloyd-iMac:Training lloyd$ shasum /Master/MyData/Training/TrainingAndRaces /Volumes/AtticClone_R4/_MasterClone/MyData/Training/TrainingAndRaces /Volumes/Attic.MasterClone/MyData/Training/TrainingAndRaces /Volumes/EVP_MasterClone/MyData/Training/TrainingAndRaces
cb0e75cd20d6bec6ce82ed5cab730323762107b6 /Master/MyData/Training/TrainingAndRaces
00f7c06b6b4861e0ec947015e0656a412931f994 /Volumes/AtticClone_R4/_MasterClone/MyData/Training/TrainingAndRaces
b86583d32ba2af186b4ea5a68f6743b344b5621b /Volumes/Attic.MasterClone/MyData/Training/TrainingAndRaces
b86583d32ba2af186b4ea5a68f6743b344b5621b /Volumes/EVP_MasterClone/MyData/Training/TrainingAndRaces
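With more than a handful of copies, it helps to group the paths by digest rather than compare them by eye. The sketch below does that with Python's hashlib, using SHA-1 because that is what shasum computes when run without options. The file names in the demo are stand-ins I made up, not the real volumes.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def sha1_of(path: Path) -> str:
    """SHA-1 hex digest of a file, the same digest `shasum` prints by default."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(paths: list[Path]) -> dict[str, list[Path]]:
    """Group paths whose contents hash to the same digest."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha1_of(p)].append(p)
    return dict(groups)


# Demo with stand-in files: one original and three copies.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    contents = {
        "original": b"good data",
        "clone_of_clone": b"bad data 1",
        "clone_a": b"bad data 2",
        "clone_b": b"bad data 2",
    }
    paths = []
    for name, data in contents.items():
        p = tmp / name
        p.write_bytes(data)
        paths.append(p)

    groups = group_by_digest(paths)
    group_names = sorted(sorted(p.name for p in ps) for ps in groups.values())
    for digest, members in groups.items():
        print(digest, [m.name for m in members])
```

The demo mirrors the pattern in the shasum output above: three distinct digests, one for the original, one for the clone of the clone, and one shared by the two remaining copies.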