
Detecting Data Corruption Caused by Bit Rot or Bad Drives or Software Bugs with diglloydTools IntegrityChecker

Data integrity is an increasingly pressing issue in the era of hard drives that passed 14TB over a year ago and are now heading to 20TB.

Error rates in modern hard drives sound like implausible odds, but at today's capacities they fall into the range of real-world concern.

For example, the Toshiba 14TB enterprise hard drives specify “Nonrecoverable Read Errors” as 1 in 10^16 bits read, which equates to 1 bit error per 1250 terabytes read [calculated as 10^16 / 8 / 1000^4]. That’s a tiny chance: you’d have to read about 89 drives of 14TB each, end to end, to expect a single error. Which of course big data centers have to concern themselves with!
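That back-of-envelope arithmetic can be checked in a few lines of Python (the 1-in-10^16 figure is from Toshiba's spec sheet; everything else follows from it):

```python
# Toshiba enterprise spec: 1 nonrecoverable read error per 10^16 bits read
BITS_PER_ERROR = 10**16

# Convert to terabytes (decimal TB, as drive makers measure capacity)
bytes_per_error = BITS_PER_ERROR / 8
tb_per_error = bytes_per_error / 1000**4
print(tb_per_error)        # 1250.0 TB read per expected error

# How many 14TB drives would you have to read end-to-end to expect one error?
drives = tb_per_error / 14
print(round(drives, 1))    # 89.3
```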

But there is no telling what happens to error rates 4+ years out in a drive’s life. Or whether data transfer is always reliable.

Counting on your backups without proving they are good is no way to operate as a professional. For myself, I don’t want my photos or my spreadsheets or anything else going bad today, or next year, or five years from now.

The critical step is to validate your data, both originals and backups, using diglloydTools IntegrityChecker, which can validate data on any media on any platform that supports Java (Mac, Windows, Linux, etc.).
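IntegrityChecker's own file format isn't shown here, but the underlying idea — record a content hash for every file, then recompute and compare later — can be sketched in a few lines. This is an illustrative sketch only (the function names and the choice of SHA-1 are mine, not IntegrityChecker's actual design):

```python
import hashlib
import pathlib

def hash_file(path, algo="sha1", bufsize=1 << 20):
    """Stream the file through a hash so huge files never load into RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Record a content hash for every regular file under root."""
    root = pathlib.Path(root)
    return {str(p.relative_to(root)): hash_file(p)
            for p in root.rglob("*") if p.is_file()}

def verify(root, manifest):
    """Return the files whose current contents no longer match the manifest."""
    return [name for name, digest in manifest.items()
            if hash_file(pathlib.Path(root) / name) != digest]
```

Run `build_manifest` against the original volume, then `verify` against any backup generation: a non-empty result means silent corruption somewhere in the chain.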

Flaky hard drives? Or maybe not.

Which leads me to today’s findings. After 5 days of Thunderbolt 3 hell, I finally had most of my backups made and proceeded to validate one large RAID-4 backup volume with about 13.6TB of data on it. This backup was created using Carbon Copy Cloner, which I have found to be highly reliable. The data in question was a clone of a backup clone—two generations away from the original.

The elegant thing about IntegrityChecker is that validating the Nth-generation copy proves that all copies were valid, at least when they were copied. If a corrupt file pops up, one can validate intermediate copies to see which devices introduced the error, then replace those unreliable devices. More important, the warning lets you know that your backup data is at risk.
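Tracing which generation introduced the error amounts to hashing the same file at each copy depth and finding the first mismatch against the original. A minimal sketch of that idea (paths and function names are illustrative):

```python
import hashlib

def digest(path):
    """Hash a file's full contents (fine for single-file spot checks)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def first_bad_generation(paths):
    """Given [original, clone, clone-of-clone, ...], return the index of the
    first copy whose contents differ from the original, or None if all match."""
    reference = digest(paths[0])
    for i, path in enumerate(paths[1:], start=1):
        if digest(path) != reference:
            return i
    return None
```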

After about 24 hours (it takes a long time to validate 13.6TB), I was disturbed to find that a few files were flagged as having changed.

One file in particular was interesting. When I checked it against the original and the first clone, I found that the 2nd copy (the clone of the clone) had indeed changed. Even more curious, when I inspected it (hexdump -C), the file contents changed across at least two successive reads, indicating that the file could not be read reliably! A group of 10 bytes or so was different on each read. Then things stabilized and the same results were seen each time, but always incorrect (corrupt).
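That unstable-read behavior can be checked without eyeballing hexdump output: read the same file several times and compare digests, since a reliable device must return byte-identical contents every time. A minimal sketch (function names are mine):

```python
import hashlib

def read_digest(path, bufsize=1 << 20):
    """Hash one complete pass over the file."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def stable_reads(path, passes=5):
    """True if every read of the file returns byte-identical contents."""
    digests = {read_digest(path) for _ in range(passes)}
    return len(digests) == 1
```

Note that OS caching can mask the problem: after the first read, later passes may be served from RAM rather than the drive, which would explain contents "stabilizing" on a wrong value.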

Hard drives are supposed to detect bit errors. But were there actually errors? This seems vanishingly unlikely.

Extended attributes can cause a file to be modified?

Investigating further, I found that the file in question ("TrainingAndRaces") had one curious thing going on: 3 extended attributes, as follows:

diglloyd-iMac:Training lloyd$ xattr -l TrainingAndRaces
00000000  58 4C 53 38 58 43 45 4C 01 00 00 00 00 80 00 00  |XLS8XCEL........|
00000010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00000000  69 80 07 5E 00 00 00 00 87 67 AE 1D 00 00 00 00  |i..^.....g......|
00000010 0082;5e07806d;Microsoft Excel;

So now I am thinking that one of the extended attributes can actually cause the file contents to be modified, either during cloning by rsync, or afterwards for unknown reasons.

This is exactly the sort of software risk that makes IntegrityChecker valuable—who knows what weird stuff can corrupt your files in addition to bit rot or other hardware issues?

IntegrityChecker output

Brief excerpt, showing the file flagged as corrupt: its file dates are the same, but the hash of its contents is different, meaning the file has been changed—corrupted. And that is confirmed by doing a shasum of the original and the cloned copies.

diglloyd-iMac:DIGLLOYD lloyd$ icj verify /Volumes/AtticClone_R4/_MasterClone/MyData/
# icj version 1.1 b2 @ 2019-12-27 19:00
3%: 78 files 76 MiB @ 37 MiB/sec, 00:02.045
8%: 258 files 191 MiB @ 47 MiB/sec, 00:04.045
CONTENT-CHANGED FILES for /Volumes/AtticClone_R4/_MasterClone/MyData/Training

And the shasum results. It turns out that ALL backups of this file are corrupt. So the behavior is clearly some rsync or similar bug related to extended attributes, a bug that actually changes the file contents.

diglloyd-iMac:Training lloyd$ shasum /Master/MyData/Training/TrainingAndRaces
cb0e75cd20d6bec6ce82ed5cab730323762107b6  /Master/MyData/Training/TrainingAndRaces
00f7c06b6b4861e0ec947015e0656a412931f994  /Volumes/AtticClone_R4/_MasterClone/MyData/Training/TrainingAndRaces
b86583d32ba2af186b4ea5a68f6743b344b5621b  /Volumes/Attic.MasterClone/MyData/Training/TrainingAndRaces
b86583d32ba2af186b4ea5a68f6743b344b5621b  /Volumes/EVP_MasterClone/MyData/Training/TrainingAndRaces
Copyright © 2020 diglloyd Inc, all rights reserved.