Validating Data Transfer/Copy Before Deleting the Source/Original Data: diglloydTools IntegrityChecker
Detecting Data Corruption Caused by Bit Rot or Bad Drives or Software Bugs with diglloydTools IntegrityChecker
There is a straw-man discussion you might find online about bit rot: it focuses narrowly on bit rot and then largely dismisses it as an issue. That misses the more salient factors, namely software errors, hardware errors, and user errors, and the fact that proper data management addresses all of them, bit rot included, in a single solution.
Yesterday I was transitioning to the 16TB OWC Mercury Accelsior 4M2 PCIe SSD as my primary data store. That meant cloning over 12TB or so of data. I have used Carbon Copy Cloner for some years now and found it to be very reliable. But not this time.
This time, I was disturbed that CCC did a poor job of reporting a cloning failure: it had encountered a device timeout and aborted the clone. It then repeatedly hung (or nearly so), with extremely slow I/O rates of a few kilobytes per second. Eventually I gave up and resorted to copying the remaining files with the Finder (usually a bad idea, since the Finder has had many serious bugs when copying files).
Data-Loss Disaster Prevented
Data validation protocol:
1. IntegrityChecker 'update' the source (original) data
2. Clone or copy the data elsewhere
3. IntegrityChecker 'verify' the copied/cloned data
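The protocol above can be sketched in a few lines of code. This is a minimal illustration of the update/verify idea, not IntegrityChecker itself: the function names, the JSON manifest format, and the use of SHA-256 are my assumptions, chosen just to show how hashing the source before a copy lets you detect both missing and corrupted files afterward.

```python
# Sketch of a hash-manifest protocol (assumed design, not IntegrityChecker's):
# 'update' records a SHA-256 hash for every file under the source tree;
# 'verify' re-hashes the copy and reports anything missing or corrupted.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def update(root: Path, manifest: Path) -> None:
    """Step 1: hash the source tree and save the manifest."""
    hashes = {str(p.relative_to(root)): sha256_of(p)
              for p in root.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(hashes, indent=2))

def verify(root: Path, manifest: Path) -> dict:
    """Step 3: re-hash the copy and compare against the manifest."""
    expected = json.loads(manifest.read_text())
    report = {"missing": [], "corrupted": []}
    for rel, digest in expected.items():
        p = root / rel
        if not p.is_file():
            report["missing"].append(rel)    # e.g. an aborted clone
        elif sha256_of(p) != digest:
            report["corrupted"].append(rel)  # content silently changed
    return report
```

Step 2 is whatever copy tool you prefer; the point is that only a verify pass against the pre-copy manifest proves the copy is complete and intact before the source is erased.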
Following my protocol for verifying large file transfers (or backups), I ran diglloydTools IntegrityChecker on the destination volume*. It reported multiple issues:
- About 800GB out of 11.6TB of data was missing. It turns out that the clone had failed. My gripe here is that CCC shows an innocuous icon for this massive failure and fails to flag the task in a visually meaningful way. It is super easy to assume the clone had succeeded and then make a 'fatal' error. IntegrityChecker saved my butt here.
- Two files out of 11.6TB were corrupted on the destination, flagged by IntegrityChecker. A 'diff' confirmed this damage. Again, IntegrityChecker saved my day.
Setting aside the aborted clone and its missing data, how did these files become corrupted? I don’t know and that’s the point—IntegrityChecker cannot diagnose the problem, but it does flag problems, which lets you take remedial action before all is lost.
It’s not about any particular problem, but about detecting problems of any kind, whether hardware or software or user mistake.
UPDATE: trying again, I saw 100% data integrity on the Accelsior and I am now using it as my main storage. Whatever caused the issue here went away, hopefully not to return.
Later, I ran a 2-pass SoftRAID Certify on the OWC Mercury Accelsior 4M2 PCIe SSD and it passed. The original data is fine on the SoftRAID 3-drive RAID-0 stripe. Still, that does not rule out some oddball hardware issue. CCC could be at fault, or it could be something else. There is no way to be sure—and that’s why using IntegrityChecker is essential: it catches the problem regardless of cause.
99%: 400170 files 11600.9 GiB @ 3182 MiB/sec, 01:02:13
Waiting for 31 of 400533 files to finish...
99%: 400533 files 11609.3 GiB @ 3182 MiB/sec, 01:02:16
=================================================================================
2020-07-05 15:15:59 : 27165 folders totaling 11609.3 GiB
# With hash: 400533
# Legacy hash: 364694
# Without hash: 0
# Hashed: 400533
# Missing Files: 0
# Missing Folders: 0
# New Folders: 14
# Changed size: 0
# Changed date: 0
# Changed content + date, size unchanged: 0
# Total files differing: 0
# SUSPICIOUS: 2 same size and date, but content changed = not nice
The following file contents have changed, but file dates and size have not changed. This could indicate data corruption.