As an Amazon Associate I earn from qualifying purchases @AMAZON
Understanding Fault Tolerance
Fault tolerance means the ability of a system to continue working correctly and (usually) without interruption when a component fails—the ability to tolerate a failure of some kind. The general concept applies to power supplies, hard drives and SSDs and other system components, but nearly always is used in the context of storage devices (e.g. hard drives or SSDs).
Single drives
In the case of single drives (hard drive or SSD), there is internal fault tolerance: bad blocks are detected and mapped out and spare blocks substituted. This process is invisible to the user, but drives supporting SMART can relay their health status (e.g. how many bad blocks have been mapped out).
There can also be ECC (error correcting code) involved internally, which can detect a limited number of bit errors.
Storage systems
For storage systems, fault tolerance means that volumes associated with one or more drives continue to work even if one of the drives should fail. While the system continues to function, the failed drive must be replaced to restore the as-designed fault tolerant behavior.
When a fault occurs, it is essential to have a cold spare on hand so that the failed drive can be replaced immediately.
Use of fault tolerant forms of RAID generally means lower performance and loss of some storage capacity, yet the performance can still be very high.
RAID-5 | RAID 1+0 | RAID-0 stripe | RAID-1 mirror | RAID SPAN | JBOD | |
---|---|---|---|---|---|---|
# Drives | 4 | 4 | 4 | 4 | 4, concatenated | 4 independent non-RAID |
Capacity: | 12TB | 8TB | 16TB | 4TB | 16TB | 16TB |
Fault tolerance: | single failure OK | single failure OK from each mirror |
single failure kill volume | three failures OK | single failure kills volume | independent |
Read Speed | ~3X | 2X | 4X | 1X | 1X, variable | 1X |
Write Speed | 2.2X - 3X | 2X | 4X | 1X | 1X, variable | 1X |