In 1992, IBM began shipping 3.5-inch hard disk drives that could actually predict their own failure - an industry first. These drives were equipped with Predictive Failure Analysis (PFA), an IBM-developed technology that periodically measures selected drive attributes - things like head-to-disk flying height - and sends a warning message when a predefined threshold is exceeded. Industry acceptance of PFA technology eventually led to SMART (Self-Monitoring, Analysis and Reporting Technology) becoming the industry-standard reliability prediction indicator for both IDE/ATA and SCSI hard disk drives.
There are two kinds of hard disk drive failures: unpredictable and predictable. Unpredictable failures happen quickly, without advance warning. These failures can be caused by static electricity, handling damage, or thermal-related solder problems, and there is nothing that can be done to predict them. Predictable failures, by contrast, develop over time: 60% of drive failures are mechanical, often resulting from the gradual degradation of the drive's performance. The key areas include:
- Heads/head assembly: crack on head, broken head, head contamination, head resonance, bad connection to electronics module, handling damage
- Motors/bearings: motor failure, worn bearing, excessive run out, no spin, handling damage
- Electronic module: circuit/chip failure, bad connection to drive or bus, handling damage
- Media: scratch, defect, retries, bad servo, ECC corrections, handling damage
These failure mechanisms have been well explored over the years, enabling disk drive designers not only to develop more reliable products but also to apply their knowledge to the prediction of device failures. Through research and monitoring of vital functions, performance thresholds that correlate with imminent failure have been determined, and it is these types of failure that SMART attempts to predict.
Just as hard disk drive architecture varies from one manufacturer to another, so SMART-capable drives use a variety of different techniques to monitor data availability. For example, a SMART drive might monitor the fly height of the head above the magnetic media. If the head starts to fly too high or too low, there's a good chance the drive could fail. Other drives may monitor additional or different conditions, such as ECC circuitry on the hard drive card or soft error rates. When impending failure is suspected, the drive sends an alert through the operating system to an application that displays a warning message.
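The monitoring idea described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, units and limits are hypothetical, and real drives keep this logic in firmware with vendor-specific attributes.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One monitored drive attribute (names and limits are hypothetical)."""
    name: str
    value: float  # latest measurement
    low: float    # lower edge of the healthy band
    high: float   # upper edge of the healthy band


def out_of_range(attr):
    """True if the measurement has drifted outside its healthy band."""
    return attr.value < attr.low or attr.value > attr.high


def check_drive(attributes):
    """Return one warning message per attribute outside its healthy band."""
    return [
        f"WARNING: {a.name} = {a.value} outside [{a.low}, {a.high}]"
        for a in attributes
        if out_of_range(a)
    ]


# Illustrative readings only: fly height is healthy, soft error rate is not.
readings = [
    Attribute("fly_height", 12.0, 10.0, 20.0),
    Attribute("soft_error_rate", 35.0, 0.0, 25.0),
]
for message in check_drive(readings):
    print(message)
```

A two-sided band is used here because the text notes that a head flying either too high or too low is a warning sign; an attribute such as an error rate simply uses a lower limit of zero.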
Thermal monitoring is a more recently introduced aspect of SMART, designed to alert the host to potential damage from the drive operating at too high a temperature. In a hard drive, both electronic and mechanical components - such as actuator bearings, spindle motor and voice coil motor - can be affected by excessive temperatures. Possible causes include a clogged cooling fan, a failed room air conditioner or a cooling system that is simply overextended by too many drives or other components. Many SMART implementations use a thermal sensor to detect the environmental conditions that affect drive reliability (including ambient temperature, rate of cooling airflow, voltage and vibration) and issue a user warning when the temperature exceeds a pre-defined threshold, typically in the range 60-65°C.
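As a rough illustration of the thermal check, the sketch below walks a series of temperature readings and reports the first one above the threshold. The function name and the default of 60°C (the lower end of the range quoted above) are assumptions made for the example, not part of any SMART specification.

```python
def first_overheat(readings_c, threshold_c=60.0):
    """Return the index of the first reading above the threshold, or None.

    readings_c:  temperatures in degrees Celsius, in the order sampled.
    threshold_c: warning threshold; 60.0 is an assumed default.
    """
    for index, temp in enumerate(readings_c):
        if temp > threshold_c:
            return index
    return None


# Hypothetical samples taken as a cooling fan gradually clogs.
samples = [41.0, 48.5, 59.9, 61.2, 63.0]
hit = first_overheat(samples)
if hit is not None:
    print(f"Thermal warning at sample {hit}: {samples[hit]} C")
```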
In its brief history, SMART technology has progressed through three distinct iterations. In its original incarnation SMART provided failure prediction by monitoring certain online hard drive activities. A subsequent version improved failure prediction by adding an automatic off-line read scan to monitor additional operations. The latest SMART III technology not only monitors hard drive activities but adds failure prevention by attempting to detect and repair sector errors. Also, whilst earlier versions of the technology only monitored hard drive activity for data that was retrieved by the operating system, SMART III tests all data and all sectors of a drive by using "off-line data collection" to confirm the drive's health during periods of inactivity.
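The off-line scan with repair can be pictured with a small simulation. The text does not say how a drive repairs a bad sector, so the model here, in which each unreadable sector is remapped to one of a limited pool of spares, is an assumption made purely for illustration, as are the function and parameter names.

```python
def offline_scan(sectors, spares):
    """Scan every sector and remap unreadable ones to spare sectors.

    sectors: list of bools, True if the sector reads back cleanly.
    spares:  number of spare sectors available (assumed repair model).
    Returns (repaired, unrepaired) as lists of sector indices.
    """
    repaired, unrepaired = [], []
    for index, readable in enumerate(sectors):
        if readable:
            continue  # healthy sector, nothing to do
        if spares > 0:
            spares -= 1  # consume one spare for the remap
            repaired.append(index)
        else:
            unrepaired.append(index)  # no spares left: report, cannot repair
    return repaired, unrepaired


# Simulated idle-time scan of a five-sector "drive" with two spares.
repaired, unrepaired = offline_scan([True, False, True, False, False], spares=2)
print("repaired:", repaired, "unrepaired:", unrepaired)
```

Because the scan visits every sector rather than only those the operating system happens to read, it can find errors in data that has not been accessed recently.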
Reprinted with permission from PC Tech Guide