Abstract:
An apparatus and a method are disclosed for improving the fault tolerance of storage systems by replacing disk drives that are about to fail. The disk drives in a storage system are monitored to identify failing disk drives. A processing unit identifies a failing disk drive and selects a spare disk drive to replace it. The selected spare disk drive is powered on, and data from the failing disk drive is copied to it. A memory unit stores attributes and sensor data for the disk drives in the storage system, which the processing unit uses to identify a failing disk drive. Disk drive attributes are obtained by using SMART (Self-Monitoring, Analysis and Reporting Technology), and sensor data is obtained from environmental sensors such as temperature and vibration sensors.
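The monitor, select, power-on, and copy sequence in this abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `Drive` class, the attribute names, the `LIMITS` thresholds, and the "first available spare" choice are all assumptions made for illustration, and drive contents are modeled as in-memory lists rather than real devices.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical in-memory model of a disk drive (not from the source)."""
    name: str
    powered_on: bool = True
    smart: dict = field(default_factory=dict)    # e.g. {"reallocated_sectors": 3}
    sensors: dict = field(default_factory=dict)  # e.g. {"temperature_c": 41.0}
    blocks: list = field(default_factory=list)   # stand-in for the drive's data


# Assumed limits; the abstract does not give concrete thresholds.
LIMITS = {"reallocated_sectors": 50, "temperature_c": 55.0, "vibration_g": 0.5}


def is_failing(drive, limits=LIMITS):
    """Flag a drive when any stored attribute or sensor reading exceeds its limit."""
    readings = {**drive.smart, **drive.sensors}
    return any(readings.get(key, 0) > limit for key, limit in limits.items())


def replace_failing(drives, spares):
    """Copy each failing drive onto a spare while the failing drive is still readable."""
    replaced = []
    for drive in drives:
        if not is_failing(drive) or not spares:
            continue
        spare = spares.pop(0)             # assumption: take the first available spare
        spare.powered_on = True           # spares are powered on only when needed
        spare.blocks = list(drive.blocks) # copy data before the drive actually fails
        replaced.append((drive.name, spare.name))
    return replaced
```

Because the copy happens before the drive fails, this sketch needs no parity reconstruction; a real system would read the attributes from the drives and sensors rather than from dictionaries.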
Abstract:
Spare disk drive management in a storage system is disclosed. The storage system comprises disk drives and spare disk drives. The spare disk drives are initially kept in a powered-off state. The storage system detects the failure of a disk drive and selects a spare disk drive to replace the failed disk drive on the basis of spare selection criteria. The selected spare disk drive is powered on and replaces the failed disk drive. Data from the failed disk drive can be reconstructed on the spare disk drive by using RAID parity techniques.
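The selection and reconstruction steps in this abstract can be sketched as follows. The abstract does not state the spare selection criteria, so the ones used here (smallest adequate capacity, then fewest power cycles) are purely hypothetical, as are the `Spare` class and all function names. The reconstruction assumes single-parity XOR as used in RAID 5, which is one common RAID parity technique and not necessarily the one the patent has in mind.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Spare:
    """Hypothetical record describing a spare drive (not from the source)."""
    name: str
    capacity_gb: int
    power_cycles: int
    powered_on: bool = False


def select_spare(spares, needed_gb):
    """Pick a powered-off spare that is large enough.

    Assumed criteria: smallest adequate capacity, then fewest power cycles.
    """
    candidates = [s for s in spares
                  if not s.powered_on and s.capacity_gb >= needed_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda s: (s.capacity_gb, s.power_cycles))


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_onto_spare(surviving_stripes, spare):
    """Power on the spare and rebuild the lost block of every stripe.

    Each stripe lists the surviving data blocks plus the parity block;
    XOR-ing them yields the block that was on the failed drive.
    """
    spare.powered_on = True
    return [xor_blocks(stripe) for stripe in surviving_stripes]
```

For example, if a stripe held data blocks `d0` and `d1` with parity `p = xor_blocks([d0, d1])`, losing the drive that held `d1` leaves `[d0, p]`, and XOR-ing those two blocks recovers `d1`.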