Abstract:
A method for handling a fault in a storage system comprises maintaining data in a mass storage subsystem and providing access to the data on behalf of a client. The method further comprises detecting a fault in a volume of data stored in the mass storage subsystem, determining a severity of the fault, and selecting a course of action in response to the fault, based on the severity of the fault.
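The detect, classify, select flow described above can be sketched as follows. This is a minimal illustration only: the severity levels, the thresholds in `determine_severity`, the fault fields (`affected_blocks`, `total_blocks`, `metadata_damaged`), and the action names are all hypothetical, since the abstract does not specify them.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; the abstract does not enumerate them."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Assumed mapping from severity to a course of action (names are illustrative).
ACTIONS = {
    Severity.LOW: "log_and_continue",
    Severity.MEDIUM: "restrict_access_to_affected_data",
    Severity.HIGH: "take_volume_offline",
}


def determine_severity(fault: dict) -> Severity:
    """Classify a detected fault using an assumed rule:
    damaged metadata or a large affected fraction is treated as severe."""
    fraction = fault["affected_blocks"] / fault["total_blocks"]
    if fault.get("metadata_damaged") or fraction >= 0.5:
        return Severity.HIGH
    if fraction >= 0.05:
        return Severity.MEDIUM
    return Severity.LOW


def select_action(fault: dict) -> str:
    """Select a course of action based on the severity of the fault."""
    return ACTIONS[determine_severity(fault)]
```

Keeping classification and action selection as separate steps means the policy table can change without touching the detection logic, which is one plausible reading of why the method determines severity before choosing a response.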
Abstract:
Methods, systems, and apparatus, including computer program products, feature selecting a file in a distributed file system. The file is associated with a time to live derived from a path name for the file. The file is divided into a plurality of chunks that are distributed among a plurality of servers. Each chunk has a respective modification time indicating when the chunk was last modified. The latest modification time among the respective modification times of the plurality of chunks is selected. A determination is made as to whether an elapsed time based on the latest modification time equals or exceeds the time to live. Each of the chunks of the file is deleted in response to the determination. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
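The expiry check described above might look like the sketch below. The abstract says only that the time to live is derived from the path name; the `ttl=<seconds>` path component used here is an assumed encoding, and `Chunk`, `ttl_from_path`, `should_delete`, and `expire` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    server: str   # server holding this chunk
    mtime: float  # last modification time, seconds since epoch


# Assumed convention: a path component such as "ttl=3600" carries the TTL in seconds.
_TTL_RE = re.compile(r"(?:^|/)ttl=(\d+)(?:/|$)")


def ttl_from_path(path: str) -> Optional[int]:
    """Derive the time to live from the path name, or None if absent."""
    match = _TTL_RE.search(path)
    return int(match.group(1)) if match else None


def should_delete(path: str, chunks: List[Chunk], now: float) -> bool:
    """True when the time elapsed since the latest chunk modification
    equals or exceeds the file's time to live."""
    ttl = ttl_from_path(path)
    if ttl is None or not chunks:
        return False
    latest = max(chunk.mtime for chunk in chunks)
    return (now - latest) >= ttl


def expire(path: str, chunks: List[Chunk], now: float) -> List[Chunk]:
    """Return the chunks that remain: none if the file expired, all otherwise."""
    return [] if should_delete(path, chunks, now) else list(chunks)
```

Using the latest modification time across all chunks means a write to any single chunk keeps the whole file alive, so the file is deleted as a unit rather than chunk by chunk.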
Abstract:
Systems and methods of performing lightweight fault monitoring and analysis are described. In certain embodiments, the lightweight fault monitoring and analysis system and method include a crash dump component operable to generate a lightweight core file for a machine without generating a complete core file. The lightweight core file is smaller than a complete core file and contains information relevant for fault monitoring and analysis. The lightweight core file has a data structure portion reflecting the state of only a portion of actual working memory at the time of a problem. It contains both regions in memory specific to the problem encountered and some standard regions.
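One way to picture the region selection described above is the sketch below, which models working memory as a list of named regions and keeps only the standard regions plus those tied to the problem encountered. The region names, the problem types, and the `build_lightweight_core` function are hypothetical; the abstract does not say which regions count as standard or how problems map to regions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    name: str
    data: bytes


# Assumed set of regions that are always captured.
STANDARD_REGIONS = {"registers", "stack"}

# Assumed mapping from problem type to the extra regions relevant to it.
PROBLEM_REGIONS = {
    "heap_corruption": {"heap_metadata"},
    "io_timeout": {"io_queue"},
}


def build_lightweight_core(memory: List[Region], problem: str) -> Dict[str, bytes]:
    """Capture only standard regions plus regions specific to the problem,
    rather than dumping all of working memory."""
    wanted = STANDARD_REGIONS | PROBLEM_REGIONS.get(problem, set())
    return {region.name: region.data for region in memory if region.name in wanted}


# Usage: a toy memory image in which the bulk heap dominates the total size.
memory = [
    Region("registers", bytes(8)),
    Region("stack", bytes(64)),
    Region("heap_metadata", bytes(32)),
    Region("heap", bytes(4096)),
    Region("io_queue", bytes(16)),
]
core = build_lightweight_core(memory, "io_timeout")
```

In this toy image the resulting core holds the `registers`, `stack`, and `io_queue` regions and omits the bulk `heap`, which is what makes it smaller than a complete dump.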