摘要:
One embodiment of the present invention provides a system that facilitates self-correcting memory in a shared-memory system. The system includes a main memory coupled to a memory controller for reading and writing memory locations and for marking memory locations that have been checked out to a cache. The system also includes a processor cache for storing data currently in use by a central processing unit. A communication channel is coupled between the processor cache and the memory controller to facilitate communication. The memory controller includes an error detection and correction mechanism and also includes a mechanism for reading data from the processor cache when a currently valid copy of the data is checked out to the processor cache. When the data is returned to the memory subsystem from the cache, the error detection and correction mechanism corrects errors and stores a corrected copy of the data in the main memory.
摘要:
One embodiment of the present invention provides a system that facilitates reliable execution in a computer system by keeping track of write operations to a main memory of the computer system in order to undo the write operations if necessary. This system operates by receiving a write operation directed to the main memory at a memory controller, wherein the write operation includes data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written. Next, the system examines a log bit associated with the write address, wherein the log bit indicates whether an existing value from the write address in main memory has been copied to a checkpoint store. If the log bit is not set, the system creates a new entry for the write address in the checkpoint store; retrieves an existing value from the write address in the main memory; and stores the existing value to the new entry in the checkpoint store. The system then stores the data to be written to write address in the main memory. The system also periodically performs a checkpointing operation, which clears all entries from the checkpoint store.
摘要:
One embodiment of the present invention provides a system that facilitates reliable execution in a computer system by periodically checkpointing write operations to a main memory of the computer system. The system operates by receiving a write operation directed to the main memory at a memory controller, wherein the write operation includes data to be written to the main memory and a write address specifying a location in the main memory into which the data is to be written. Next, the system looks up the write address in a checkpoint store coupled to the memory controller. If the write address is not associated with any entry in the checkpoint store, the system creates an entry for the write address in the checkpoint store, and writes the data to be written to the entry. The system then periodically performs a checkpointing operation, which transfers the data to be written from the checkpoint store to the write address in the main memory.