摘要:
Methods and system for facilitating a computing platform to recover quickly from transient multi-bit data failures within a run-time data memory array in a manner that is transparent to software applications executing on the computing platform. A fault-tolerant digital computing system is provided for that utilizes parallel processing lanes in a lockstep architecture. Each processing lane includes error detectors that are configured to detect multi-bit data errors in each processing lane's memory arrays. Upon detection of a multi-bit data failure, an interrupt is generated wherein control logic software responds to the interrupt and corrects the data errors in the memory array of each processing lane.
摘要:
Methods and systems are disclosed for the detection and correction of memory errors using code words with a quantity, divisible by 4, of data bits, with an equal quantity of check bits, and having the check bits and data bits interleaved. Upon execution of a memory write instruction, a processor may send a memory word to a check bit generator that generates the check bits before the code word is written to a memory unit. Upon a signal from the processor that a memory read is requested, the memory unit may send a stored code word to a syndrome bit generator to generate a syndrome vector. The syndrome vector may then be sent to a correction bit generator and an uncorrectable error detector. These units may send corrected bits and an uncorrectable error signal, respectively, to the processor.
摘要:
A digital computing system includes a first and second processor clocked for locked step operation. A shared memory stores a linear block codeword across a plurality of byte-wide memory devices. The codeword includes a first dataword and a second dataword. Each of the first and second datawords includes an equal plurality of databits and each includes an equal plurality of checkbits associated therewith. First error detection and correction logic connected to the first processor receives the first dataword and checkbits associated therewith of the codeword addressed by the first processor and a second dataword and checkbits associated therewith of the codeword addressed by the second processor. First error detection and correction logic detects and/or corrects errors in the codeword. Second error detection and correction logic connected to the second processor receives the second dataword and checkbits associated therewith of the codeword addressed by the second processor and the first dataword and checkbits associated therewith of the codeword addressed by the first processor. The second error detection and correction logic detects and/or corrects errors in the codeword.
摘要:
Methods and systems are disclosed for the detection and correction of memory errors using code words with a quantity, divisible by 4, of data bits, with an equal quantity of check bits, and having the check bits and data bits interleaved. Upon execution of a memory write instruction, a processor may send a memory word to a check bit generator that generates the check bits before the code word is written to a memory unit. Upon a signal from the processor that a memory read is requested, the memory unit may send a stored code word to a syndrome bit generator to generate a syndrome vector. The syndrome vector may then be sent to a correction bit generator and an uncorrectable error detector. These units may send corrected bits and an uncorrectable error signal, respectively, to the processor.
摘要:
A highly reliable data processing system using the pair-spare architecture obviates the need for separate memory arrays for each processor. A single memory is shared between each pair of processors wherein a linear block code error detection scheme is implemented with each shared memory, wherein the effect of random memory faults is sufficiently detected such that the inherent fault tolerance of a pair-spare architecture is not compromised.