Abstract:
A system to prevent livelock. An outcome of an event is predicted to form an event outcome prediction. The event outcome prediction is compared with a correct value for a datum to be accessed. An instruction is appended with a real event outcome when the outcome of the event is mispredicted to form an appended instruction. A prediction override bit is set on the appended instruction. Then, the appended instruction is executed with the real event outcome.
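The replay behaviour described above can be illustrated with a minimal software sketch. It is not the patented hardware. The `Instruction` fields, the `execute` function, and the `predict` callback are hypothetical names chosen for illustration. The sketch assumes that a mispredicted instruction is simply replayed, and that the override bit makes the replay use the appended real outcome instead of consulting the predictor again.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Instruction:
    name: str
    real_outcome: Optional[bool] = None  # appended after a misprediction
    override: bool = False               # prediction override bit


def execute(instr: Instruction,
            predict: Callable[[Instruction], bool],
            actual_outcome: bool) -> Tuple[bool, int]:
    """Run one instruction and return (outcome used, number of replays)."""
    replays = 0
    while True:
        # With the override bit set, the predictor is bypassed entirely.
        used = instr.real_outcome if instr.override else predict(instr)
        if used == actual_outcome:
            return used, replays
        # Mispredicted: append the real outcome, set the override bit, replay.
        instr.real_outcome = actual_outcome
        instr.override = True
        replays += 1
```

With a predictor that is always wrong for this instruction, the loop would never terminate without the override bit. With it, the instruction completes after a single replay.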
Abstract:
A memory storage structure includes a memory storage device and a first meta-structure having a first size and operating at a first speed. The first speed is faster than a second speed for storing meta-information based on information stored in a memory. A second meta-structure is hierarchically associated with the first meta-structure. The second meta-structure has a second size larger than the first size and operates at the second speed, such that faster and more accurate prefetching is provided by coaction of the first and second meta-structures. A method is provided to assemble the meta-information in the first meta-structure, copy this information to the second meta-structure, and prefetch the stored information from the second meta-structure to the first meta-structure ahead of its use.
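The assemble, copy, and prefetch steps can be sketched in software as a small first level backed by a larger second level. The class name `TwoLevelMeta`, its methods, and the oldest-first eviction policy are assumptions made for illustration. The abstract does not specify how the first meta-structure chooses what to discard.

```python
from collections import OrderedDict


class TwoLevelMeta:
    """Small, fast first level backed by a large, slower second level."""

    def __init__(self, l1_capacity: int):
        self.l1 = OrderedDict()  # first meta-structure (small)
        self.l2 = {}             # second meta-structure (large)
        self.l1_capacity = l1_capacity

    def record(self, key, meta):
        # Assemble meta-information in the first level and copy it down.
        self.l1[key] = meta
        self.l1.move_to_end(key)
        self.l2[key] = meta
        self._evict()

    def prefetch(self, key) -> bool:
        # Bring meta-information back into the first level ahead of use.
        if key in self.l2 and key not in self.l1:
            self.l1[key] = self.l2[key]
            self._evict()
            return True
        return False

    def lookup(self, key):
        # Only the fast first level is consulted on the critical path.
        return self.l1.get(key)

    def _evict(self):
        # Assumed policy: drop the oldest first-level entry when full.
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)
```

An entry evicted from the first level is not lost. It remains in the second level and can be prefetched back before it is needed.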
Abstract:
Systems and methods are disclosed that allow atomic updates to global data to be at least partially eliminated to reduce synchronization overhead in parallel computing. A compiler analyzes the data to be processed to selectively permit unsynchronized data transfer for at least one type of data. A programmer may provide a hint to expressly identify the type of data that are candidates for unsynchronized data transfer. In one embodiment, the synchronization overhead is reducible by generating an application program that selectively substitutes codes for unsynchronized data transfer for a subset of codes for synchronized data transfer. In another embodiment, the synchronization overhead is reducible by employing a combination of software and hardware by using relaxation data registers and decoders that collectively convert a subset of commands for synchronized data transfer into commands for unsynchronized data transfer.
Abstract:
A scheme referred to as a “Region-based cache restoration prefetcher” (RECAP) is employed for cache preloading on a partition or a context switch. The RECAP exploits spatial locality to provide a bandwidth-efficient prefetcher to reduce the “cold” cache effect caused by multiprogrammed virtualization. The RECAP groups cache blocks into coarse-grain regions of memory, and predicts which regions contain useful blocks that should be prefetched the next time the current virtual machine executes. Based on these predictions, and using a simple compression technique that also exploits spatial locality, the RECAP provides a robust prefetcher that improves performance without excessive bandwidth overhead or slowdown.
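One way to picture the region grouping is to record, for each coarse-grain region, a bit vector marking which blocks were resident. The abstract does not say which compression technique RECAP uses, so the per-region bit vector is an assumption. The 64-byte block size, the 4 KiB region size, and the names `compress` and `expand` are also assumptions.

```python
BLOCK_SIZE = 64      # assumed cache block size in bytes
REGION_SIZE = 4096   # assumed region size in bytes
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE


def compress(addresses):
    """Map each touched region to a bit vector of its resident blocks."""
    regions = {}
    for addr in addresses:
        region = addr // REGION_SIZE
        block = (addr % REGION_SIZE) // BLOCK_SIZE
        regions[region] = regions.get(region, 0) | (1 << block)
    return regions


def expand(regions):
    """Turn the saved bit vectors back into block addresses to prefetch."""
    out = []
    for region in sorted(regions):
        mask = regions[region]
        for block in range(BLOCKS_PER_REGION):
            if mask & (1 << block):
                out.append(region * REGION_SIZE + block * BLOCK_SIZE)
    return out
```

When accesses cluster spatially, many block addresses collapse into a few region entries, which is where the bandwidth saving in such a scheme would come from.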
Abstract:
An apparatus, method, and computer program product for improving performance of a parallel computing system. A first hardware local cache controller associated with a first local cache memory device of a first processor detects an occurrence of false sharing of a first cache line by a second processor running the program code, and allows the false sharing of the first cache line by the second processor. The false sharing of the first cache line occurs when the first hardware local cache controller updates a first portion of the first cache line in the first local cache memory device and a second hardware local cache controller subsequently updates a second portion of the first cache line in a second local cache memory device.
Abstract:
A computer system for instruction execution includes a processor having a pipeline. The system is configured to perform a method including: fetching, in the pipeline, a plurality of instructions, wherein the plurality of instructions includes a plurality of branch instructions; for each of the plurality of branch instructions, assigning a branch uncertainty; for each of the plurality of instructions, assigning an instruction uncertainty that is a summation of the branch uncertainties of older unresolved branches; and balancing the instructions in the pipeline based on a current summation of instruction uncertainty.
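The instruction uncertainty defined above is a running sum over older unresolved branches, which the following minimal sketch computes. The tuple layout and the function name `instruction_uncertainties` are hypothetical, and the balancing step itself is not modelled.

```python
def instruction_uncertainties(instrs):
    """Compute per-instruction uncertainty.

    instrs: list of (is_branch, branch_uncertainty, resolved) tuples in
    program order, oldest first.
    """
    total = 0    # sum of uncertainties of older unresolved branches
    result = []
    for is_branch, uncertainty, resolved in instrs:
        # An instruction only inherits uncertainty from branches older than it.
        result.append(total)
        if is_branch and not resolved:
            total += uncertainty
    return result
```

A branch contributes to the uncertainty of younger instructions only, and stops contributing once it is resolved.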
Abstract:
In a decode stage of a hardware processor pipeline, one particular instruction of a plurality of instructions is decoded. It is determined that the particular instruction requires a memory access. Responsive to this determination, it is predicted whether the memory access will result in a cache miss. The predicting in turn includes accessing one of a plurality of entries in a pattern history table stored as a hardware table in the decode stage. The accessing is based, at least in part, upon at least a most recent entry in a global history buffer. The pattern history table stores a plurality of predictions. The global history buffer stores actual results of previous memory accesses, each recorded as either a cache hit or a cache miss. Additional steps include scheduling at least one additional one of the plurality of instructions in accordance with the predicting, and updating the pattern history table and the global history buffer subsequent to actual execution of the particular instruction in an execution stage of the hardware processor pipeline, to reflect whether the predicting was accurate.
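The interaction between the global history buffer and the pattern history table can be sketched as follows. The abstract does not state the form of the stored predictions, so the 2-bit saturating counters are an assumption. The class name `MissPredictor` and the default history length are also assumptions. The scheduling step is not modelled.

```python
class MissPredictor:
    """Pattern history table indexed by a global hit/miss history."""

    def __init__(self, history_bits: int = 4):
        self.history_bits = history_bits
        self.history = 0  # global history buffer: 1 = miss, 0 = hit
        # Assumed entry format: 2-bit counters, starting at "weakly hit".
        self.pht = [1] * (1 << history_bits)

    def predict_miss(self) -> bool:
        # Index the table with the most recent hit/miss outcomes.
        return self.pht[self.history] >= 2

    def update(self, was_miss: bool) -> None:
        # Train the entry that produced the prediction, then shift history.
        idx = self.history
        if was_miss:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(was_miss)) & mask
```

After a few updates on an alternating miss/hit pattern, the table entries selected by each history value settle on the outcome that follows that history.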
Abstract:
A method comprising receiving a branch instruction, decoding a branch address and the branch instruction, executing a branch action associated with the branch address, determining whether a branch associated with the branch action was taken, and saving an identifier of the branch instruction and an indicator that the branch action was taken in a prefetch history table responsive to determining that the branch associated with the branch action was taken.