摘要:
An apparatus and method for efficient matrix alignment in a systolic array. For example, one embodiment of a processor comprises: a first set of physical tile registers to store first matrix data in rows or columns; a second set of physical tile registers to store second matrix data in rows or columns; a decoder to decode a matrix instruction identifying a first input matrix, a first offset, a second input matrix, and a second offset; and execution circuitry, responsive to the matrix instruction, to read a subset of rows or columns from the first set of physical tile registers in accordance with the first offset, spanning multiple physical tile registers from the first set if indicated by the first offset to generate a first input matrix and the execution circuitry to read a subset of rows or columns from the second set of physical tile registers in accordance with the second offset, spanning multiple physical tile registers from the second set if indicated by the second offset to generate a second input matrix; and the execution circuitry to perform an arithmetic operation with the first and second input matrices in accordance with an opcode of the matrix instruction.
摘要:
An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
摘要:
An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.
摘要:
An apparatus and method are described for efficiently transferring data from a producer core to a consumer core within a central processing unit (CPU). For example, one embodiment of a method comprises: A method for transferring a chunk of data from a producer core of a central processing unit (CPU) to consumer core of the CPU, comprising: writing data to a buffer within the producer core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the fill buffer to a cache accessible by both the producer core and the consumer core; and upon the consumer core detecting that data is available in the cache, providing the data to the consumer core from the cache upon receipt of a read signal from the consumer core.
摘要:
Suspendable load address tracking inside transactions is disclosed. An example processing device of implementations of the disclosure includes a transactional memory (TM) read set tracking component circuitry to identify a suspend read tracking instruction within a transaction executed by the processing device, mark load instructions occurring in the transaction subsequent to the identified suspend read tracking instruction with a suspend attribute, wherein the addresses corresponding to the marked load instructions are excluded from a read set maintained for the transaction, identify a resume read tracking instruction within the transaction, and stop marking the load instructions occurring subsequent to the identified resume read tracking instruction with the suspend attribute.
摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
摘要:
In one embodiment, the present invention includes a method for identifying a termination sequence for an atomic memory operation executed by a first thread, associating a timer with the first thread, and preventing the first thread from execution of a memory cluster operation after completion of the atomic memory operation until a prevention window has passed. This method may be executed by regulation logic associated with a memory execution unit of a processor, in some embodiments. Other embodiments are described and claimed.
摘要:
A method and apparatus for extending cache coherency to hold buffered data to support transactional execution is herein described. A transactional store operation referencing an address associated with a data item is performed in a buffered manner. Here, the coherency state associated with cache lines to hold the data item are transitioned to a buffered state. In response to local requests for the buffered data item, the data item is provided to ensure internal transactional sequential ordering. However, in response to external access requests, a miss response is provided to ensure the transactionally updated data item is not made globally visible until commit. Upon commit, the buffered lines are transitioned to a modified state to make the data item globally visible.
摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
摘要:
A method and system to reduce the power consumption of a memory device. In one embodiment of the invention, the memory device is a N-way set-associative level one (L1) cache memory and there is logic coupled with the data cache memory to facilitate access to only part of the N-ways of the N-way set-associative L1 cache memory in response to a load instruction or a store instruction. By reducing the number of ways to access the N-way set-associative L1 cache memory for each load or store request, the power requirements of the N-way set-associative L1 cache memory is reduced in one embodiment of the invention. In one embodiment of the invention, when a prediction is made that the accesses to cache memory only requires the data arrays of the N-way set-associative L1 cache memory, the access to the fill buffers are deactivated or disabled.