摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line. The dataless buffers in such an embodiment being incapable of storing a full cache line and having one or more control bits to indicate state of the respective dataless buffer and an address for an inter-cache transfer line associated with the respective dataless buffer. In such an embodiment, inter-cache transfer logic is to request the inter-cache transfer line from the higher level cache via the data bus and is to further write the inter-cache transfer line into the lower level cache from the data bus.
摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing efficient communication between caches in hierarchical caching design. For example, in one embodiment, such means may include an integrated circuit having a data bus; a lower level cache communicably interfaced with the data bus; a higher level cache communicably interfaced with the data bus; one or more data buffers and one or more dataless buffers. The data buffers in such an embodiment being communicably interfaced with the data bus, and each of the one or more data buffers having a buffer memory to buffer a full cache line, one or more control bits to indicate state of the respective data buffer, and an address associated with the full cache line. The dataless buffers in such an embodiment being incapable of storing a full cache line and having one or more control bits to indicate state of the respective dataless buffer and an address for an inter-cache transfer line associated with the respective dataless buffer. In such an embodiment, inter-cache transfer logic is to request the inter-cache transfer line from the higher level cache via the data bus and is to further write the inter-cache transfer line into the lower level cache from the data bus.
摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
摘要:
In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for cutting senior store latency using store prefetching. For example, in one embodiment, such means may include an integrated circuit or an out of order processor means that processes out of order instructions and enforces in-order requirements for a cache. Such an integrated circuit or out of order processor means further includes means for receiving a store instruction; means for performing address generation and translation for the store instruction to calculate a physical address of the memory to be accessed by the store instruction; and means for executing a pre-fetch for a cache line based on the store instruction and the calculated physical address before the store instruction retires.
摘要:
Methods and apparatus are disclosed for using an index array and finite state machine for scatter/gather operations. Embodiment of apparatus may comprise: decode logic to decode a scatter/gather instruction and generate a set of micro-operations, and an index array to hold a set of indices and a corresponding set of mask elements. A finite state machine facilitates the gather operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. An address is accessed to load a corresponding data element if the mask element had the first value. The data element is written at an in-register position in a destination vector register according to a respective in-register position the index. Values of corresponding mask elements are changed from the first value to a second value responsive to completion of their respective loads.
摘要:
In one embodiment, the present invention includes a method for identifying a termination sequence for an atomic memory operation executed by a first thread, associating a timer with the first thread, and preventing the first thread from execution of a memory cluster operation after completion of the atomic memory operation until a prevention window has passed. This method may be executed by regulation logic associated with a memory execution unit of a processor, in some embodiments. Other embodiments are described and claimed.
摘要:
Methods and apparatus are disclosed using an index array and finite state machine for scatter/gather operations. Embodiment of apparatus may comprise: decode logic to decode scatter/gather instructions and generate micro-operations. An index array holds a set of indices and a corresponding set of mask elements. A finite state machine facilitates the scatter operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. Storage is allocated in a buffer for each of the set of addresses being generated. Data elements corresponding to the set of addresses being generated are copied to the buffer. Addresses from the set are accessed to store data elements if a corresponding mask element has said first value and the mask element is changed to a second value responsive to completion of their respective stores.
摘要:
In one embodiment, the present invention includes a method for identifying a termination sequence for an atomic memory operation executed by a first thread, associating a timer with the first thread, and preventing the first thread from execution of a memory cluster operation after completion of the atomic memory operation until a prevention window has passed. This method may be executed by regulation logic associated with a memory execution unit of a processor, in some embodiments. Other embodiments are described and claimed.
摘要:
Methods and apparatus are disclosed for using an index array and finite state machine for scatter/gather operations. Embodiment of apparatus may comprise: decode logic to decode a scatter/gather instruction and generate a set of micro-operations, and an index array to hold a set of indices and a corresponding set of mask elements. A finite state machine facilitates the gather operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. An address is accessed to load a corresponding data element if the mask element had the first value. The data element is written at an in-register position in a destination vector register according to a respective in-register position the index. Values of corresponding mask elements are changed from the first value to a second value responsive to completion of their respective loads.
摘要:
An apparatus and method are described for efficiently transferring data from a core of a central processing unit (CPU) to a graphics processing unit (GPU). For example, one embodiment of a method comprises: writing data to a buffer within the core of the CPU until a designated amount of data has been written; upon detecting that the designated amount of data has been written, responsively generating an eviction cycle, the eviction cycle causing the data to be transferred from the buffer to a cache accessible by both the core and the GPU; setting an indication to indicate to the GPU that data is available in the cache; and upon the GPU detecting the indication, providing the data to the GPU from the cache upon receipt of a read signal from the GPU.