Abstract:
Embodiments detailed herein relate to matrix operations. For example, in some embodiments, an apparatus comprises an instruction decoder to decode a single instruction, the single instruction having fields to indicate an opcode, a first register to store a first source matrix, a second register to store a second source matrix, and a third register to store a 2 by 2 third source matrix, wherein the opcode is to indicate a matrix multiply-accumulate operation; and execution circuitry to perform the matrix multiply-accumulate operation. The matrix multiply-accumulate operation includes: multiplying a value corresponding to a first row and a first column of the first source matrix and a value corresponding to a first row and a first column of the second source matrix to generate a first product, multiplying a value corresponding to the first row and a second column of the first source matrix and a value corresponding to a second row and the first column of the second source matrix to generate a second product, summing the first product, the second product, and an initial value corresponding to an element position in a first row and a first column of the 2 by 2 third source matrix to generate a resulting value corresponding to the element position in a destination matrix, and storing the destination matrix in the third register.
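The per-element arithmetic described above can be sketched in a few lines. This is only an illustration: the function name `matmul_accumulate_2x2` is hypothetical, and it assumes both source matrices are 2 by 2, which the abstract states only for the third source matrix.

```python
def matmul_accumulate_2x2(a, b, c):
    """Return d where d[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j] + c[i][j].

    Assumption: a, b and c are all 2x2 nested lists.
    """
    d = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            # For i == 0 and j == 0 these are the "first product" and
            # "second product" named in the abstract.
            first_product = a[i][0] * b[0][j]
            second_product = a[i][1] * b[1][j]
            # c[i][j] plays the role of the initial value from the third source.
            d[i][j] = first_product + second_product + c[i][j]
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
# Writing the result back over c mirrors storing the destination matrix
# in the third register.
c = matmul_accumulate_2x2(a, b, c)
```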
Abstract:
Embodiments detailed herein relate to matrix operations. For example, in some embodiments, a processor comprises decode circuitry to decode an instruction having fields for an opcode, an identifier for a first source matrix operand, an identifier of a second source matrix operand, and an identifier for a source/destination matrix operand, and execution circuitry to execute the decoded instruction to multiply the identified first source matrix operand by the identified second source matrix operand, add a result of the multiplication to the identified source/destination matrix operand, and store a result of the addition in the identified source/destination matrix operand.
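One way to picture the decode and execute steps is a small software model in which the instruction carries three operand identifiers and the result overwrites the source/destination operand. The opcode string `MATMUL_ACC`, the `MatrixInstruction` class, and the dictionary used as a register file are all hypothetical names chosen for this sketch; they are not taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixInstruction:
    opcode: str   # hypothetical mnemonic, e.g. "MATMUL_ACC"
    src1: str     # identifier of the first source matrix operand
    src2: str     # identifier of the second source matrix operand
    srcdst: str   # identifier of the source/destination matrix operand


def execute(instr, regs):
    """Compute regs[srcdst] += regs[src1] @ regs[src2] on nested lists."""
    if instr.opcode != "MATMUL_ACC":
        raise ValueError(f"unsupported opcode: {instr.opcode}")
    a, b, c = regs[instr.src1], regs[instr.src2], regs[instr.srcdst]
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of the sources do not match")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("source/destination shape does not match the product")
    # Multiply, add to the existing contents, and store back in place.
    regs[instr.srcdst] = [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


regs = {"m0": [[1, 2, 3]], "m1": [[1], [2], [3]], "m2": [[10]]}
execute(MatrixInstruction("MATMUL_ACC", "m0", "m1", "m2"), regs)
```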
Abstract:
An apparatus includes memory circuitry including a first data structure and prefetch circuitry that is coupled to the memory circuitry. The prefetch circuitry is to store, in the first data structure, a first subregion entry corresponding to a first subregion of a memory region allocated to a program. The first subregion entry is to include a plurality of delta values. A first delta value of the plurality of delta values represents a first distance between two cache lines associated with consecutive memory accesses within a second subregion of the memory region. The prefetch circuitry is further to detect a first memory access of a first cache line in the first subregion, identify prefetch candidates based on the first cache line and the plurality of delta values, and issue at least one prefetch request based on at least two of the prefetch candidates to be prefetched into a cache.
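The delta-based scheme can be illustrated with a short software model. The sketch below makes several assumptions that the abstract does not state: deltas are signed differences in cache-line numbers, candidates are formed by chaining the deltas cumulatively from the triggering line, and at most `max_requests` uncached candidates are issued. All function names are hypothetical.

```python
from itertools import accumulate


def learn_deltas(accessed_lines):
    """Distances between cache lines touched by consecutive accesses."""
    return [nxt - cur for cur, nxt in zip(accessed_lines, accessed_lines[1:])]


def prefetch_candidates(trigger_line, deltas):
    """Replay the learned deltas starting from the triggering cache line.

    Assumption: deltas are chained cumulatively, so the pattern observed in
    one subregion is reproduced relative to the new access.
    """
    return [trigger_line + offset for offset in accumulate(deltas)]


def issue_prefetches(candidates, cached_lines, max_requests=2):
    """Select candidates that are not already cached, up to max_requests."""
    requests = [line for line in candidates if line not in cached_lines]
    return requests[:max_requests]


# Pattern observed in one subregion: lines 0, 2, 3, 7.
deltas = learn_deltas([0, 2, 3, 7])
# A later access to line 100 in another subregion triggers prediction.
candidates = prefetch_candidates(100, deltas)
requests = issue_prefetches(candidates, cached_lines={103})
```

A hardware implementation would also bound candidates to the memory region and limit the number of stored deltas; both are omitted here for brevity.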
Abstract:
In an embodiment, a processor includes at least a first core. The first core includes execution logic to execute operations, and a first event counter to determine a first event count associated with events of a first type that have occurred since a start of a first defined interval. The first core also includes a second event counter to determine a second event count associated with events of a second type that have occurred since the start of the first defined interval, and stall logic to stall execution of operations including at least first operations associated with events of the first type, until the first defined interval is expired responsive to the first event count exceeding a first combination threshold concurrently with the second event count exceeding a second combination threshold. Other embodiments are described and claimed.
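The stall condition can be modeled as two counters that are compared against their thresholds together and reset when the interval expires. This is a simplified sketch: the class name, the `"first"`/`"second"` event labels, and the cycle-based interval are assumptions made for illustration.

```python
class CombinationStallLogic:
    """Stall first-type operations when both counts exceed their thresholds."""

    def __init__(self, first_threshold, second_threshold, interval_cycles):
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold
        self.interval_cycles = interval_cycles
        self.first_count = 0
        self.second_count = 0
        self.cycle = 0
        self.stalled = False

    def record(self, event_type):
        """Count an event and re-evaluate the combined condition."""
        if event_type == "first":
            self.first_count += 1
        elif event_type == "second":
            self.second_count += 1
        else:
            raise ValueError(f"unknown event type: {event_type}")
        if (self.first_count > self.first_threshold
                and self.second_count > self.second_threshold):
            self.stalled = True

    def may_issue(self, event_type):
        """First-type operations are held back while the stall is active."""
        return not (self.stalled and event_type == "first")

    def tick(self):
        """Advance one cycle; on interval expiry, reset counts and the stall."""
        self.cycle += 1
        if self.cycle >= self.interval_cycles:
            self.cycle = 0
            self.first_count = 0
            self.second_count = 0
            self.stalled = False
```

Note that exceeding only one threshold does not trigger a stall in this model; both conditions must hold at the same time within the same interval.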
Abstract:
Techniques and mechanisms for efficiently making value prediction information available for use in a processor. In an embodiment, execution of an instruction is to include a loading of some data to a first location (e.g., a first register). A decoder of the processor accesses reference information which indicates that the execution is to comprise multiple micro-operations (µops) including a LoadCheck µop and a Move µop. The LoadCheck µop loads a first value to the first location, and checks whether the loaded first value is the same as a previously-determined second value which represents a prediction of what the first value would be. The Move µop moves the second value to the first location. In another embodiment, the Move µop is scheduled for execution out-of-order with respect to the LoadCheck µop, resulting in an early availability of the second value for access in a register file by another µop.
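The interplay of the two µops can be sketched with a dictionary standing in for the register file and another for memory. The function names and the recovery policy (simply reporting a misprediction to the caller) are assumptions for illustration; the abstract does not describe how a misprediction is handled.

```python
def move_uop(regs, dest, predicted_value):
    """Move µop: place the predicted value in the destination register."""
    regs[dest] = predicted_value


def load_check_uop(regs, memory, dest, address, predicted_value):
    """LoadCheck µop: load the real value and compare it with the prediction.

    Returns True when the prediction was correct. On a mismatch the register
    now holds the loaded value, and the caller is assumed to replay any µops
    that consumed the mispredicted value.
    """
    loaded_value = memory[address]
    regs[dest] = loaded_value
    return loaded_value == predicted_value


memory = {0x40: 7}
regs = {}

# The Move µop runs first (out of order), so a dependent µop can read r1 early.
move_uop(regs, "r1", predicted_value=7)
dependent_result = regs["r1"] + 1

# The LoadCheck µop completes later and validates the prediction.
prediction_ok = load_check_uop(regs, memory, "r1", 0x40, predicted_value=7)
```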
Abstract:
Systems, methods, and apparatuses relating to hardware for split data translation lookaside buffers. In one embodiment, a processor includes a decode circuit to decode instructions into decoded instructions, an execution circuit to execute the decoded instructions, and a memory circuit comprising a load data translation lookaside buffer circuit and a store data translation lookaside buffer circuit separate and distinct from the load data translation lookaside buffer circuit, wherein the memory circuit sends a memory access request of the instructions to the load data translation lookaside buffer circuit when the memory access request is a load data request and to the store data translation lookaside buffer circuit when the memory access request is a store data request to determine a physical address for a virtual address of the memory access request.
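The routing of requests between the two translation lookaside buffers can be sketched as follows. The 4 KiB page size, the class names, and the dictionary-based page table are assumptions for this illustration, and a miss is modeled as a direct page-table lookup rather than a full page walk.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class Tlb:
    """A minimal translation cache keyed by virtual page number."""

    def __init__(self):
        self.entries = {}
        self.lookups = 0

    def translate(self, virtual_address, page_table):
        self.lookups += 1
        page_number, offset = divmod(virtual_address, PAGE_SIZE)
        if page_number not in self.entries:
            # Miss: fill from the page table (page walk simplified away).
            self.entries[page_number] = page_table[page_number]
        return self.entries[page_number] * PAGE_SIZE + offset


class MemoryCircuit:
    """Send loads to the load DTLB and stores to a separate store DTLB."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.load_dtlb = Tlb()
        self.store_dtlb = Tlb()

    def access(self, kind, virtual_address):
        if kind == "load":
            tlb = self.load_dtlb
        elif kind == "store":
            tlb = self.store_dtlb
        else:
            raise ValueError(f"unknown request kind: {kind}")
        return tlb.translate(virtual_address, self.page_table)


circuit = MemoryCircuit(page_table={1: 5})
load_physical = circuit.access("load", PAGE_SIZE + 16)
store_physical = circuit.access("store", PAGE_SIZE)
```

Because each buffer keeps its own entries, a load and a store to the same page each populate their own DTLB in this model.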