Abstract:
The present disclosure provides an apparatus comprising an accelerator; a local memory comprising a plurality of stacked dynamic random access memory, DRAM, dies; and a silicon bridge to couple the accelerator to the plurality of stacked DRAM dies, wherein connections between the accelerator and the plurality of stacked DRAM dies run through the silicon bridge. The accelerator comprises a plurality of processing elements to perform processing tasks allocated by an external processor; a cache coherent interface to couple the accelerator to the external processor, the cache coherent interface to ensure that data stored in the local memory and/or an accelerator cache is coherent with data stored in a system memory and caches of the external processor; and logic to map a virtual memory space to heterogeneous forms of physical system memory including the local memory and the system memory, the accelerator and the external processor to both use the virtual memory space to access corresponding portions of the local memory and the system memory.
Abstract:
The present disclosure provides a processor including a processor core. The processor core includes: a decoder to decode at least one instruction native to the processor core; one or more execution units to execute at least one decoded instruction, the at least one decoded instruction corresponding to an acceleration begin instruction, the acceleration begin instruction to indicate a start of a region of code to be offloaded to an accelerator.
Abstract:
A processor includes a decode unit to decode a packed finite impulse response (FIR) filter instruction that indicates one or more source packed data operands, a plurality of FIR filter coefficients, and a destination storage location. The source operand(s) include a first number of data elements and a second number of additional data elements. The second number is one less than a number of FIR filter taps. An execution unit, in response to the packed FIR filter instruction being decoded, is to store a result packed data operand. The result packed data operand includes the first number of FIR filtered data elements, each of which is to be based on a combination of products of the plurality of FIR filter coefficients and a different corresponding set of data elements from the one or more source packed data operands, which is equal in number to the number of FIR filter taps.
Abstract:
The present disclosure provides a method and an apparatus comprising a decoder to decode an enqueue command instruction, and execution circuitry, where execution of the enqueue command instruction causes the execution circuitry to: generate a work descriptor based, at least in part, on data from a source operand of the enqueue command instruction, the work descriptor comprising a plurality of fields including an operation field to specify one or more operations to be performed, a flag to indicate whether the work descriptor can be processed in parallel with one or more other work descriptors, and an address field associated with the one or more operations; and store the work descriptor to a work queue.
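A minimal sketch of the descriptor layout and the enqueue step, assuming hypothetical field names and a plain Python queue in place of the hardware work queue:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class WorkDescriptor:
    operation: str     # operation field: one or more operations to be performed
    parallel_ok: bool  # flag: may be processed in parallel with other descriptors
    address: int       # address field associated with the operation(s)

work_queue = deque()

def enqueue_command(source_operand: dict) -> None:
    """Model of the decoded enqueue command instruction: build a work
    descriptor from the source operand data and store it to the work queue."""
    wd = WorkDescriptor(
        operation=source_operand["operation"],
        parallel_ok=source_operand.get("parallel_ok", False),
        address=source_operand["address"],
    )
    work_queue.append(wd)
```

The parallel flag lets whatever drains the queue dispatch independent descriptors concurrently while serializing the rest.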
Abstract:
First elements of a dense vector to be multiplied with first elements of a first row of a sparse array may be determined. The determined first elements of the dense vector may be written into a memory. A dot product for the first elements of the sparse array and the first elements of the dense vector may be calculated in a plurality of increments by multiplying a subset of the first elements of the sparse array and a corresponding subset of the first elements of the dense vector. A sequence number may be updated after each increment is completed to identify a column number and/or a row number of the sparse array for which the dot product calculations have been completed.
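The incremental computation described above can be sketched in plain Python (hypothetical names; a real implementation would operate on hardware buffers). The sparse row is stored as (column index, value) pairs; only the dense-vector elements that pair with nonzeros are gathered, and after each increment a sequence number records the last column for which the dot product calculations have completed:

```python
def incremental_row_dot(sparse_row, dense_vector, chunk_size):
    """Compute one sparse-row x dense-vector dot product in increments.
    Yields (sequence_number, partial_dot) after each increment, where
    sequence_number is the last completed column index of the sparse row."""
    # Gather the dense elements matching the row's nonzero columns,
    # forming the per-column products up front.
    products = [(col, val * dense_vector[col]) for col, val in sparse_row]
    total = 0.0
    for start in range(0, len(products), chunk_size):
        chunk = products[start:start + chunk_size]  # one increment
        total += sum(p for _, p in chunk)
        sequence_number = chunk[-1][0]  # update after the increment completes
        yield sequence_number, total
```

For example, a row `[(0, 2.0), (3, 1.0), (7, 4.0)]` against the dense vector `[0, 1, ..., 7]` with a chunk size of 2 yields `(3, 3.0)` after the first increment and `(7, 31.0)` after the second.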
Abstract:
Embodiments of an invention for a processor architecture are disclosed. In an embodiment, a processor includes a decoder, an execution unit, a coherent cache, and an interconnect. The decoder is to decode an instruction to zero a cache line. The execution unit is to issue a write command to initiate a cache line sized write of zeros. The coherent cache is to receive the write command, to determine whether there is a hit in the coherent cache and whether a cache coherency protocol state of the hit cache line is a modified state or an exclusive state, to configure a cache line to indicate all zeros, and to issue the write command toward the interconnect. The interconnect is to, responsive to receipt of the write command, issue a snoop to each of a plurality of other coherent caches for which it must be determined if there is a hit.
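One plausible reading of the flow above, sketched in plain Python (hypothetical names; the cache is modeled as a dict of line address to MESI-style state): a hit in Modified or Exclusive state can be zeroed locally without involving other caches, while any other case forwards the write command toward the interconnect, which snoops the other coherent caches.

```python
def handle_zero_line_write(cache, line_addr, interconnect_snoop):
    """Model of the coherent cache's handling of the zero-cache-line
    write command. Returns a string naming the path taken."""
    state = cache.get(line_addr)
    if state in ("M", "E"):
        # Hit in Modified or Exclusive: configure the line to indicate
        # all zeros; the line is now modified relative to memory.
        cache[line_addr] = "M"
        return "zeroed_locally"
    # Miss (or shared state): issue the write command toward the
    # interconnect, which snoops the other coherent caches.
    interconnect_snoop(line_addr)
    return "issued_to_interconnect"
```

The local M/E fast path is what makes the instruction attractive: zeroing a line the core already owns needs no interconnect traffic at all.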
Abstract:
The present disclosure provides an apparatus comprising a silicon interposer, a communication fabric, and an accelerator die comprising a plurality of computing elements to simultaneously perform operations on a plurality of matrix data elements. The apparatus further comprises a plurality of dot-product engines, the plurality of dot-product engines to compute a plurality of dot products on the matrix data elements to generate a plurality of result matrix data elements; a buffer or cache to store a plurality of matrix data elements; a memory controller coupled to the communication fabric; and a stacked DRAM that stacks a plurality of DRAM dies vertically on the silicon interposer substrate coupled to the accelerator die.